Title: | Random Forest of Q Trees |
---|---|
Description: | Data-adaptive method for effect heterogeneity analysis or non-linear causal studies in Mendelian randomization and instrumental variables analysis. |
Authors: | Haodong Tian |
Maintainer: | Haodong Tian <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1.0 |
Built: | 2024-12-25 05:44:01 UTC |
Source: | https://github.com/HDTian/RFQT |
BootstrapTreeFitting
fits the Q-tree for a bootstrap data of the training set (with user-specific seed) and store the cleaned results. If you do not need results but the original (dirty) Q-tree information, you may use GetTree()
.
BootstrapTreeFitting(seed=1, Odat=odat, Vdat=vdat, honest=FALSE, S=5, rate=0.4, SingleM=FALSE, Qthreshold=3.84, method='DR', SoP=10, howGX='SpecificGX', Halve=FALSE, endsize=1000, const=NA)
BootstrapTreeFitting(seed=1, Odat=odat, Vdat=vdat, honest=FALSE, S=5, rate=0.4, SingleM=FALSE, Qthreshold=3.84, method='DR', SoP=10, howGX='SpecificGX', Halve=FALSE, endsize=1000, const=NA)
seed |
seed value for reproducible and traceable results |
Odat |
a data frame. This is recognized as the trainging data set. |
Vdat |
a data frame. This is recognized as the testing or validation data set. |
honest |
Logical value indicates whether to use the honest estimation style or not. Default value is |
S |
a postive integer indicates the maximum tree depth allowed. Default value is 5 (i.e. the data set is allowed to be splitted at most 5 times). |
rate |
a value ranging from 0 to 1 indicates the proportion of the candidate covariates to be randomly cosidered in each split when construcing the Q-tree. |
SingleM |
Logical value indicates whether to use one single covariate variable at all times. Default value is |
Qthreshold |
A positive value used as the threshold value for the Q statistic. Default value is 3.0. One can set |
method |
a character indicates the stratification method used for constructing a Q-tree. There are currently three methods: the Doubly-ranked method ( |
SoP |
a postive integer (>=2) indicates the size of pre-straum. Only applicable for the doubly-ranked stratification. Default value is 10. |
howGX |
a character indicates the way to calculate the instrument-exposure associations. Two ways are currenly allowed: the instrument-exposure associations estimated seperately at each stratum ( |
Halve |
Logical. Indictes whether to split the node by half:half (i.e. 5:5). Default value is |
endsize |
a positive integer value indicates the minimual size of the node of Q-tree. |
const |
a value indicates the fixed instrument-exposure association value used for constructing the Q-tree. Only applicable for |
The training data Odat
and the testing data Vdat
should be of the same form and be a data frame that can be regonized by the functions. The first four columns must be the individual IDs, the (one-dimensional) instrument, the exposure (should be NA
if not available), and the outcome, respectively. The (high-dimensional) covariates are then follwing. Note that all the functions may be working with incorrect column ordering, but the results are misleading. Do make sure the column variable order is correct.
If simulated data is used, there should be a end column named as true_STE. That is, odat$true_STE
is not NULL
.
If the data does not contain the exposure information, the RFQT can still be fitted, where one should use howGX='const'
with a external instrument-exposure association (usually from independent studies).
BootstrapTreeFitting
returns a list with the following components:
end_node_information |
the end nodes (leaves) information of the fitted Q tree, which contains the end-node index, the MR estimates, and the sample size proportion. |
OOB_predict |
the predicted effects of the OOB samples accrding to the Q-tree. Values will be 0 if the individual is not out of bag. |
v_predict |
the predicted effects of the testing samples accrding to the Q-tree. |
vi1 |
the vector of the variable importance (VI) measurements for all the candidate covariates considered. |
vi2 |
the vector of the variable importance (VI) measurements for all the candidate covariates considered. This VI measurements do not need the individual true effect information, therefore suitable for real application data. |
ts1 |
the value of the permutation test with statistic S1. |
ts2 |
the value of the permutation test with statistic S2. |
When the real data is fitted, vi1
is not applicable. When only traning data is available, v_predict
is not applicable.
Burgess, S., Davies, N. M., & Thompson, S. G. (2014). "Instrumental variable analysis with a nonlinear exposure–outcome relationship". Epidemiology (Cambridge, Mass.), 25(6), 877. (Residual stratification)
Tian, H., Mason, A. M., Liu, C., & Burgess, S. (2023). "Relaxing parametric assumptions for non-linear Mendelian randomization using a doubly-ranked stratification method". PLOS Genetics, 19(6), e1010823. (Doubly-ranked stratification)
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5 odat<-res$traning.set #training set vdat<-res$testing.set #testing set #When running RFQT with mutiple Q trees/bootstrap - use parallel computation Nb<-5 #how many trees in the forest? e.g. 5 trees cl<-makeCluster(2)#规定用多少并行clusters/线程 clusterEvalQ(cl=cl , expr=library(dplyr)) clusterEvalQ(cl=cl , expr=library(MendelianRandomization) ) clusterExport( cl=cl , varlist=c( 'odat', 'vdat', 'GetTree', 'GetNindex', 'GetIndex' ) )#or any other arguments RES<-parSapply( cl , 1:Nb, BootstrapTreeFitting ) stopCluster(cl) dim(RES)#7 Nb ##If you wish to use your own parameters rather than the default parameters, try: #general exmaple user_BootstrapTreeFitting<-function(seed){ RES<-BootstrapTreeFitting(seed, honest=my.honest, S=my.S, JJ=my.JJ, #or any partial of the arguments rate=my.rate, Qthreshold=my.Qthreshold, method=my.method, SoP=my.SoP, howGX=my.howGX, endsize=my.endsize) return(RES) } #specific exmaple user_BootstrapTreeFitting<-function(seed){ RES<-BootstrapTreeFitting(seed,SoP=20) return(RES) } ###parallel computation -> RFQT Nb<-5 #how many trees in the forest? cl<-makeCluster(2) clusterEvalQ(cl=cl , expr=library(dplyr)) clusterEvalQ(cl=cl , expr=library(MendelianRandomization) ) clusterExport( cl=cl , varlist=c( 'odat', 'vdat', 'GetTree', 'GetNindex', 'GetIndex' , 'BootstrapTreeFitting') ) RES<-parSapply( cl , 1:Nb, user_BootstrapTreeFitting ) stopCluster(cl)
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5 odat<-res$traning.set #training set vdat<-res$testing.set #testing set #When running RFQT with mutiple Q trees/bootstrap - use parallel computation Nb<-5 #how many trees in the forest? e.g. 5 trees cl<-makeCluster(2)#规定用多少并行clusters/线程 clusterEvalQ(cl=cl , expr=library(dplyr)) clusterEvalQ(cl=cl , expr=library(MendelianRandomization) ) clusterExport( cl=cl , varlist=c( 'odat', 'vdat', 'GetTree', 'GetNindex', 'GetIndex' ) )#or any other arguments RES<-parSapply( cl , 1:Nb, BootstrapTreeFitting ) stopCluster(cl) dim(RES)#7 Nb ##If you wish to use your own parameters rather than the default parameters, try: #general exmaple user_BootstrapTreeFitting<-function(seed){ RES<-BootstrapTreeFitting(seed, honest=my.honest, S=my.S, JJ=my.JJ, #or any partial of the arguments rate=my.rate, Qthreshold=my.Qthreshold, method=my.method, SoP=my.SoP, howGX=my.howGX, endsize=my.endsize) return(RES) } #specific exmaple user_BootstrapTreeFitting<-function(seed){ RES<-BootstrapTreeFitting(seed,SoP=20) return(RES) } ###parallel computation -> RFQT Nb<-5 #how many trees in the forest? cl<-makeCluster(2) clusterEvalQ(cl=cl , expr=library(dplyr)) clusterEvalQ(cl=cl , expr=library(MendelianRandomization) ) clusterExport( cl=cl , varlist=c( 'odat', 'vdat', 'GetTree', 'GetNindex', 'GetIndex' , 'BootstrapTreeFitting') ) RES<-parSapply( cl , 1:Nb, user_BootstrapTreeFitting ) stopCluster(cl)
DTfit
fits a data set (the training data) by the classical Q tree, and return the summary results. DTfit
works similarly as GetTree
and BootstrapTreeFitting
, but not designed or used for future random forest fitting. If you wish to fit data via single tree rather than random forest, use DTfit
.
DTfit(Odat = odat, Vdat = NA, honest = FALSE, S = 5, rate =1, SingleM = FALSE, Qthreshold = 3.84, method = "DR", SoP = 10, howGX = "SpecificGX", Halve = FALSE, endsize = 1000, const = NA)
DTfit(Odat = odat, Vdat = NA, honest = FALSE, S = 5, rate =1, SingleM = FALSE, Qthreshold = 3.84, method = "DR", SoP = 10, howGX = "SpecificGX", Halve = FALSE, endsize = 1000, const = NA)
Odat |
a data frame. This is recognized as the trainging data set. |
Vdat |
a data frame. This is recognized as the testing or validation data set. The default value is |
honest |
logical value indicates whether to use the honest estimation style or not. Default value is |
S |
a postive integer indicates the maximum tree depth allowed. Default value is 5 (i.e. the data set is allowed to be splitted at most 5 times). |
rate |
a value ranging from 0 to 1 indicates the proportion of the candidate covariates to be randomly cosidered in each split when construcing the Q-tree. |
SingleM |
logical value indicates whether to use one single covariate variable at all times. Default value is |
Qthreshold |
a positive value used as the threshold value for the Q statistic. Default value is 3.84. One can set |
method |
a character indicates the stratification method used for constructing a Q-tree. There are currently three methods: the Doubly-ranked method ( |
SoP |
a postive integer (>=2) indicates the size of pre-straum. Only applicable for the doubly-ranked stratification. Default value is 10. |
howGX |
a character indicates the way to calculate the instrument-exposure associations. Two ways are currenly allowed: the instrument-exposure associations estimated seperately at each stratum ( |
Halve |
logical. Indictes whether to split the node by half:half (i.e. 5:5). Default value is |
endsize |
a positive integer value indicates the minimual size of the node of Q-tree. |
const |
a value indicates the fixed instrument-exposure association value used for constructing the Q-tree. Only applicable for |
Like RFQTfit
, DTfit
is an integrated function combining other functions, including getPredict
and getVI
.
The training data Odat
and the testing data Vdat
should be of the same form and be a data frame that can be regonized by the functions. The first four columns must be the individual IDs, the (one-dimensional) instrument, the exposure (should be NA
if not available), and the outcome, respectively. The (high-dimensional) covariates are then follwing. Note that all the functions may be working with incorrect column ordering, but the results are misleading. DO make sure the column variable order is correct.
DTfit
returns a list consisting of the following components:
end_node_information |
the end nodes (leaves) information of the fitted Q tree, which contains the end-node index, the MR estimates, and the sample size proportion. |
v_predicted |
the predicted effects of the testing samples accrding to the Q-tree. |
ts1 |
the value of the permutation test with statistic S1. |
ts2 |
the value of the permutation test with statistic S2. |
MSE |
the MSE results for the testing samples (if both of the testing set and the label exist). |
rdat |
the tree fitted data. |
Burgess, S., Davies, N. M., & Thompson, S. G. (2014). "Instrumental variable analysis with a nonlinear exposure–outcome relationship". Epidemiology (Cambridge, Mass.), 25(6), 877. (Residual stratification)
Tian, H., Mason, A. M., Liu, C., & Burgess, S. (2023). "Relaxing parametric assumptions for non-linear Mendelian randomization using a doubly-ranked stratification method". PLOS Genetics, 19(6), e1010823. (Doubly-ranked stratification)
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the default setting: scenario='A' and SoM=0.5 odat<-res$traning.set #training set vdat<-res$testing.set #testing set DTRES<-DTfit(odat,vdat) DTRES #get (testing set) MSE manually mean( ( DTRES$v_predict - vdat$true_STE )^2 )
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the default setting: scenario='A' and SoM=0.5 odat<-res$traning.set #training set vdat<-res$testing.set #testing set DTRES<-DTfit(odat,vdat) DTRES #get (testing set) MSE manually mean( ( DTRES$v_predict - vdat$true_STE )^2 )
getDat
creats a toy data based on a certain model
getDat(N = 150000, Nt = 100000, Nc = 20, scenario='A', SoM = 0.5, ZXeffect = 0.5, Random = TRUE, label = TRUE, split = TRUE)
getDat(N = 150000, Nt = 100000, Nc = 20, scenario='A', SoM = 0.5, ZXeffect = 0.5, Random = TRUE, label = TRUE, split = TRUE)
N |
a integer indicates the total sample size (training data size + testing data size) |
Nt |
a integer indicates the size of the training data. |
Nc |
a integer indicates the number of candidate variables. |
scenario |
a character indicates the model scenarios data will be simulated from. |
SoM |
a value indicates the strength of modification (i.e. how strong will the covariate modify the treatment effect). Note even if |
ZXeffect |
a value indicates the instrument-exposure effect. |
Random |
logical. Indicates whether to use the random value for strenght modification. Default value is |
label |
logical. Indicates whether to added the final column of the simulated data as the true heterogenerous effect (i.e. label). Default value is |
split |
logical. If |
The data-generating model is the same as the Scenarios in the original paper, where the first 5 covariates are the true effect modifiers. The covariates in even positions (2,4,...) are the common downstream variables of the exposure and te confounders, therefore causing collider bias (when conditioning on these variables).
getDat()
returns a list with three kings of toy data set
whole.data |
the complete whole dataset |
traning.set |
the training set |
testing.set |
the testing set |
SoM |
Strength of Modification |
modifier_vec |
the modification strength vector for each covariate |
Scenario |
scenario information |
the training set and the testing set are of the same form that can be regonized by all the functions. The first four columns are the individual IDs, the instrument, the exposure, and the outcome, respectively. The high-dimensional (dimensions = Nc
) covariates are then follwing. The end column is true_STE
, representing the individal controlled direct treatment effect (see more details in the original paper).
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5 odat<-res$traning.set #training set vdat<-res$testing.set #testing set
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5 odat<-res$traning.set #training set vdat<-res$testing.set #testing set
GetIndex
gives the best covariate index according to the inputed data set. The best covariate is one out of the candidate covariates considered that gives the largest Q statistic value.
GetIndex(dat_current, JJ, rate = 1, SpecificM = NA, method = "DR", SoP = 10, howGX = "SpecificGX", Halve = FALSE, const = NA)
GetIndex(dat_current, JJ, rate = 1, SpecificM = NA, method = "DR", SoP = 10, howGX = "SpecificGX", Halve = FALSE, const = NA)
dat_current |
A data frame. This data frame has the first four columns: individual IDs, the instrument, the expoure, and the outcome with the candidate covariates following. |
JJ |
A positive integer indicates the total number of the candiate covariates. |
rate |
a value ranging from 0 to 1 indicates the proportion of the candidate covariates to be randomly cosidered in each split when construcing the Q-tree. |
SpecificM |
a vector indicates the index of candidate variables that can be considered in spliiting. Tbe default value is |
method |
a character indicates the stratification method used for constructing a Q-tree. There are currently three methods: the Doubly-ranked method ( |
SoP |
a postive integer (>=2) indicates the size of pre-straum. Only applicable for the doubly-ranked stratification. Default value is 10. |
howGX |
a character indicates the way to calculate the instrument-exposure associations. Two ways are currenly allowed: the instrument-exposure associations estimated seperately at each stratum ( |
Halve |
Logical. Indictes whether to split the node by half:half (i.e. 5:5). Default value is |
const |
a value indicates the fixed instrument-exposure association value used for constructing the Q-tree. Only applicable for |
GetIndex
helps to decide the splitting covariate at the present node when constructing a Q-tree.
GetIndex
returns a vector with the following three values:'Candidate.index', 'Q.value' , 'node.size'
Candidate.index |
The index (position rank) of the chosen covariate. |
Q.value |
The corresponding Q statistic value of the chosen covariate. |
minimal.node.size.after.split |
The mimimal sample size of the sun-nodes after splitting. |
split.style |
Three values denoted by 1,2,3 representing the splitting style used for the final covariate. They are 3:7, 5:5, 7:3, respectively. |
Burgess, S., Davies, N. M., & Thompson, S. G. (2014). "Instrumental variable analysis with a nonlinear exposure–outcome relationship". Epidemiology (Cambridge, Mass.), 25(6), 877. (Residual stratification)
Tian, H., Mason, A. M., Liu, C., & Burgess, S. (2022). "Relaxing parametric assumptions for non-linear Mendelian randomization using a doubly-ranked stratification method". bioRxiv, 2022-06. (Doubly-ranked stratification)
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5 odat<-res$traning.set #training set vdat<-res$testing.set #testing set GetIndex(odat,JJ=20)
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5 odat<-res$traning.set #training set vdat<-res$testing.set #testing set GetIndex(odat,JJ=20)
getMSE
calculates the MSE value for RFQT (or a Q-tree if RFQT only contains one single tree)
getMSE(RES, indicator = 1 )
getMSE(RES, indicator = 1 )
RES |
a fitting list result from |
indicator |
numeric indicator to judge which type of MSE should be calculated. |
getMSE
returns a vector of MSE with the increase of the number of Q-trees. The stable MSE values should indicate an appropriate number of Q-trees.
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5 res<-getDat(label=FALSE) odat<-res$traning.set #training set vdat<-res$testing.set #testing set ##When running RFQT with mutiple Q trees/bootstrap - use parallel computation Nb<-7 #how many trees in the forest? e.g. 5 trees cl<-makeCluster(2) clusterEvalQ(cl=cl , expr=library(dplyr)) clusterEvalQ(cl=cl , expr=library(MendelianRandomization) ) clusterExport( cl=cl , varlist=c('odat','vdat', 'GetTree', 'GetNindex', 'GetIndex' ) ) RES<-parSapply( cl , 1:Nb, BootstrapTreeFitting ) stopCluster(cl) dim(RES)#7 Nb getMSE(RES,indicator=1 ) getMSE(RES,indicator=2 )
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5 res<-getDat(label=FALSE) odat<-res$traning.set #training set vdat<-res$testing.set #testing set ##When running RFQT with mutiple Q trees/bootstrap - use parallel computation Nb<-7 #how many trees in the forest? e.g. 5 trees cl<-makeCluster(2) clusterEvalQ(cl=cl , expr=library(dplyr)) clusterEvalQ(cl=cl , expr=library(MendelianRandomization) ) clusterExport( cl=cl , varlist=c('odat','vdat', 'GetTree', 'GetNindex', 'GetIndex' ) ) RES<-parSapply( cl , 1:Nb, BootstrapTreeFitting ) stopCluster(cl) dim(RES)#7 Nb getMSE(RES,indicator=1 ) getMSE(RES,indicator=2 )
GetNindex
helps to match the Q-tree end node for any inputed samples, and therefore enables to make further analysis like effect prediction for these inputed samples.
GetNindex(M, rdat, S = NA)
GetNindex(M, rdat, S = NA)
M |
A matrix or data frame contains the candidate covariate information. |
rdat |
A data frame. It is the Q-tree informaton result and obtained by |
S |
A positive integer indicates the max depth of each sample will explore along the tree. This number should be consistent with the max depth of the fitted Q tree (also refected in |
The inputed data set M
must contain the same candidate covariate information (with the same order) as the training data for the Q-tree. In addition, the first four columns (individual IDs, the instrument, the exposure, the outcome) need to be removed, so that the fist column starts from the first candidate covariate.
GetNindex
returns a vector with the same length of the row number of the inputed data M
. Each element represents the end node (i.e. leaf) index according to the reference Q-tree (the tree information was stored within rdat
).
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5 odat<-res$traning.set #training set vdat<-res$testing.set #testing set rdat<-GetTree(odat) vdat_Nindex<-GetNindex(vdat[,5:24] ,rdat ) #the M colnumber and order should be same as the training data #may use another independent data (estimation data) to #calculate the endnode-specific IV/MR estimates #for example: #let odat halves into trdata (tree data) and estdata (estimation data); #also we can have rdat<-GetTree(trdata) trdat<-odat[1: (nrow(odat)/2) ,]#tree data estdat<-odat[(nrow(odat)/2+1):nrow(odat) ,] #estimaiton data rdat<-GetTree(trdat)#fitted Q tree estdat$Nindex<-GetNindex(estdat[,5:24] ,rdat ) #rdat contains the tree information (i.e. decision rule)
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5 odat<-res$traning.set #training set vdat<-res$testing.set #testing set rdat<-GetTree(odat) vdat_Nindex<-GetNindex(vdat[,5:24] ,rdat ) #the M colnumber and order should be same as the training data #may use another independent data (estimation data) to #calculate the endnode-specific IV/MR estimates #for example: #let odat halves into trdata (tree data) and estdata (estimation data); #also we can have rdat<-GetTree(trdata) trdat<-odat[1: (nrow(odat)/2) ,]#tree data estdat<-odat[(nrow(odat)/2+1):nrow(odat) ,] #estimaiton data rdat<-GetTree(trdat)#fitted Q tree estdat$Nindex<-GetNindex(estdat[,5:24] ,rdat ) #rdat contains the tree information (i.e. decision rule)
getPars
returns the information of the (hyper-)parameters relevant to the Q-tree and RFQT fitting.
getPars(empty.argument)
getPars(empty.argument)
empty.argument |
No arguments are needed. |
getPars()
returns the following information: the total number of candidate covariates, the training data size, the testing data siz, the stratification method, the size of pre-stratum (if applicable), the proportion of the covariates randomly considered in each node split, the maximal tree depth allowed, the way to calculate the instrument-exposure associations, the fixed instrument-exposure association level (if applicable), the minimal node size, the threshold value of the Q statistic.
If the variable is not defined, it will return 'Not defined' with a value in brackets representing the default value for this variable when fitting RFQT.
Note that only the variable defined in the global enviroment will be considered. Even if one defines these variables, the functions like BootstrapTreeFitting
and RFQTfit
will use the default value if (s)he does not modify the function arguments.
See the original paper for the details of each parameter information.
Haodong Tian
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) res<-getDat() odat<-res$traning.set #training set vdat<-res$testing.set #testing set method<-'DR'; SoP<-20; rate<-2/5 getPars()
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) res<-getDat() odat<-res$traning.set #training set vdat<-res$testing.set #testing set method<-'DR'; SoP<-20; rate<-2/5 getPars()
getPredict
calculates the predicted individual treatment effect estimates for RFQT, either for the OOB data or the testing data.
getPredict(RES, indicator = 1 )
getPredict(RES, indicator = 1 )
RES |
a fitting list result from |
indicator |
numeric indicator to judge which type of predicted effect should be presented. |
See the relevant part of the original paper for more details of how to predict an individual effect according to a fitted Q tree (therefore a RFQT).
getPredict
returns a matrix where the row corresponds to the sample size (either the OOB data or the testing data, depending on the indicator used) and the column coresponds to the number of Q-trees. Each column represents the individual predicted effects with the present number of Q trees.
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5 res<-getDat(label=FALSE) odat<-res$traning.set #training set vdat<-res$testing.set #testing set ##When running RFQT with mutiple Q trees/bootstrap - use parallel computation Nb<-7 #how many trees in the forest? e.g. 5 trees cl<-makeCluster(2) clusterEvalQ(cl=cl , expr=library(dplyr)) clusterEvalQ(cl=cl , expr=library(MendelianRandomization) ) clusterExport( cl=cl , varlist=c('odat','vdat', 'GetTree', 'GetNindex', 'GetIndex' ) ) RES<-parSapply( cl , 1:Nb, BootstrapTreeFitting ) stopCluster(cl) dim(RES)#7 Nb predict_matrix<-getPredict(RES,indicator=1 ) predict_matrix<-getPredict(RES,indicator=2 ) dim(predict_matrix) ###View(predict_matrix)
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5 res<-getDat(label=FALSE) odat<-res$traning.set #training set vdat<-res$testing.set #testing set ##When running RFQT with mutiple Q trees/bootstrap - use parallel computation Nb<-7 #how many trees in the forest? e.g. 5 trees cl<-makeCluster(2) clusterEvalQ(cl=cl , expr=library(dplyr)) clusterEvalQ(cl=cl , expr=library(MendelianRandomization) ) clusterExport( cl=cl , varlist=c('odat','vdat', 'GetTree', 'GetNindex', 'GetIndex' ) ) RES<-parSapply( cl , 1:Nb, BootstrapTreeFitting ) stopCluster(cl) dim(RES)#7 Nb predict_matrix<-getPredict(RES,indicator=1 ) predict_matrix<-getPredict(RES,indicator=2 ) dim(predict_matrix) ###View(predict_matrix)
GetTree
fits a data and construct a Q-tree. The fitted data is directly used to build tree (so no bootstrap). The tree information will be stored but not cleaned for further analysis. You may need BootstrapTreeFitting
rather than this function.
GetTree(dat, S = 5, Qthreshold = 3.84, rate = 1, SpecificM = NA, method = "DR", SoP = 10, howGX = "SpecificGX", Halve = FALSE, const = NA, endsize = 1000)
GetTree(dat, S = 5, Qthreshold = 3.84, rate = 1, SpecificM = NA, method = "DR", SoP = 10, howGX = "SpecificGX", Halve = FALSE, const = NA, endsize = 1000)
dat |
a data frame. This data (without any bootstrap) will be fitted to construct a Q-tree. |
S |
a postive integer indicates the maximum tree depth allowed. Default value is 5 (i.e. the data set is allowed to be splitted at most 5 times). S is also conneted to the argument endsize, both of which work in a similar way. One may only restrict one of them and leave another one be a flexible value (e.g. |
Qthreshold |
A positive value used as the threshold value for the Q statistic. Default value is 3.0. |
rate |
a value ranging from 0 to 1 indicates the proportion of the candidate covariates to be randomly cosidered in each split when construcing the Q-tree. When |
SpecificM |
a vector indicates the index of candidate variables that can be considered in spliiting. Tbe default value is |
method |
a character indicates the stratification method used for constructing a Q-tree. There are currently three methods: the Doubly-ranked method ( |
SoP |
a postive integer (>=2) indicates the size of pre-straum. Only applicable for the doubly-ranked stratification. Default value is 10. |
howGX |
a character indicates the way to calculate the instrument-exposure associations. Two ways are currenly allowed: the instrument-exposure associations estimated seperately at each stratum ( |
Halve |
Logical. Indictes whether to split the node by half:half (i.e. 5:5). Default value is |
const |
a value indicates the fixed instrument-exposure association value used for constructing the Q-tree. Only applicable for |
endsize |
a positive integer value indicates the minimual size of the node of Q-tree allowed. One can also control the endsize via |
dat
must be a data frame that can be regonized by the functions. The first four columns must be the individual IDs, the (one-dimensional) instrument, the exposure (should be NA
if not available), and the outcome, respectively. The (high-dimensional) covariates are then follwing. Note that all the functions may be working with incorrect column ordering, but the results are misleading. DO make sure the column variable order is correct.
If simulated data is used, there should be a end column named as true_STE. That is, dat$true_STE
is not NULL
.
GetTree
returns a data set with the same row dimension as the inputed data dat
. The returned data is 'augmented' with tree information including the tree end node index, and the splitting information in each split.
Burgess, S., Davies, N. M., & Thompson, S. G. (2014). "Instrumental variable analysis with a nonlinear exposure–outcome relationship". Epidemiology (Cambridge, Mass.), 25(6), 877. (Residual stratification)
Tian, H., Mason, A. M., Liu, C., & Burgess, S. (2023). "Relaxing parametric assumptions for non-linear Mendelian randomization using a doubly-ranked stratification method". PLOS Genetics, 19(6), e1010823. (Doubly-ranked stratification)
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) res<-getDat() odat<-res$traning.set #training set tree_result<-GetTree(odat)
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) res<-getDat() odat<-res$traning.set #training set tree_result<-GetTree(odat)
getVI
calculates the Varibale Importance (VI) measurements for a random forest of Q-trees (RFQT).
getVI(RES, VItype = 2 )
getVI(RES, VItype = 2 )
RES |
a fitting list result from |
VItype |
numeric indicator to judge which type of VI measurement should be calculated. |
See the VI measurement part of the original paper for more detailed VI introduction and the algorithm.
getVI
retuns a vector indicating the variable importacne (VI) measurements for all the candidate covariates (with the same ordering of the candidate covariates to the original fitting data).
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5 res<-getDat(label=FALSE) odat<-res$traning.set #training set vdat<-res$testing.set #testing set ##When running RFQT with mutiple Q trees/bootstrap - use parallel computation Nb<-7 #how many trees in the forest? e.g. 5 trees cl<-makeCluster(2) clusterEvalQ(cl=cl , expr=library(dplyr)) clusterEvalQ(cl=cl , expr=library(MendelianRandomization) ) clusterExport( cl=cl , varlist=c('odat','vdat', 'GetTree', 'GetNindex', 'GetIndex' ) ) RES<-parSapply( cl , 1:Nb, BootstrapTreeFitting ) stopCluster(cl) dim(RES)#7 Nb getVI(RES,VItype=1 ) getVI(RES,VItype=2 )
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5 res<-getDat(label=FALSE) odat<-res$traning.set #training set vdat<-res$testing.set #testing set ##When running RFQT with mutiple Q trees/bootstrap - use parallel computation Nb<-7 #how many trees in the forest? e.g. 5 trees cl<-makeCluster(2) clusterEvalQ(cl=cl , expr=library(dplyr)) clusterEvalQ(cl=cl , expr=library(MendelianRandomization) ) clusterExport( cl=cl , varlist=c('odat','vdat', 'GetTree', 'GetNindex', 'GetIndex' ) ) RES<-parSapply( cl , 1:Nb, BootstrapTreeFitting ) stopCluster(cl) dim(RES)#7 Nb getVI(RES,VItype=1 ) getVI(RES,VItype=2 )
RFQTfit
fits a data set (the training data) by the random forest of Q trees (RFQT), and return the summary results.
RFQTfit(odat, vdat = NA, Nb = 5, S = 5, honest = FALSE, rate = 0.4, SingleM = FALSE, Qthreshold = 3.84, method = "DR", SoP = 10, howGX = "SpecificGX", Halve = FALSE, endsize = 1000, const = NA, Cores = NA, trackfile=NA )
RFQTfit(odat, vdat = NA, Nb = 5, S = 5, honest = FALSE, rate = 0.4, SingleM = FALSE, Qthreshold = 3.84, method = "DR", SoP = 10, howGX = "SpecificGX", Halve = FALSE, endsize = 1000, const = NA, Cores = NA, trackfile=NA )
odat |
a data frame. This is recognized as the trainging data set. |
vdat |
a data frame. This is recognized as the testing or validation data set. The default value is |
Nb |
a positive integer indicates the number of bootstrap (i.e. the number of Q-trees) |
S |
a postive integer indicates the maximum tree depth allowed. Default value is 5 (i.e. the data set is allowed to be splitted at most 5 times). |
honest |
logical value indicates whether to use the honest estimation style or not. Default value is |
rate |
a value ranging from 0 to 1 indicates the proportion of the candidate covariates to be randomly cosidered in each split when construcing the Q-tree. |
SingleM |
logical value indicates whether to use one single covariate variable at all times. Default value is |
Qthreshold |
a positive value used as the threshold value for the Q statistic. Default value is 3.84. One can set |
method |
a character indicates the stratification method used for constructing a Q-tree. There are currently three methods: the Doubly-ranked method ( |
SoP |
a postive integer (>=2) indicates the size of pre-straum. Only applicable for the doubly-ranked stratification. Default value is 10. |
howGX |
a character indicates the way to calculate the instrument-exposure associations. Two ways are currenly allowed: the instrument-exposure associations estimated seperately at each stratum ( |
Halve |
logical. Indictes whether to split the node by half:half (i.e. 5:5). Default value is |
endsize |
a positive integer value indicates the minimual size of the node of Q-tree. |
const |
a value indicates the fixed instrument-exposure association value used for constructing the Q-tree. Only applicable for |
Cores |
a positive integer indicates the number of cores should be used when running random forest in parallel. Default value is the number of CPU cores minus one. You can check the number of CPU cores via |
trackfile |
the path string indicates the file to store the running information in parallel. |
RFQTfit
is an integrated function combining other functions, including GetTree
, BootstrapTreeFitting
, getPredict
, getVI
, etc. The fitting automatically uses parallel computation.
The training data Odat
and the testing data Vdat
should be of the same form and be a data frame that can be regonized by the functions. The first four columns must be the individual IDs, the (one-dimensional) instrument, the exposure (should be NA
if not available), and the outcome, respectively. The (high-dimensional) covariates are then follwing. Note that all the functions may be working with incorrect column ordering, but the results are misleading. DO make sure the column variable order is correct.
RFQTfit
shows all the parameter information used for random forest fitting in the begining, and returns in the end a list consisting of the following components:
RES |
The direct fitting results of each Q-tree, which contains |
MSE_OOB |
the |
MSE_test |
the |
Predict_OOB |
the matrix of predicted effects for OOB samples, where the row corresponds to the sample size (the testing data) and the column coresponds to the number of Q-trees. Each column represents the individual predicted effects with the present number of Q trees. If one determine the number of Q-trees to be |
Predict_test |
the matrix of predicted effects for testing set, where the row corresponds to the sample size (the testing data) and the column coresponds to the number of Q-trees. Each column represents the individual predicted effects with the present number of Q trees. If one determine the number of Q-trees to be |
VI1 |
the vector indicates the type I (i.e. with labels) variable importacne (VI) measurements for all the candidate covariates (with the same ordering of the candidate covariates to the original fitting data). |
VI2 |
the vector indicates the type II (i.e. without labels) variable importacne (VI) measurements for all the candidate covariates (with the same ordering of the candidate covariates to the original fitting data). |
TS |
the vector of the permutation test statistic values. The two values represent the values of TS1 and TS2, respectively. |
Burgess, S., Davies, N. M., & Thompson, S. G. (2014). "Instrumental variable analysis with a nonlinear exposure–outcome relationship". Epidemiology (Cambridge, Mass.), 25(6), 877. (Residual stratification)
Tian, H., Mason, A. M., Liu, C., & Burgess, S. (2023). "Relaxing parametric assumptions for non-linear Mendelian randomization using a doubly-ranked stratification method". PLOS Genetics, 19(6), e1010823. (Doubly-ranked stratification)
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5 odat<-res$traning.set vdat<-res$testing.set ###try: ALLRES<-RFQTfit(odat) ###try: ALLRES<-RFQTfit(odat,vdat,Nb=200)
library(MendelianRandomization) library(tidyverse) library(data.table) library(parallel) set.seed(60) res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5 odat<-res$traning.set vdat<-res$testing.set ###try: ALLRES<-RFQTfit(odat) ###try: ALLRES<-RFQTfit(odat,vdat,Nb=200)