Package 'RFQT' reference manual

Title:	Random Forest of Q Trees
Description:	Data-adaptive method for effect heterogeneity analysis or non-linear causal studies in Mendelian randomization and instrumental variables analysis.
Authors:	Haodong Tian
Maintainer:	Haodong Tian <[email protected]>
License:	GPL (>= 2)
Version:	0.1.0
Built:	2025-02-23 06:04:07 UTC
Source:	https://github.com/HDTian/RFQT

One Bootstrap Q Tree Fitting

Description

BootstrapTreeFitting fits the Q-tree for a bootstrap data of the training set (with user-specific seed) and store the cleaned results. If you do not need results but the original (dirty) Q-tree information, you may use GetTree().

Usage

BootstrapTreeFitting(seed=1,
                     Odat=odat,
                     Vdat=vdat,
                     honest=FALSE,
                     S=5,
                     rate=0.4,
                     SingleM=FALSE,
                     Qthreshold=3.84,
                     method='DR',
                     SoP=10,
                     howGX='SpecificGX',
                     Halve=FALSE,
                     endsize=1000,
                     const=NA)
BootstrapTreeFitting(seed=1,
                     Odat=odat,
                     Vdat=vdat,
                     honest=FALSE,
                     S=5,
                     rate=0.4,
                     SingleM=FALSE,
                     Qthreshold=3.84,
                     method='DR',
                     SoP=10,
                     howGX='SpecificGX',
                     Halve=FALSE,
                     endsize=1000,
                     const=NA)

Arguments

`seed`	seed value for reproducible and traceable results
`Odat`	a data frame. This is recognized as the trainging data set.
`Vdat`	a data frame. This is recognized as the testing or validation data set.
`honest`	Logical value indicates whether to use the honest estimation style or not. Default value is `FALSE`. When honest estimation is used, the tree construction and the leaf-specific estimates will be obtained in two seperate samples of the (boostrapped) training data.
`S`	a postive integer indicates the maximum tree depth allowed. Default value is 5 (i.e. the data set is allowed to be splitted at most 5 times). `S` is also conneted to the argument `endsize`, both of which work in a similar way. One may only restrict one of them and leave another one be a flexible value (e.g. `S=100` or `endsize=0`).
`rate`	a value ranging from 0 to 1 indicates the proportion of the candidate covariates to be randomly cosidered in each split when construcing the Q-tree.
`SingleM`	Logical value indicates whether to use one single covariate variable at all times. Default value is `FALSE`. If `SingleM=TRUE`, the covariate used are the best variable indicated by `GetIndex`.
`Qthreshold`	A positive value used as the threshold value for the Q statistic. Default value is 3.0. One can set `Qthreshold=0` to ignore the Q-related stopping rule.
`method`	a character indicates the stratification method used for constructing a Q-tree. There are currently three methods: the Doubly-ranked method (`method='DR'`), the residual methods (`method='Residual'`) and the naive method (recognized by any character except `'DR'` and `'Residual'`). The default method is `'DR'`.
`SoP`	a postive integer (>=2) indicates the size of pre-straum. Only applicable for the doubly-ranked stratification. Default value is 10.
`howGX`	a character indicates the way to calculate the instrument-exposure associations. Two ways are currenly allowed: the instrument-exposure associations estimated seperately at each stratum (`howGX='SpecificGX'`), or use a fixed instrument-exposure association (`howGX='const'`)
`Halve`	Logical. Indictes whether to split the node by half:half (i.e. 5:5). Default value is `FALSE`.
`endsize`	a positive integer value indicates the minimual size of the node of Q-tree. `S` and `endsize` work in a similar way.
`const`	a value indicates the fixed instrument-exposure association value used for constructing the Q-tree. Only applicable for `howGX='SpecificGX'`.

Details

The training data Odat and the testing data Vdat should be of the same form and be a data frame that can be regonized by the functions. The first four columns must be the individual IDs, the (one-dimensional) instrument, the exposure (should be NA if not available), and the outcome, respectively. The (high-dimensional) covariates are then follwing. Note that all the functions may be working with incorrect column ordering, but the results are misleading. Do make sure the column variable order is correct.

If simulated data is used, there should be a end column named as true_STE. That is, odat$true_STE is not NULL.

If the data does not contain the exposure information, the RFQT can still be fitted, where one should use howGX='const' with a external instrument-exposure association (usually from independent studies).

Value

BootstrapTreeFitting returns a list with the following components:

`end_node_information`	the end nodes (leaves) information of the fitted Q tree, which contains the end-node index, the MR estimates, and the sample size proportion.
`OOB_predict`	the predicted effects of the OOB samples accrding to the Q-tree. Values will be 0 if the individual is not out of bag.
`v_predict`	the predicted effects of the testing samples accrding to the Q-tree.
`vi1`	the vector of the variable importance (VI) measurements for all the candidate covariates considered.
`vi2`	the vector of the variable importance (VI) measurements for all the candidate covariates considered. This VI measurements do not need the individual true effect information, therefore suitable for real application data.
`ts1`	the value of the permutation test with statistic S1.
`ts2`	the value of the permutation test with statistic S2.

When the real data is fitted, vi1 is not applicable. When only traning data is available, v_predict is not applicable.

References

Burgess, S., Davies, N. M., & Thompson, S. G. (2014). "Instrumental variable analysis with a nonlinear exposure–outcome relationship". Epidemiology (Cambridge, Mass.), 25(6), 877. (Residual stratification)

Tian, H., Mason, A. M., Liu, C., & Burgess, S. (2023). "Relaxing parametric assumptions for non-linear Mendelian randomization using a doubly-ranked stratification method". PLOS Genetics, 19(6), e1010823. (Doubly-ranked stratification)

Examples

library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

set.seed(60)
res<-getDat() #simulated data  #the deflaut setting: scenario='A' and SoM=0.5
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set


#When running RFQT with mutiple Q trees/bootstrap - use parallel computation
Nb<-5 #how many trees in the forest? e.g. 5 trees
cl<-makeCluster(2)#规定用多少并行clusters/线程
clusterEvalQ(cl=cl , expr=library(dplyr))  
clusterEvalQ(cl=cl , expr=library(MendelianRandomization) )
clusterExport(  cl=cl ,  varlist=c( 'odat', 'vdat',
                                    'GetTree', 'GetNindex', 'GetIndex' )  )#or any other arguments
RES<-parSapply(   cl ,  1:Nb, BootstrapTreeFitting  ) 
stopCluster(cl)
dim(RES)#7 Nb
##If you wish to use your own parameters rather than the default parameters, try:
#general exmaple
user_BootstrapTreeFitting<-function(seed){
  RES<-BootstrapTreeFitting(seed,
                            honest=my.honest,
                            S=my.S,
                            JJ=my.JJ,   #or any partial of the arguments
                            rate=my.rate,
                            Qthreshold=my.Qthreshold,
                            method=my.method,
                            SoP=my.SoP,
                            howGX=my.howGX,
                            endsize=my.endsize)
  return(RES)
}
#specific exmaple
user_BootstrapTreeFitting<-function(seed){
  RES<-BootstrapTreeFitting(seed,SoP=20)
  return(RES)
}
###parallel computation  -> RFQT
Nb<-5 #how many trees in the forest?
cl<-makeCluster(2)
clusterEvalQ(cl=cl , expr=library(dplyr)) 
clusterEvalQ(cl=cl , expr=library(MendelianRandomization) )
clusterExport(  cl=cl ,  varlist=c( 'odat', 'vdat',
                                    'GetTree', 'GetNindex', 'GetIndex' , 'BootstrapTreeFitting')  )
RES<-parSapply(   cl ,  1:Nb, user_BootstrapTreeFitting  )
stopCluster(cl)

library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

set.seed(60)
res<-getDat() #simulated data  #the deflaut setting: scenario='A' and SoM=0.5
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set


#When running RFQT with mutiple Q trees/bootstrap - use parallel computation
Nb<-5 #how many trees in the forest? e.g. 5 trees
cl<-makeCluster(2)#规定用多少并行clusters/线程
clusterEvalQ(cl=cl , expr=library(dplyr))  
clusterEvalQ(cl=cl , expr=library(MendelianRandomization) )
clusterExport(  cl=cl ,  varlist=c( 'odat', 'vdat',
                                    'GetTree', 'GetNindex', 'GetIndex' )  )#or any other arguments
RES<-parSapply(   cl ,  1:Nb, BootstrapTreeFitting  ) 
stopCluster(cl)
dim(RES)#7 Nb
##If you wish to use your own parameters rather than the default parameters, try:
#general exmaple
user_BootstrapTreeFitting<-function(seed){
  RES<-BootstrapTreeFitting(seed,
                            honest=my.honest,
                            S=my.S,
                            JJ=my.JJ,   #or any partial of the arguments
                            rate=my.rate,
                            Qthreshold=my.Qthreshold,
                            method=my.method,
                            SoP=my.SoP,
                            howGX=my.howGX,
                            endsize=my.endsize)
  return(RES)
}
#specific exmaple
user_BootstrapTreeFitting<-function(seed){
  RES<-BootstrapTreeFitting(seed,SoP=20)
  return(RES)
}
###parallel computation  -> RFQT
Nb<-5 #how many trees in the forest?
cl<-makeCluster(2)
clusterEvalQ(cl=cl , expr=library(dplyr)) 
clusterEvalQ(cl=cl , expr=library(MendelianRandomization) )
clusterExport(  cl=cl ,  varlist=c( 'odat', 'vdat',
                                    'GetTree', 'GetNindex', 'GetIndex' , 'BootstrapTreeFitting')  )
RES<-parSapply(   cl ,  1:Nb, user_BootstrapTreeFitting  )
stopCluster(cl)

Fit a Classical Single Decision Tree (DT)

Description

DTfit fits a data set (the training data) by the classical Q tree, and return the summary results. DTfit works similarly as GetTree and BootstrapTreeFitting, but not designed or used for future random forest fitting. If you wish to fit data via single tree rather than random forest, use DTfit.

Usage

DTfit(Odat = odat, 
      Vdat = NA, 
      honest = FALSE, 
      S = 5, 
      rate =1, 
      SingleM = FALSE, 
      Qthreshold = 3.84, 
      method = "DR",
      SoP = 10, 
      howGX = "SpecificGX", 
      Halve = FALSE, 
      endsize = 1000, 
      const = NA)
DTfit(Odat = odat, 
      Vdat = NA, 
      honest = FALSE, 
      S = 5, 
      rate =1, 
      SingleM = FALSE, 
      Qthreshold = 3.84, 
      method = "DR",
      SoP = 10, 
      howGX = "SpecificGX", 
      Halve = FALSE, 
      endsize = 1000, 
      const = NA)

Arguments

`Odat`	a data frame. This is recognized as the trainging data set.
`Vdat`	a data frame. This is recognized as the testing or validation data set. The default value is `NA`, indicating that no testing data will be considered.
`honest`	logical value indicates whether to use the honest estimation style or not. Default value is `FALSE`. When honest estimation is used, the tree construction and the leaf-specific estimates will be obtained in two seperate samples of the (boostrapped) training data.
`S`	a postive integer indicates the maximum tree depth allowed. Default value is 5 (i.e. the data set is allowed to be splitted at most 5 times). `S` is also conneted to the argument `endsize`, both of which work in a similar way. One may only restrict one of them and leave another one be a flexible value (e.g. `S=100` or `endsize=0`).
`rate`	a value ranging from 0 to 1 indicates the proportion of the candidate covariates to be randomly cosidered in each split when construcing the Q-tree.
`SingleM`	logical value indicates whether to use one single covariate variable at all times. Default value is `FALSE`. If `SingleM=TRUE`, the covariate used are the best variable indicated by `GetIndex`.
`Qthreshold`	a positive value used as the threshold value for the Q statistic. Default value is 3.84. One can set `Qthreshold=0` to ignore the Q-related stopping rule.
`method`	a character indicates the stratification method used for constructing a Q-tree. There are currently three methods: the Doubly-ranked method (`method='DR'`), the residual methods (`method='Residual'`) and the naive method (recognized by any character except `'DR'` and `'Residual'`). The default method is `'DR'`.
`SoP`	a postive integer (>=2) indicates the size of pre-straum. Only applicable for the doubly-ranked stratification. Default value is 10.
`howGX`	a character indicates the way to calculate the instrument-exposure associations. Two ways are currenly allowed: the instrument-exposure associations estimated seperately at each stratum (`howGX='SpecificGX'`), or use a fixed instrument-exposure association (`howGX='const'`).
`Halve`	logical. Indictes whether to split the node by half:half (i.e. 5:5). Default value is `FALSE`.
`endsize`	a positive integer value indicates the minimual size of the node of Q-tree. `S` and `endsize` work in a similar way.
`const`	a value indicates the fixed instrument-exposure association value used for constructing the Q-tree. Only applicable for `howGX='SpecificGX'`.

Details

Like RFQTfit, DTfit is an integrated function combining other functions, including getPredict and getVI.

Value

DTfit returns a list consisting of the following components:

`end_node_information`	the end nodes (leaves) information of the fitted Q tree, which contains the end-node index, the MR estimates, and the sample size proportion.
`v_predicted`	the predicted effects of the testing samples accrding to the Q-tree.
`ts1`	the value of the permutation test with statistic S1.
`ts2`	the value of the permutation test with statistic S2.
`MSE`	the MSE results for the testing samples (if both of the testing set and the label exist).
`rdat`	the tree fitted data.

References

Examples

library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

set.seed(60)
res<-getDat() #simulated data  #the default setting: scenario='A' and SoM=0.5
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set
 
DTRES<-DTfit(odat,vdat)
 
DTRES
 
#get (testing set) MSE manually
mean( ( DTRES$v_predict - vdat$true_STE   )^2   )


library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

set.seed(60)
res<-getDat() #simulated data  #the default setting: scenario='A' and SoM=0.5
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set
 
DTRES<-DTfit(odat,vdat)
 
DTRES
 
#get (testing set) MSE manually
mean( ( DTRES$v_predict - vdat$true_STE   )^2   )

Creat a Simulated Data

Description

getDat creats a toy data based on a certain model

Usage

getDat(N = 150000,
      Nt = 100000,
      Nc = 20,
      scenario='A',
      SoM = 0.5,
      ZXeffect = 0.5,
      Random = TRUE,
      label = TRUE,
      split = TRUE)
getDat(N = 150000,
      Nt = 100000,
      Nc = 20,
      scenario='A',
      SoM = 0.5,
      ZXeffect = 0.5,
      Random = TRUE,
      label = TRUE,
      split = TRUE)

Arguments

`N`	a integer indicates the total sample size (training data size + testing data size)
`Nt`	a integer indicates the size of the training data.
`Nc`	a integer indicates the number of candidate variables.
`scenario`	a character indicates the model scenarios data will be simulated from. `'A'` refers to scenario where no colliders. `'B'` refers to the scenario that half covariates will be colliders. `'C'` refers to more complicated collider scenarios.
`SoM`	a value indicates the strength of modification (i.e. how strong will the covariate modify the treatment effect). Note even if `SoM=0`, there still exists weak modification.
`ZXeffect`	a value indicates the instrument-exposure effect.
`Random`	logical. Indicates whether to use the random value for strenght modification. Default value is `TRUE`, where weak modification always exists even if `SoM=0` due to randomness.
`label`	logical. Indicates whether to added the final column of the simulated data as the true heterogenerous effect (i.e. label). Default value is `TRUE`.
`split`	logical. If `FALSE`, only one data set is returened. The default is `TRUE`, the data will be splitted into the training data and testing data.

Details

The data-generating model is the same as the Scenarios in the original paper, where the first 5 covariates are the true effect modifiers. The covariates in even positions (2,4,...) are the common downstream variables of the exposure and te confounders, therefore causing collider bias (when conditioning on these variables).

Value

getDat() returns a list with three kings of toy data set

`whole.data`	the complete whole dataset
`traning.set`	the training set
`testing.set`	the testing set
`SoM`	Strength of Modification
`modifier_vec`	the modification strength vector for each covariate
`Scenario`	scenario information

the training set and the testing set are of the same form that can be regonized by all the functions. The first four columns are the individual IDs, the instrument, the exposure, and the outcome, respectively. The high-dimensional (dimensions = Nc) covariates are then follwing. The end column is true_STE, representing the individal controlled direct treatment effect (see more details in the original paper).

Examples

library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)


set.seed(60)
res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set
library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)


set.seed(60)
res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set

Get the Best Covariate Index

Description

GetIndex gives the best covariate index according to the inputed data set. The best covariate is one out of the candidate covariates considered that gives the largest Q statistic value.

Usage

GetIndex(dat_current,
          JJ,
          rate = 1,
          SpecificM = NA,
          method = "DR",
          SoP = 10,
          howGX = "SpecificGX",
          Halve = FALSE,
          const = NA)
GetIndex(dat_current,
          JJ,
          rate = 1,
          SpecificM = NA,
          method = "DR",
          SoP = 10,
          howGX = "SpecificGX",
          Halve = FALSE,
          const = NA)

Arguments

`dat_current`	A data frame. This data frame has the first four columns: individual IDs, the instrument, the expoure, and the outcome with the candidate covariates following.
`JJ`	A positive integer indicates the total number of the candiate covariates.
`rate`	a value ranging from 0 to 1 indicates the proportion of the candidate covariates to be randomly cosidered in each split when construcing the Q-tree.
`SpecificM`	a vector indicates the index of candidate variables that can be considered in spliiting. Tbe default value is `NA`, which means all candidate variables are possible to be considered.
`method`	a character indicates the stratification method used for constructing a Q-tree. There are currently three methods: the Doubly-ranked method (`method='DR'`), the residual methods (`method='Residual'`) and the naive method (recognized by any character except `'DR'` and `'Residual'`). The default method is `'DR'`.
`SoP`	a postive integer (>=2) indicates the size of pre-straum. Only applicable for the doubly-ranked stratification. Default value is 10.
`howGX`	a character indicates the way to calculate the instrument-exposure associations. Two ways are currenly allowed: the instrument-exposure associations estimated seperately at each stratum (`howGX='SpecificGX'`), or use a fixed instrument-exposure association (`howGX='const')`
`Halve`	Logical. Indictes whether to split the node by half:half (i.e. 5:5). Default value is `FALSE`.
`const`	a value indicates the fixed instrument-exposure association value used for constructing the Q-tree. Only applicable for `howGX='SpecificGX'`.

Details

GetIndex helps to decide the splitting covariate at the present node when constructing a Q-tree.

Value

GetIndex returns a vector with the following three values:'Candidate.index', 'Q.value' , 'node.size'

`Candidate.index`	The index (position rank) of the chosen covariate.
`Q.value`	The corresponding Q statistic value of the chosen covariate.
`minimal.node.size.after.split`	The mimimal sample size of the sun-nodes after splitting.
`split.style`	Three values denoted by 1,2,3 representing the splitting style used for the final covariate. They are 3:7, 5:5, 7:3, respectively.

References

Tian, H., Mason, A. M., Liu, C., & Burgess, S. (2022). "Relaxing parametric assumptions for non-linear Mendelian randomization using a doubly-ranked stratification method". bioRxiv, 2022-06. (Doubly-ranked stratification)

Examples

library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

set.seed(60)
res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set
GetIndex(odat,JJ=20)
library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

set.seed(60)
res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set
GetIndex(odat,JJ=20)

Get MSE Values for the OOB Data or the Testing Data with Fitted Q-Tree or Random Forest of Q-Trees (RFQT)

Description

getMSE calculates the MSE value for RFQT (or a Q-tree if RFQT only contains one single tree)

Usage

getMSE(RES, 
       indicator = 1
       )
getMSE(RES, 
       indicator = 1
       )

Arguments

`RES`	a fitting list result from `BootstrapTreeFitting` or `RFQTfit(odat,vdat)$RES`.
`indicator`	numeric indicator to judge which type of MSE should be calculated. `indicator=1` for Out-of-Bag (OOB) data and `indicator=2` for th testing data. Defalut value is `indicator=1`.

Value

getMSE returns a vector of MSE with the increase of the number of Q-trees. The stable MSE values should indicate an appropriate number of Q-trees.

Examples

library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

set.seed(60)
res<-getDat() #simulated data  #the deflaut setting: scenario='A' and SoM=0.5
res<-getDat(label=FALSE) 
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set
 
##When running RFQT with mutiple Q trees/bootstrap - use parallel computation
Nb<-7 #how many trees in the forest? e.g. 5 trees
cl<-makeCluster(2)
clusterEvalQ(cl=cl , expr=library(dplyr)) 
clusterEvalQ(cl=cl , expr=library(MendelianRandomization) )
clusterExport(  cl=cl ,  varlist=c('odat','vdat',
                                    'GetTree', 'GetNindex', 'GetIndex' )  )
RES<-parSapply(   cl ,  1:Nb, BootstrapTreeFitting  )
stopCluster(cl)
dim(RES)#7 Nb
 
 
getMSE(RES,indicator=1 )
getMSE(RES,indicator=2 )
library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

set.seed(60)
res<-getDat() #simulated data  #the deflaut setting: scenario='A' and SoM=0.5
res<-getDat(label=FALSE) 
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set
 
##When running RFQT with mutiple Q trees/bootstrap - use parallel computation
Nb<-7 #how many trees in the forest? e.g. 5 trees
cl<-makeCluster(2)
clusterEvalQ(cl=cl , expr=library(dplyr)) 
clusterEvalQ(cl=cl , expr=library(MendelianRandomization) )
clusterExport(  cl=cl ,  varlist=c('odat','vdat',
                                    'GetTree', 'GetNindex', 'GetIndex' )  )
RES<-parSapply(   cl ,  1:Nb, BootstrapTreeFitting  )
stopCluster(cl)
dim(RES)#7 Nb
 
 
getMSE(RES,indicator=1 )
getMSE(RES,indicator=2 )

Get the Q Tree End Node Index for Choosen Samples

Description

GetNindex helps to match the Q-tree end node for any inputed samples, and therefore enables to make further analysis like effect prediction for these inputed samples.

Usage

GetNindex(M, 
          rdat, 
          S = NA)
GetNindex(M, 
          rdat, 
          S = NA)

Arguments

`M`	A matrix or data frame contains the candidate covariate information.
`rdat`	A data frame. It is the Q-tree informaton result and obtained by `rdat<-GetTree()`.
`S`	A positive integer indicates the max depth of each sample will explore along the tree. This number should be consistent with the max depth of the fitted Q tree (also refected in `rdat`). The default valus if `NA`, which means the code will automatically use the correct `S` value.

Details

The inputed data set M must contain the same candidate covariate information (with the same order) as the training data for the Q-tree. In addition, the first four columns (individual IDs, the instrument, the exposure, the outcome) need to be removed, so that the fist column starts from the first candidate covariate.

Value

GetNindex returns a vector with the same length of the row number of the inputed data M. Each element represents the end node (i.e. leaf) index according to the reference Q-tree (the tree information was stored within rdat).

Examples

library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

set.seed(60)
res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set
 
rdat<-GetTree(odat)
 
 
vdat_Nindex<-GetNindex(vdat[,5:24] ,rdat )
#the M colnumber and order should be same as the training data
 
 
#may use another independent data (estimation data) to 
#calculate the endnode-specific IV/MR estimates
 
#for example: 
#let odat halves into trdata (tree data) and estdata (estimation data); 
#also we can have rdat<-GetTree(trdata)
 
trdat<-odat[1: (nrow(odat)/2) ,]#tree data
estdat<-odat[(nrow(odat)/2+1):nrow(odat) ,] #estimaiton data
 
rdat<-GetTree(trdat)#fitted Q tree
 
estdat$Nindex<-GetNindex(estdat[,5:24] ,rdat ) 
#rdat contains the tree information (i.e. decision rule)
library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

set.seed(60)
res<-getDat() #simulated data #the deflaut setting: scenario='A' and SoM=0.5
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set
 
rdat<-GetTree(odat)
 
 
vdat_Nindex<-GetNindex(vdat[,5:24] ,rdat )
#the M colnumber and order should be same as the training data
 
 
#may use another independent data (estimation data) to 
#calculate the endnode-specific IV/MR estimates
 
#for example: 
#let odat halves into trdata (tree data) and estdata (estimation data); 
#also we can have rdat<-GetTree(trdata)
 
trdat<-odat[1: (nrow(odat)/2) ,]#tree data
estdat<-odat[(nrow(odat)/2+1):nrow(odat) ,] #estimaiton data
 
rdat<-GetTree(trdat)#fitted Q tree
 
estdat$Nindex<-GetNindex(estdat[,5:24] ,rdat ) 
#rdat contains the tree information (i.e. decision rule)

Get the present parameters information

Description

getPars returns the information of the (hyper-)parameters relevant to the Q-tree and RFQT fitting.

Usage

getPars(empty.argument)
getPars(empty.argument)

Arguments

empty.argument

No arguments are needed.

Value

getPars() returns the following information: the total number of candidate covariates, the training data size, the testing data siz, the stratification method, the size of pre-stratum (if applicable), the proportion of the covariates randomly considered in each node split, the maximal tree depth allowed, the way to calculate the instrument-exposure associations, the fixed instrument-exposure association level (if applicable), the minimal node size, the threshold value of the Q statistic.

If the variable is not defined, it will return 'Not defined' with a value in brackets representing the default value for this variable when fitting RFQT.

Note that only the variable defined in the global enviroment will be considered. Even if one defines these variables, the functions like BootstrapTreeFitting and RFQTfit will use the default value if (s)he does not modify the function arguments.

See the original paper for the details of each parameter information.

Author(s)

Haodong Tian

Examples

library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

res<-getDat()
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set

method<-'DR'; SoP<-20; rate<-2/5

getPars()
library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

res<-getDat()
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set

method<-'DR'; SoP<-20; rate<-2/5

getPars()

Get predict effects for the OOB Data or the Testing Data with Fitted Random Forest of Q-Trees (RFQT)

Description

getPredict calculates the predicted individual treatment effect estimates for RFQT, either for the OOB data or the testing data.

Usage

getPredict(RES, 
           indicator = 1
           )
getPredict(RES, 
           indicator = 1
           )

Arguments

`RES`	a fitting list result from `BootstrapTreeFitting` or `RFQTfit(odat,vdat)$RES`.
`indicator`	numeric indicator to judge which type of predicted effect should be presented. `indicator=1` for Out-of-Bag (OOB) data and `indicator=2` for th testing data. Defalut value is `indicator=1`.

Details

See the relevant part of the original paper for more details of how to predict an individual effect according to a fitted Q tree (therefore a RFQT).

Value

getPredict returns a matrix where the row corresponds to the sample size (either the OOB data or the testing data, depending on the indicator used) and the column coresponds to the number of Q-trees. Each column represents the individual predicted effects with the present number of Q trees.

Examples

library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

set.seed(60)
res<-getDat() #simulated data  #the deflaut setting: scenario='A' and SoM=0.5
res<-getDat(label=FALSE) 
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set
 
##When running RFQT with mutiple Q trees/bootstrap - use parallel computation
Nb<-7 #how many trees in the forest? e.g. 5 trees
cl<-makeCluster(2)
clusterEvalQ(cl=cl , expr=library(dplyr)) 
clusterEvalQ(cl=cl , expr=library(MendelianRandomization) )
clusterExport(  cl=cl ,  varlist=c('odat','vdat',
                                    'GetTree', 'GetNindex', 'GetIndex' )  )
RES<-parSapply(   cl ,  1:Nb, BootstrapTreeFitting  )
stopCluster(cl)
dim(RES)#7 Nb

predict_matrix<-getPredict(RES,indicator=1 )
predict_matrix<-getPredict(RES,indicator=2 )
 
dim(predict_matrix)
###View(predict_matrix)
library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

set.seed(60)
res<-getDat() #simulated data  #the deflaut setting: scenario='A' and SoM=0.5
res<-getDat(label=FALSE) 
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set
 
##When running RFQT with mutiple Q trees/bootstrap - use parallel computation
Nb<-7 #how many trees in the forest? e.g. 5 trees
cl<-makeCluster(2)
clusterEvalQ(cl=cl , expr=library(dplyr)) 
clusterEvalQ(cl=cl , expr=library(MendelianRandomization) )
clusterExport(  cl=cl ,  varlist=c('odat','vdat',
                                    'GetTree', 'GetNindex', 'GetIndex' )  )
RES<-parSapply(   cl ,  1:Nb, BootstrapTreeFitting  )
stopCluster(cl)
dim(RES)#7 Nb

predict_matrix<-getPredict(RES,indicator=1 )
predict_matrix<-getPredict(RES,indicator=2 )
 
dim(predict_matrix)
###View(predict_matrix)

Fit and Get a Q Tree

Description

GetTree fits a data and construct a Q-tree. The fitted data is directly used to build tree (so no bootstrap). The tree information will be stored but not cleaned for further analysis. You may need BootstrapTreeFitting rather than this function.

Usage

GetTree(dat,
        S = 5,
        Qthreshold = 3.84,
        rate = 1,
        SpecificM = NA,
        method = "DR",
        SoP = 10,
        howGX = "SpecificGX",
        Halve = FALSE,
        const = NA,
        endsize = 1000)
GetTree(dat,
        S = 5,
        Qthreshold = 3.84,
        rate = 1,
        SpecificM = NA,
        method = "DR",
        SoP = 10,
        howGX = "SpecificGX",
        Halve = FALSE,
        const = NA,
        endsize = 1000)

Arguments

`dat`	a data frame. This data (without any bootstrap) will be fitted to construct a Q-tree.
`S`	a postive integer indicates the maximum tree depth allowed. Default value is 5 (i.e. the data set is allowed to be splitted at most 5 times). S is also conneted to the argument endsize, both of which work in a similar way. One may only restrict one of them and leave another one be a flexible value (e.g. `S=100` or `endsize=0`).
`Qthreshold`	A positive value used as the threshold value for the Q statistic. Default value is 3.0.
`rate`	a value ranging from 0 to 1 indicates the proportion of the candidate covariates to be randomly cosidered in each split when construcing the Q-tree. When `rate=1`, it means all the candiate covariates will be considered.
`SpecificM`	a vector indicates the index of candidate variables that can be considered in spliiting. Tbe default value is `NA`, which means all candidate variables are possible to be considered.
`method`	a character indicates the stratification method used for constructing a Q-tree. There are currently three methods: the Doubly-ranked method (`method='DR'`), the residual methods (`method='Residual'`) and the naive method (recognized by any character except `'DR'` and `'Residual'`). The default method is `'DR'`.
`SoP`	a postive integer (>=2) indicates the size of pre-straum. Only applicable for the doubly-ranked stratification. Default value is 10.
`howGX`	a character indicates the way to calculate the instrument-exposure associations. Two ways are currenly allowed: the instrument-exposure associations estimated seperately at each stratum (`howGX='SpecificGX'`), or use a fixed instrument-exposure association (`howGX='const')`
`Halve`	Logical. Indictes whether to split the node by half:half (i.e. 5:5). Default value is `FALSE`.
`const`	a value indicates the fixed instrument-exposure association value used for constructing the Q-tree. Only applicable for `howGX='SpecificGX'`.
`endsize`	a positive integer value indicates the minimual size of the node of Q-tree allowed. One can also control the endsize via `S`.

Details

dat must be a data frame that can be regonized by the functions. The first four columns must be the individual IDs, the (one-dimensional) instrument, the exposure (should be NA if not available), and the outcome, respectively. The (high-dimensional) covariates are then follwing. Note that all the functions may be working with incorrect column ordering, but the results are misleading. DO make sure the column variable order is correct.

If simulated data is used, there should be a end column named as true_STE. That is, dat$true_STE is not NULL.

Value

GetTree returns a data set with the same row dimension as the inputed data dat. The returned data is 'augmented' with tree information including the tree end node index, and the splitting information in each split.

References

Examples

library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

res<-getDat()
odat<-res$traning.set  #training set

tree_result<-GetTree(odat)
library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

res<-getDat()
odat<-res$traning.set  #training set

tree_result<-GetTree(odat)

Get Varibale Importance Measurements with Fitted Q-Tree or Random Forest of Q-Trees (RFQT)

Description

getVI calculates the Varibale Importance (VI) measurements for a random forest of Q-trees (RFQT).

Usage

getVI(RES, 
      VItype = 2
      )
getVI(RES, 
      VItype = 2
      )

Arguments

`RES`	a fitting list result from `BootstrapTreeFitting` or `RFQTfit(odat,vdat)$RES`.
`VItype`	numeric indicator to judge which type of VI measurement should be calculated. `VItype=1` for the classical way with the known individual treatment effects (i.e. label) and `VItype=2` for case that the individual treatment effects are unavailable (as in most real application). Defalut value is `VItype=2`.

Details

See the VI measurement part of the original paper for more detailed VI introduction and the algorithm.

Value

getVI retuns a vector indicating the variable importacne (VI) measurements for all the candidate covariates (with the same ordering of the candidate covariates to the original fitting data).

Examples

library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

set.seed(60)
res<-getDat() #simulated data  #the deflaut setting: scenario='A' and SoM=0.5
res<-getDat(label=FALSE) 
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set
 
##When running RFQT with mutiple Q trees/bootstrap - use parallel computation
Nb<-7 #how many trees in the forest? e.g. 5 trees
cl<-makeCluster(2)
clusterEvalQ(cl=cl , expr=library(dplyr)) 
clusterEvalQ(cl=cl , expr=library(MendelianRandomization) )
clusterExport(  cl=cl ,  varlist=c('odat','vdat',
                                    'GetTree', 'GetNindex', 'GetIndex' )  )
RES<-parSapply(   cl ,  1:Nb, BootstrapTreeFitting  )
stopCluster(cl)
dim(RES)#7 Nb

getVI(RES,VItype=1 )
getVI(RES,VItype=2 )
library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

set.seed(60)
res<-getDat() #simulated data  #the deflaut setting: scenario='A' and SoM=0.5
res<-getDat(label=FALSE) 
odat<-res$traning.set  #training set
vdat<-res$testing.set  #testing set
 
##When running RFQT with mutiple Q trees/bootstrap - use parallel computation
Nb<-7 #how many trees in the forest? e.g. 5 trees
cl<-makeCluster(2)
clusterEvalQ(cl=cl , expr=library(dplyr)) 
clusterEvalQ(cl=cl , expr=library(MendelianRandomization) )
clusterExport(  cl=cl ,  varlist=c('odat','vdat',
                                    'GetTree', 'GetNindex', 'GetIndex' )  )
RES<-parSapply(   cl ,  1:Nb, BootstrapTreeFitting  )
stopCluster(cl)
dim(RES)#7 Nb

getVI(RES,VItype=1 )
getVI(RES,VItype=2 )

Fit a Random Forest of Q Trees (RFQT)

Description

RFQTfit fits a data set (the training data) by the random forest of Q trees (RFQT), and return the summary results.

Usage

RFQTfit(odat,
        vdat = NA,
        Nb = 5,
        S = 5,
        honest = FALSE,
        rate = 0.4,
        SingleM = FALSE,
        Qthreshold = 3.84,
        method = "DR",
        SoP = 10,
        howGX = "SpecificGX",
        Halve = FALSE,
        endsize = 1000,
        const = NA,
        Cores = NA,
        trackfile=NA
        )
RFQTfit(odat,
        vdat = NA,
        Nb = 5,
        S = 5,
        honest = FALSE,
        rate = 0.4,
        SingleM = FALSE,
        Qthreshold = 3.84,
        method = "DR",
        SoP = 10,
        howGX = "SpecificGX",
        Halve = FALSE,
        endsize = 1000,
        const = NA,
        Cores = NA,
        trackfile=NA
        )

Arguments

`odat`	a data frame. This is recognized as the trainging data set.
`vdat`	a data frame. This is recognized as the testing or validation data set. The default value is `NA`, indicating that no testing data will be considered.
`Nb`	a positive integer indicates the number of bootstrap (i.e. the number of Q-trees)
`S`	a postive integer indicates the maximum tree depth allowed. Default value is 5 (i.e. the data set is allowed to be splitted at most 5 times). `S` is also conneted to the argument `endsize`, both of which work in a similar way. One may only restrict one of them and leave another one be a flexible value (e.g. `S=100` or `endsize=0`).
`honest`	logical value indicates whether to use the honest estimation style or not. Default value is `FALSE`. When honest estimation is used, the tree construction and the leaf-specific estimates will be obtained in two seperate samples of the (boostrapped) training data.
`rate`	a value ranging from 0 to 1 indicates the proportion of the candidate covariates to be randomly cosidered in each split when construcing the Q-tree.
`SingleM`	logical value indicates whether to use one single covariate variable at all times. Default value is `FALSE`. If `SingleM=TRUE`, the covariate used are the best variable indicated by `GetIndex`.
`Qthreshold`	a positive value used as the threshold value for the Q statistic. Default value is 3.84. One can set `Qthreshold=0` to ignore the Q-related stopping rule.
`method`	a character indicates the stratification method used for constructing a Q-tree. There are currently three methods: the Doubly-ranked method (`method='DR'`), the residual methods (`method='Residual'`) and the naive method (recognized by any character except `'DR'` and `'Residual'`). The default method is `'DR'`.
`SoP`	a postive integer (>=2) indicates the size of pre-straum. Only applicable for the doubly-ranked stratification. Default value is 10.
`howGX`	a character indicates the way to calculate the instrument-exposure associations. Two ways are currenly allowed: the instrument-exposure associations estimated seperately at each stratum (`howGX='SpecificGX'`), or use a fixed instrument-exposure association (`howGX='const'`).
`Halve`	logical. Indictes whether to split the node by half:half (i.e. 5:5). Default value is `FALSE`.
`endsize`	a positive integer value indicates the minimual size of the node of Q-tree. `S` and `endsize` work in a similar way.
`const`	a value indicates the fixed instrument-exposure association value used for constructing the Q-tree. Only applicable for `howGX='SpecificGX'`.
`Cores`	a positive integer indicates the number of cores should be used when running random forest in parallel. Default value is the number of CPU cores minus one. You can check the number of CPU cores via `detectCores()`.
`trackfile`	the path string indicates the file to store the running information in parallel.

Details

RFQTfit is an integrated function combining other functions, including GetTree, BootstrapTreeFitting, getPredict, getVI, etc. The fitting automatically uses parallel computation.

Value

RFQTfit shows all the parameter information used for random forest fitting in the begining, and returns in the end a list consisting of the following components:

`RES`	The direct fitting results of each Q-tree, which contains `end_node_information`, `OOB_predict`, `v_predict`, `vi1`, `vi2`, `ts1` and `ts2` (same as the results of `BootstrapTreeFitting`) for each Q-tree. Note that `RES` is multi-dimensional. If one wishes to obtain the individual predicted effects of the testing set according to i-th Q-tree, should run `RES[3,i]$v_predict`.
`MSE_OOB`	the `Nb`-length vector of MSE results for OOB samples. If one determine the number of Q-trees to be `Nb`, simply use the final vlue of `MSE_OOB` as the MSE.
`MSE_test`	the `Nb`-length vector of MSE results for test samples (if the testing set exists). If one determine the number of Q-trees to be `Nb`, simply use the final vlue of `MSE_test` as the MSE.
`Predict_OOB`	the matrix of predicted effects for OOB samples, where the row corresponds to the sample size (the testing data) and the column coresponds to the number of Q-trees. Each column represents the individual predicted effects with the present number of Q trees. If one determine the number of Q-trees to be `Nb`, simply use the final colunm of `Predict_OOB` as the individual predicted effect vector.
`Predict_test`	the matrix of predicted effects for testing set, where the row corresponds to the sample size (the testing data) and the column coresponds to the number of Q-trees. Each column represents the individual predicted effects with the present number of Q trees. If one determine the number of Q-trees to be `Nb`, simply use the final colunm of `Predict_test` as the individual predicted effect vector.
`VI1`	the vector indicates the type I (i.e. with labels) variable importacne (VI) measurements for all the candidate covariates (with the same ordering of the candidate covariates to the original fitting data).
`VI2`	the vector indicates the type II (i.e. without labels) variable importacne (VI) measurements for all the candidate covariates (with the same ordering of the candidate covariates to the original fitting data).
`TS`	the vector of the permutation test statistic values. The two values represent the values of TS1 and TS2, respectively.

References

Examples

library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

set.seed(60)
res<-getDat() #simulated data  #the deflaut setting: scenario='A' and SoM=0.5
odat<-res$traning.set  
vdat<-res$testing.set  

###try: ALLRES<-RFQTfit(odat)
 
###try: ALLRES<-RFQTfit(odat,vdat,Nb=200)


library(MendelianRandomization)
library(tidyverse) 
library(data.table) 
library(parallel)

set.seed(60)
res<-getDat() #simulated data  #the deflaut setting: scenario='A' and SoM=0.5
odat<-res$traning.set  
vdat<-res$testing.set  

###try: ALLRES<-RFQTfit(odat)
 
###try: ALLRES<-RFQTfit(odat,vdat,Nb=200)

Package 'RFQT'

Help Index

One Bootstrap Q Tree Fitting

Description

Usage

Arguments

Details

Value

References

Examples

Fit a Classical Single Decision Tree (DT)

Description

Usage

Arguments

Details

Value

References

Examples

Creat a Simulated Data

Description

Usage

Arguments

Details

Value

Examples

Get the Best Covariate Index

Description

Usage

Arguments

Details

Value

References

Examples

Get MSE Values for the OOB Data or the Testing Data with Fitted Q-Tree or Random Forest of Q-Trees (RFQT)

Description

Usage

Arguments

Value

Examples

Get the Q Tree End Node Index for Choosen Samples

Description

Usage

Arguments

Details

Value

Examples

Get the present parameters information

Description

Usage

Arguments

Value

Author(s)

Examples

Get predict effects for the OOB Data or the Testing Data with Fitted Random Forest of Q-Trees (RFQT)

Description

Usage

Arguments

Details

Value

Examples

Fit and Get a Q Tree

Description

Usage

Arguments

Details

Value

References

Examples

Get Varibale Importance Measurements with Fitted Q-Tree or Random Forest of Q-Trees (RFQT)

Description

Usage

Arguments

Details

Value

Examples

Fit a Random Forest of Q Trees (RFQT)

Description

Usage

Arguments

Details