Title: | Constrained Instrumental Variables in Mendelian Randomization with Pleiotropy |
---|---|
Description: | In Mendelian randomization (MR), genetic variants are used to construct instrumental variables that then enable inference about the causal relationship between a phenotype of interest and a response or disease outcome. However, valid MR inference requires several assumptions, including the assumption that the genetic variants only influence the response through the phenotype of interest.Pleiotropy occurs when a genetic variant has an effect on more than one different phenotypes, and therefore a pleiotropic genetic variant may be an invalid instrumental variable.Hence, a naive method for constructing an instrumental variables may lead to biased estimation of the association between the phenotype and the response. Here, we encode a new and intuitive method (Constrained Instrumental Variable method [CIV]) to construct valid instrumental variables and perform adjusted causal effect estimation when pleiotropic exists, focusing particularly on the situation where pleiotropic phenotypes have been measured. Our approach is theoretically guaranteed to perform an automatic and valid selection of genetic variants when building the instrumental variable. We also provide details of the features of many existing methods, together with a comparison of their performance in a large series of simulations. CIV performs robustly across many different pleiotropic violations of the MR assumptions. |
Authors: | Lai Jiang [aut, cre] |
Maintainer: | Lai Jiang <[email protected]> |
License: | What license is it under? |
Version: | 0.0.1 |
Built: | 2025-01-02 05:12:44 UTC |
Source: | https://github.com/LaiJiang/CIVMR |
A dataset collected from ADNI project for MR analysis in the CIV paper.
It contains 491 subjects, whose phenotypes (, Ptau, Ttau, Glucose levels) and
Alzheimer's status were collected. 20 SNPs associated with
were selected and
their dosage of the 491 subjects were recorded. This data file is extracted to serve as an example
to estimate the causal effect of
on progression of Alzheimer's disease while accounting for
potential pleiotropic effect from other phenotypes.
data(ADNI)
data(ADNI)
ADNI$Y: |
the Alzheimer's disease status. Continuous variable. The raw status is binary variable, and we adjusted it for confounding factors such as sex, age, education ... etc. |
ADNI$X: |
The phenotype of interest |
ADNI$Z: |
The potential pleiotropic phenotypes (Ptau, Ttau, Glucose levels). Continuous variables. |
ADNI$G: |
genotypes. The adjusted dosage of 20 SNPs. |
An object of class "data.frame"
.
data(ADNI) X <- ADNI$X Z <- ADNI$Z G <- ADNI$G Y <- ADNI$Y
data(ADNI) X <- ADNI$X Z <- ADNI$Z G <- ADNI$G Y <- ADNI$Y
This function implement Allele score methods with cross-validation in the way Stephen Burgess suggested in the Allele score methods paper.
allele(MR.data, n_folds = 10)
allele(MR.data, n_folds = 10)
MR.data: |
data frame containing G,X,Z,Y. |
n_folds: |
the number of folds for cross-validation. |
weights: the weights for allele score across folds. Each row is a weight vector corresponding to a specific fold.
allele_score: The cross-validated Allele score, which would be used as the new instruments in MR analysis.
beta_est: the causal effect estimation of .
data(simulation) allele.score <- allele(simulation,n_folds=10)
data(simulation) allele.score <- allele(simulation,n_folds=10)
This function generate a bootstrapped CIV w/wo correction. Specifically, for a bootstrap sample we can generate civ solution u. The boostrap corrected solution u is obtained as the global solution u - ( bootrapped average u - global u).
boot_CIV(MR.data, n_boots = 10)
boot_CIV(MR.data, n_boots = 10)
MR.data: |
a data frame containing G,X,Z,Y. |
n_boots: |
number of bootstrap samples. |
boots.u: bootstrapped CIV solution u (without correction).
boots.cor.u: bootstrap corrected solution of u. (suggested)
data(simulation) boot.civ <- boot_CIV(simulation) #plot the bootstrap corrected solution u. plot(boot.civ$boots.cor.u)
data(simulation) boot.civ <- boot_CIV(simulation) #plot the bootstrap corrected solution u. plot(boot.civ$boots.cor.u)
This function find the unique CIV solution.
CIV(MR.data)
CIV(MR.data)
MR.data: |
A data.frame() object containg G,X,Z,Y. |
c: solution vector to the constrained maximization problem.
max_value: the maximized correlation value.
CIV: the new instrumentable variable for MR analysis: constrained instrumental variable.
data(simulation) civ.fit <- CIV(simulation) #plot the weight c plot(civ.fit$c)
data(simulation) civ.fit <- CIV(simulation) #plot the weight c plot(civ.fit$c)
This function produce a Constrained Instrumental Variable with cross-validation. Specifically, for a predefined fold the CIV is calculated using all samples except this fold, then the CIV solution is applied to the samples in this fold to obtain corresponding CIV. In this way the correlation between samples are expected to be reduced.
cv_CIV(MR.data, n_folds = 10)
cv_CIV(MR.data, n_folds = 10)
MR.data: |
a data frame containing G,X,Z,Y. |
n_folds: |
number of folds for cross-validation. |
weights: A matrix with dimension . Each row is a CIV solution
from a specific fold.
civ.IV: cross-validated CIV instrument .
beta_est: causal effect estimation of X on Y using CIV instrument civ.IV
data(simulation) cv.civ <- cv_CIV(simulation) #strong correlation between CIV solutions from different folds. cor(t(cv.civ$weights))
data(simulation) cv.civ <- cv_CIV(simulation) #strong correlation between CIV solutions from different folds. cor(t(cv.civ$weights))
This function remove highly correlated IVs. An upgraded function SNP_reduction() is suggested.
IV_reduction(snp_matrix, crit_high_cor = 0.8)
IV_reduction(snp_matrix, crit_high_cor = 0.8)
snp_matrix: |
IV matrix with dimension n * p. |
crit_high_cor: |
criteria to choose highly correlated SNPs. default is 0.8 correlation. |
sel_snp: the selected IVs.
id_snp: the ids (columns) of selected IVs in the original IV matrix.
data(simulation) snp.rdc <- IV_reduction(simulation$G)
data(simulation) snp.rdc <- IV_reduction(simulation$G)
This function implements linear algebra steps to acquire necessary matrices for CIV construction.
LA_decomposition(G, X, Z)
LA_decomposition(G, X, Z)
G: |
original instruments with dimension nXp. |
X: |
phenotype of interest. dimension nXk. |
Z: |
possible pleiotropic phenotypes which have been measured. dimension nXr. |
A list of matrices which will be called by solv_pcc() and pcc_IV().
data(simulation) LA_decomposition(simulation$G,simulation$X,simulation$Z)
data(simulation) LA_decomposition(simulation$G,simulation$X,simulation$Z)
univaraite t-test pvalues for a regression.
lmp(modelobject)
lmp(modelobject)
modelobject: |
a regression object. |
p: pvalue.
Given response $Y$ and a set of features $X$, this function obtains univariate T-test p-values for each of the feature for selection purpose.
lmPvalue(Y, X)
lmPvalue(Y, X)
Y: |
response variable. |
X: |
the independent features. |
pvalue_list: the list of Pvalues for feature selection based on univariate T-test.
data(simulation) p.values <- lmPvalue(simulation$X, simulation$G)
data(simulation) p.values <- lmPvalue(simulation$X, simulation$G)
This function find multiple CIV solutions that are orthogonal to each other. Only the first one achive the global maximum correlation.
pcc_IV(A, B, G, inv_GG_square, no_IV = ncol(G) - ncol(B))
pcc_IV(A, B, G, inv_GG_square, no_IV = ncol(G) - ncol(B))
A: |
matrix given by LA_decomposition(). |
B: |
matrix given by LA_decomposition(). |
G: |
original instruments. |
u_max: the solution of u that would maximize the constrained correlation problem.
data(simulation) #CIV linear algebra decomposition components civ.deco <- LA_decomposition(simulation$G,simulation$X,simulation$Z) #solve the CIV solution for c civ.mult <- pcc_IV(civ.deco$A, civ.deco$B, simulation$G, civ.deco$inv_GG_square)
data(simulation) #CIV linear algebra decomposition components civ.deco <- LA_decomposition(simulation$G,simulation$X,simulation$Z) #solve the CIV solution for c civ.mult <- pcc_IV(civ.deco$A, civ.deco$B, simulation$G, civ.deco$inv_GG_square)
this function removes IVs with extreme low correlation and extreme high prediction error. This is an experimental function to check how many redundant solutions are found in smooth.opt object.
rm_outlier_IV(smooth_IV, MR.data, crit = 0.9, sigma_min = 0.01)
rm_outlier_IV(smooth_IV, MR.data, crit = 0.9, sigma_min = 0.01)
smooth_IV: |
an object from smooth_CIV() function. |
MR.data: |
data frame containing G,X,Z,Y. |
......: |
default values for other tuning parameters. |
IV_mat: the final matrix of CIV instruments.
u_mat: the final CIV solutions of u. Each column is a distinct solution.
data(simulation) G <- simulation$G X <- simulation$X Z <- simulation$Z Y <- simulation$Y smooth.opt <- smooth_CIV( G,X,Z,Y, k_folds = 10) smooth.clean <- rm_outlier_IV(smooth.opt, simulation) dim(smooth.clean$u_mat) #check how many solutions are different. It is probability much less than 100.
data(simulation) G <- simulation$G X <- simulation$X Z <- simulation$Z Y <- simulation$Y smooth.opt <- smooth_CIV( G,X,Z,Y, k_folds = 10) smooth.clean <- rm_outlier_IV(smooth.opt, simulation) dim(smooth.clean$u_mat) #check how many solutions are different. It is probability much less than 100.
A simulated data using the same framework from simulation series I in the CIV paper. It contains 500 subjects with one phenotype (X) of interest, one pleiotropic phenotype (Z) and one outcome (Y) simulated for each subject. 9 SNPs were generated and they were associated with both X and Z. The true causal effect of X on Y is 1. This data file is simulated to serve as an example to estimate the causal effect of X on Y while accounting for potential pleiotropic effect from Z. Users can compare the performance of different MR methods on this simulation dataset since the true causal effect is known.
data(simulation)
data(simulation)
simulation$Y: |
the simulated outcome Y. Continuous variable. |
simulation$X: |
The simulated phenotype of interest X. Continuous variable. |
simulation$Z: |
The potential pleiotropic phenotype Z. Continuous variable. |
simulation$G: |
The simulated genotypes. The dosage of 9 independent SNP variants were simulated with a minor allele frequency of 0.3 for all 500 subjects. |
An object of class "data.frame"
.
data(simulation) X <- simulation$X Z <- simulation$Z geno <- simulation$G outcome <- simulation$Y
data(simulation) X <- simulation$X Z <- simulation$Z geno <- simulation$G outcome <- simulation$Y
This function first find the optimal value of according
to projected prediction error with cross-validation. Then for a given
value multiple intial
points are used to explore potentially multiple modes.
smooth_CIV(G, X, Z, Y, lambda_list = NULL, k_folds = 10, sigma_min = 0.01, sigma_up = 0.5, stepsize = 0.1, conv_iters = 5, stepsize_last = 1e-04, last_conv_iters = 2000, method_lambda = "er", n_IV = 100)
smooth_CIV(G, X, Z, Y, lambda_list = NULL, k_folds = 10, sigma_min = 0.01, sigma_up = 0.5, stepsize = 0.1, conv_iters = 5, stepsize_last = 1e-04, last_conv_iters = 2000, method_lambda = "er", n_IV = 100)
initial: |
the initial value for updating u. |
G: |
SNP matrix with dimension |
X: |
phenotype of interest. |
Z: |
pleiotropic phenotype Z. |
Y: |
the disease outcome Y. |
lambda_list: |
a list of values for regularization parameter lambda. A default list will be chosen if not provided. |
k_folds: |
number of folds for cross-validation (to find optimum |
n_IV: |
the number of initial points chosen to explore potential multiple modes. The converged solutions will be screened to delete redundant solutions. So the final solutions will be less or equal to n_IV. default = 100. |
sigma_min: |
the minimum value of |
sigma_up: |
the moving down multiplier. |
stepsize: |
the stepsize to move solution u. default = 0.1. |
conv_iters: |
the maximum steps to allow updating when a converged solution is found. default =5. |
stepsize_last: |
When a converged solution is found with stepsize, we update this solution with a smaller stepsize to achive a more precise local maximum solution. default = 0.0001. |
last_conv_iters: |
the maximum iterations to run in the stage of “refining" optimum solution. default = 2000. |
......: |
default values for other tuning parameters. |
opt_lambda: the chosen optimum value of corresponding to the minimum projected prediction error (see paper).
IV_mat: the final matrix of CIV instruments with respect to the opt_lambda. Each column is a new instrument.
u_mat: the final CIV solutions of u with respect to the opt_lambda. Each column is a converged solution.
G_pred_error_list: the projected prediction error according to the list values of .
Pred_error_list: the prediction error according to the list values of .
data(simulation) G <- simulation$G X <- simulation$X Z <- simulation$Z Y <- simulation$Y smooth.opt <- smooth_CIV( G,X,Z,Y, k_folds = 10) plot(smooth.opt$u_mat[,1]) #plot a solution u.
data(simulation) G <- simulation$G X <- simulation$X Z <- simulation$Z Y <- simulation$Y smooth.opt <- smooth_CIV( G,X,Z,Y, k_folds = 10) plot(smooth.opt$u_mat[,1]) #plot a solution u.
. (Internal function)This function finds a CIV_smooth solution of u given a value of . This function is mostly for internal use. smooth_CIV() is suggested for users to obtain optimal solutions of CIV_smooth.
smooth_L0_lambda(initial = NULL, null_space, G, X, GTG, lambda, sigma_min = 0.01, sigma_up = 0.5, stepsize = 0.1, conv_iters = 5, stepsize_last = 1e-04, last_conv_iters = 2000, GTMG, ZTG, GTZ, ZTG_ginv, accuracy_par = 1e-10)
smooth_L0_lambda(initial = NULL, null_space, G, X, GTG, lambda, sigma_min = 0.01, sigma_up = 0.5, stepsize = 0.1, conv_iters = 5, stepsize_last = 1e-04, last_conv_iters = 2000, GTMG, ZTG, GTZ, ZTG_ginv, accuracy_par = 1e-10)
initial: |
the initial point of u for updating. The CIV solution will be used as the initial point if no choice is made. |
G: |
SNP matrix with dimension |
X: |
phenotype of interest. |
Z: |
pleiotropic phenotype Z. |
GTG: |
|
GTMG: |
|
ZTG: |
|
GTZ: |
|
ZTG_ginv: |
general inverse of |
null_space: |
null space of matrices G'Z (null(G'Z)). |
lambda: |
a given value (must be specified) for regularization parameter |
accuracy_par: |
the accuracy threshold parameter to determine if the algorithm converged to a local maximum. Default is 1e-10. |
last_conv_iters: |
the maximum iterations to run. Default is 2000. |
......: |
default values for other tuning parameters. |
mat_u: the trace of all updated iterations of u.
opt_solution: the final solution of u.
value_list: the iteration values of target function (penalized correlation).
unstrained_val_list: the iteration values of correlation between X and Gu.
dev_list: the iteration values of deviance between updated vector of u.
n_iters_stage: the number of iterations before finishing updating. If this value < last_conv_iters, then the algorithm stopped at a solution of u without using up its updating quota.
sigma_stage: the updating values of that are used in each iteration.
stepsize_list: the updating values of stepsize that are used in each iteration.
data(simulation) G <- simulation$G X <- simulation$X Z <- simulation$Z GTG <- crossprod(G,G) M <- tcrossprod ( tcrossprod ( X , solve(crossprod(X,X) ) ), X ) GTMG <- crossprod(G, crossprod(M,G)) ZTG <- crossprod(Z,G) GTZ <- crossprod(G,Z) null_space <- Null( GTZ) ZTG_ginv <- ginv(ZTG) lambda <- 1 smooth.lambda1 <- smooth_L0_lambda(null_space = null_space, G = G, X = X, GTG = GTG, lambda = lambda, GTMG = GTMG, ZTG = ZTG, GTZ = GTZ, ZTG_ginv = ZTG_ginv ) plot(smooth.lambda1$opt_solution) #plot the final solution u
data(simulation) G <- simulation$G X <- simulation$X Z <- simulation$Z GTG <- crossprod(G,G) M <- tcrossprod ( tcrossprod ( X , solve(crossprod(X,X) ) ), X ) GTMG <- crossprod(G, crossprod(M,G)) ZTG <- crossprod(Z,G) GTZ <- crossprod(G,Z) null_space <- Null( GTZ) ZTG_ginv <- ginv(ZTG) lambda <- 1 smooth.lambda1 <- smooth_L0_lambda(null_space = null_space, G = G, X = X, GTG = GTG, lambda = lambda, GTMG = GTMG, ZTG = ZTG, GTZ = GTZ, ZTG_ginv = ZTG_ginv ) plot(smooth.lambda1$opt_solution) #plot the final solution u
This function remove highly correlated SNPs. It also calculate MAF for each snp and delete rare snps with low MAF (e.g. 0.01).
SNP_reduction(snp_matrix, crit_high_cor = 0.8, maf_crit = 0.01)
SNP_reduction(snp_matrix, crit_high_cor = 0.8, maf_crit = 0.01)
snp_matrix: |
SNP matrix with dimension n * p. |
crit_high_cor: |
criteria to choose highly correlated SNPs. |
maf_crit: |
criteria to choose rare SNPs. |
sel_snp: the new dosage matrix of selected SNPs.
id_snp: the ids (columns) of selected SNPs in the original SNP matrix.
data(simulation) snp.rdc <- SNP_reduction(simulation$G)
data(simulation) snp.rdc <- SNP_reduction(simulation$G)
This function find a unique solutin to the constrained instrument problem given matrix A and B.
solve_pcc(A, B)
solve_pcc(A, B)
A: |
matrix given by LA_decomposition(). |
B: |
matrix given by LA_decomposition(). |
c: solution to the constrained maximization problem.
max_value: the maximized correlation value.
data(simulation) #CIV linear algebra decomposition components civ.deco <- LA_decomposition(simulation$G,simulation$X,simulation$Z) #solve the CIV solution \eqn{u} civ.c <- solve_pcc(civ.deco$A, civ.deco$B) #plot the weight c plot(civ.c$c)
data(simulation) #CIV linear algebra decomposition components civ.deco <- LA_decomposition(simulation$G,simulation$X,simulation$Z) #solve the CIV solution \eqn{u} civ.c <- solve_pcc(civ.deco$A, civ.deco$B) #plot the weight c plot(civ.c$c)
This function implement ordinary two stage least square regression and provide variance estimation (if requested).
TSLS_IV(MR.data, Fstats = FALSE, var_cal = FALSE)
TSLS_IV(MR.data, Fstats = FALSE, var_cal = FALSE)
MR.data: |
data frame containing G,X,Z,Y. |
Fstats: |
return F-statistics or not. If multiple phenotypes (X) are used, Pillai statistics will be used instead. |
var_cal: |
return variance estimation or not. |
coef: the causal effect estimation .
var: the variance estimation of . if var_cal=TRUE.
stats: F-statistics (or Pillai statistics). if Fstats=TRUE.
pvalue: the pvalue of F-statistics. if Fstats=TRUE.
data(simulation) TSLS_IV(simulation,Fstats=TRUE,var_cal=TRUE)
data(simulation) TSLS_IV(simulation,Fstats=TRUE,var_cal=TRUE)