--- title: "MRLocus - estimation of gene-to-trait effects" date: "`r format(Sys.Date(), '%m/%d/%Y')`" author: "Anqi Zhu, Nana Matoba, Jason Stein, Michael Love" output: rmarkdown::html_document: self_contained: false highlight: pygments abstract: | MRLocus provides estimation of gene-to-trait effects from eQTL and GWAS summary statistics, for loci with allelic heterogeneity: multiple nearly-LD-independent signal clusters representing distinct eQTLs. Evidence is aggregated across signal clusters for mediation of the downstream GWAS trait through the gene. MRLocus package version: `r packageVersion("mrlocus")` bibliography: library.bib vignette: | %\VignetteIndexEntry{MRLocus - estimation of gene-to-trait effects} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, echo=FALSE, results="hide"} knitr::opts_chunk$set(cache=FALSE, message=FALSE, warning=FALSE) ``` # Introduction MRLocus provides Bayesian analysis of gene-to-trait mediation effects, making use of the Stan and RStan software packages for inference [@stan;@rstan]. The authors of the associated [MRLocus manuscript](https://doi.org/10.1371/journal.pgen.1009455) [@mrlocus] are: > Anqi Zhu\*, Nana Matoba\*, Emma P. Wilson, Amanda L. Tapia, > Yun Li, Joseph G. Ibrahim, Jason L. Stein, Michael I. Love \* Both authors contributed equally to this work. # Data input and preprocessing In this vignette we demonstrate use of MRLocus on simulated summary statistics. The simulated data in this vignette does not include the steps of importing summary statistics and LD matrices from external files. For that, we recommend looking at the R code we used for analysis of the [public datasets](https://github.com/mikelove/mrlocusPaper/blob/master/public/mrlocus.R) in the manuscript, and see also the basic data input information in the README (also shown on the MRLocus [homepage](https://mikelove.github.io/mrlocus)). Briefly, three important helper functions for import and preprocessing that are provided in this package are `collapseHighCorSNPs`, `flipAllelesAndGather`, and `trimClusters`, for importing summary statistics across nearly-LD-independent signal clusters (sets of SNPs that are non-overlapping and have low correlation in a reference population). We use the phrase "LD-independent signal cluster" as in the PTWAS software and paper [@ptwas]. It is strongly recommended to use these three functions for data import, in particular as the first function will reduce the number of SNPs per cluster and help deal with identifiability issues during the Bayesian colocalization step. As described in the manuscript, MRLocus does not attempt to resolve putative causal SNPs among sets of SNPs with correlation $> 0.95$, but instead it collapses this redundancy using `collapseHighCorSNPs`. The second function ensures that the alleles are flipped properly across eQTL and GWAS datasets (the function will handle this to avoid manual processing by the user), such that the alleles are coded in such a way that the downstream statistical modeling functions will expect. The second function ensures that both studies refer to the same allele, which is the expression-increasing allele for an index SNP within each nearly-LD-independent signal cluster. Finally, the last function is used to ensure that the candidate instruments have $r^2$ values below a specified threshold (so that they are satisfying the "nearly-LD-independent" description). We recommend using PLINK [@plink] for generation of signal clusters (see Methods section of MRLocus manuscript). We have used raw p-value thresholds from eQTL in order to cull lists of signal clusters (instruments) which have an effect on the putative mediator (gene expression). Note that we do not recommend running MRLocus on loci with only one signal cluster, i.e. loci lacking allelic heterogeneity. The `fitSlope` function will provide a warning if such input data is provided. The estimation of the slope is impaired with a single signal cluster, and estimation of the dispersion across signal clusters is not possible. We have not yet implemented, but plan on providing an import function from the [TwoSampleMR](https://github.com/MRCIEU/TwoSampleMR) package. Un-evaluated example code for the two preprocessing functions follows. In the first function the arguments are: * `sum_stat` - list of summary statistics over signal clusters * `ld_mat` - list of LD matrices over signal clusters * optionally, `ld_mat2` can also be provided for analyses with two populations ```{r eval=FALSE} # (un-evaluated chunk) data1 <- collapseHighCorSNPs(sum_stat, ld_mat) ``` The second function takes the output of the first function, and additionally a number of column names. E.g. this function call implies that we have column names `ref_eqtl`, `effect_eqtl`, `beta_eqtl`, `se_eqtl`, etc. See the function man page for more details. If the column names are of the form `eqtl_beta`, `eqtl_se`, etc. such that the A/B study specification is first in the column names, the user should then set `ab_last=FALSE`. ```{r eval=FALSE} # (un-evaluated chunk) data2 <- flipAllelesAndGather(data1$sum_stat, data1$ld_mat, a="eqtl", b="gwas", ref="ref", eff="effect", beta="beta", se="se", a2_plink="major_plink", snp_id="SNP", sep="_") ``` In order to inspect whether allele flipping worked as expected, this function will by default make a plot of the initial estimates, or you can manually run `plotInitEstimates` on the output of this function. In the next section an evaluation of this plot is discussed. The third helper function, `trimClusters`, can be run after `extractForSlope` (this occurs after colocalization). # QC of input pre-colocalization The following simulated summary statistics are generated from a multivariate normal distribution, similar to the model used in eCAVIAR [@eCAVIAR]. In contrast, the simulations in the MRLocus manuscript make use of the [twas_sim](https://github.com/mancusolab/twas_sim) framework [@twassim]. ```{r} library(mrlocus) set.seed(1) data <- makeSimDataForMrlocus() ``` This simulated data consists of a list of lists. The elements are: `beta_hat_a`, `beta_hat_b`, `se_a`, `se_b`, `Sigma_a`, `Sigma_b`, and `alleles`, which provide the estimated coefficients, standard errors (SE), and LD matrices for the two studies, A and B (typically, eQTL and GWAS), as well as a data.frame with allele information (SNP ID, reference allele, and effect allele). The LD matrices can be identical if distinct matrices for each study are not available. Each element itself should be a list of vectors/matrices/data.frames across nearly-LD-independent signal clusters. We can make a simple plot of the `beta_hat`'s, estimated coefficients from A and B. The coefficients are numbered by the SNP order within each signal cluster. ```{r initests, fig.height=7} plotInitEstimates(data) ``` **Evaluating this plot:** The input data for colocalization, e.g. after running `collapseHighCorSNPs` and `flipAllelesAndGather` on real eQTL and GWAS summary statistics within PLINK clumps, is typically not more than 20-30 SNPs per clusters, and with a certain pattern of effect sizes within each cluster. You would not expect to have SNPs *within a cluster* having very discordant effects size in the x- or y-direction, and quality control of the allele coding is important for the colocalization step to give reasonable results. It is fine however for their to be discordant effect sizes in the y-direction *across clusters*, this is important evidence against a consistent mediation effect. # Colocalization step with MRLocus The first step in MRLocus is to perform colocalization of the studies. To do so, we loop over the signal clusters, and store the fitted models in a list called `fit`. ```{r} coloc_fit <- list() nclust <- length(data$beta_hat_a) lengths(data$beta_hat_a) ``` The following code chunk loops over signal clusters: ```{r} options(mc.cores=2) for (j in 1:nclust) { coloc_fit[[j]] <- with(data, fitBetaColoc( beta_hat_a = beta_hat_a[[j]], beta_hat_b = beta_hat_b[[j]], se_a = se_a[[j]], se_b = se_b[[j]], Sigma_a = Sigma_a[[j]], Sigma_b = Sigma_b[[j]] )) } ``` We now create a list containing the posterior mean from the colocalization step, and the original SE values. Note the SE values are now referred to as `sd`. We can extract the list of e.g. `beta_hat_a` using `lapply` and the `[[` extractor as seen below: ```{r} res <- list(beta_hat_a = lapply(coloc_fit, `[[`, "beta_hat_a"), beta_hat_b = lapply(coloc_fit, `[[`, "beta_hat_b"), sd_a = data$se_a, sd_b = data$se_b, alleles = data$alleles) ``` # Colocalization with eCAVIAR (optional alternative) Alternatively, MRLocus can accept colocalization results from eCAVIAR [@eCAVIAR] as input to the slope fitting step. eCAVIAR is run with default options on each nearly-LD-independent signal cluster separately, supplying the LD matrix, and z-scores for study A and B, and `-c 1`, i.e. setting the maximum number of causal SNPs to 1. For eCAVIAR colocalization, the SNP with the largest colocalization posterior probability (CLPP) is selected from each signal cluster, among those SNPs which are valid instruments, i.e. with an absolute value of z-score as large as the value used for PLINK clumping [@plink]. eCAVIAR alleles are flipped such that the alternate allele is the one corresponding to an increase in the measured phenotype in A (e.g. gene expression). # Slope fitting step Following MRLocus colocalization, the following function, `extractForSlope`, extracts one SNP per signal cluster with the largest posterior mean for the coefficient for study A (e.g. eQTL). For eCAVIAR colocalization, a simple R script can be used to select the SNP with largest CLPP per nearly-LD-independent signal cluster, and to pass the estimated coefficients and SEs for A and B as vectors in a named list to `fitSlope` (see man page for details on data input, and the below code chunk for an example of the structure of the input). The selected SNPs and their posterior mean values are plotted on the right panel in blue. We note that it is also possible to perform EM-based clustering [@mclust] to extract more than one SNP per signal cluster, through use of the `niter` argument, although this was not evaluated in the MRLocus manuscript. ```{r extract} res <- extractForSlope(res) str(res) # input to fitSlope() ``` After running `extractForSlope`, we recommend to use the helper function `trimClusters` to ensure that the instruments chosen for slope modeling are nearly LD independent, by specifying an `r2_threshold`. Now we are ready to fit the slope. In our testing we have used 10,000 iterations, though fewer may be sufficient. This step is typically much faster than the colocalization step. ```{r} res <- fitSlope(res, iter=10000) ``` # Normalized allelic spread We have found that a convenient statistic regarding the allelic dispersion `sigma`, which is comparable across loci, is to normalize by a typical value on the y-axis. We therefore normalize by the mediated effect for the mean of instrument effect sizes. This can be computed using the `normalizedAllelicSpread` function for a given fit object: ```{r} normalizedAllelicSpread(res) ``` # Examine MRLocus estimates One of the elements in the results list `res` is the `stanfit` object, which can be explored with functions from the `rstan` package. The posterior mean for the gene-to-trait slope, alpha, is near the simulated value of 0.5. The posterior mean for the dispersion around the slope, sigma, is also estimated. Note that the estimate for sigma has a large 80% credible interval, as we only have 4 signal clusters over which to estimate dispersion. ```{r} suppressPackageStartupMessages(library(rstan)) print(res$stanfit, pars=c("alpha","sigma"), probs=c(.1,.9), digits=3) ``` MRLocus provides a plot that summarizes the estimation of the gene-to-trait effect. The estimated slope and an 80% credible interval are indicated with a solid blue line and dashed blue lines, respectively. The black points and error bars represent the posterior mean for top SNPs from colocalization and the original SE. An 80% band in light blue indicates a region that is expected to contain 80% of the eQTL-GWAS pairs across signal clusters, so reflecting the allelic heterogeneity. If the estimate of $\sigma$ is of comparable size to $\alpha$ times typical eQTL effect sizes (slope $\times$ location of black points on the x-axis), this indicates a high degree of heterogeneity. ```{r mrlocus, fig.height=7} plotMrlocus(res, main="MRLocus gene-to-trait effect estimate") ``` # Prior predictive check Finally, we have a function which helps to assess the strength of the prior for the slope (alpha) and the dispersion around the fitted line (sigma). We generate a table showing the prior SD for these two parameters and the posterior SD from the fitted model. We note that some of the parameters of the prior (SD of `beta`, the instrument effects, and SD of `alpha`, the slope) are informative by design (so ratios > 0.1 are not a problem per se), but it is not desired that the ratio of posterior to prior SD approach 1. We additionally generate two plots, the first plot generates parameters from the prior and the model, showing the possible values for `alpha`, `beta_a` and `beta_b` (the true effect sizes from the two studies). In the second plot, `alpha` is fixed to the prior mean (by default, 0) to better visually identify the simulated values for `sigma`. The blue X's indicate the fitted estimates of `beta_a` and `beta_b` from the colocalization step, and the blue lines indicate the posterior for `alpha`. ```{r priorCheck} priorCheck(res) ``` The default is for the function to return the small summary table, but by setting `type=2`, it will alternatively return the prior predictive draws themselves, with the number of draws specified by `n` (default is 200). # Session info ```{r} sessionInfo() ``` # References