---
title: "MRLocus - estimation of gene-to-trait effects"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
author: "Anqi Zhu, Nana Matoba, Jason Stein, Michael Love"
output: 
  rmarkdown::html_document:
    self_contained: false
    highlight: pygments
abstract: |
  MRLocus provides estimation of gene-to-trait effects from eQTL and
  GWAS summary statistics, for loci with allelic heterogeneity:
  multiple nearly-LD-independent signal clusters representing distinct
  eQTLs. Evidence is aggregated across signal clusters for mediation
  of the downstream GWAS trait through the gene.
  MRLocus package version: `r packageVersion("mrlocus")`
bibliography: library.bib
vignette: |
  %\VignetteIndexEntry{MRLocus - estimation of gene-to-trait effects}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, echo=FALSE, results="hide"}
knitr::opts_chunk$set(cache=FALSE, message=FALSE, warning=FALSE)
```

# Introduction

MRLocus provides Bayesian analysis of gene-to-trait mediation effects,
making use of the Stan and RStan software packages for inference
[@stan;@rstan].

The authors of the associated [MRLocus manuscript](https://doi.org/10.1371/journal.pgen.1009455)
[@mrlocus] are:

> Anqi Zhu\*, Nana Matoba\*, Emma P. Wilson, Amanda L. Tapia,
> Yun Li, Joseph G. Ibrahim, Jason L. Stein, Michael I. Love

\* Both authors contributed equally to this work.

# Data input and preprocessing

In this vignette we demonstrate use of MRLocus on simulated summary
statistics. The simulated data in this vignette does not include
the steps of importing summary statistics and LD matrices from
external files. For that, we recommend looking at the R code we used
for analysis of the 
[public datasets](https://github.com/mikelove/mrlocusPaper/blob/master/public/mrlocus.R)
in the manuscript, and see also the basic data input information in
the README (also shown on the
MRLocus [homepage](https://mikelove.github.io/mrlocus)). 

Briefly, three important helper functions for import and preprocessing
that are provided in this package are `collapseHighCorSNPs`,
`flipAllelesAndGather`, and `trimClusters`, for importing summary
statistics across nearly-LD-independent signal clusters (sets of SNPs
that are non-overlapping and have low correlation in a reference
population). We use the phrase "LD-independent signal cluster" as in
the PTWAS software and paper [@ptwas]. 

It is strongly recommended to use these three functions for data import,
in particular as the first function will reduce the number of SNPs per
cluster and help deal with identifiability issues during the Bayesian
colocalization step. As described in the manuscript, MRLocus does not
attempt to resolve putative causal SNPs among sets of SNPs with
correlation $> 0.95$, but instead it collapses this redundancy using
`collapseHighCorSNPs`. The second function ensures that the alleles
are flipped properly across eQTL and GWAS datasets (the function will
handle this to avoid manual processing by the user), such that the
alleles are coded in such a way that the downstream statistical
modeling functions will expect. The second function ensures that both
studies refer to the same allele, which is the expression-increasing
allele for an index SNP within each nearly-LD-independent signal cluster.
Finally, the last function is used to ensure that the candidate
instruments have $r^2$ values below a specified threshold (so that
they are satisfying the "nearly-LD-independent" description).

We recommend using PLINK [@plink] for generation of signal clusters
(see Methods section of MRLocus manuscript). We have used raw p-value
thresholds from eQTL in order to cull lists of signal clusters
(instruments) which have an effect on the putative mediator (gene
expression). Note that we do not recommend running MRLocus on loci
with only one signal cluster, i.e. loci lacking allelic heterogeneity.
The `fitSlope` function will provide a warning if such input data
is provided. The estimation of the slope is impaired with a single
signal cluster, and estimation of the dispersion across signal
clusters is not possible.

We have not yet implemented, but plan on providing an import function
from the [TwoSampleMR](https://github.com/MRCIEU/TwoSampleMR) package.

Un-evaluated example code for the two preprocessing functions follows.
In the first function the arguments are:

* `sum_stat` - list of summary statistics over signal clusters
* `ld_mat` - list of LD matrices over signal clusters
* optionally, `ld_mat2` can also be provided for analyses
  with two populations

```{r eval=FALSE}
# (un-evaluated chunk)
data1 <- collapseHighCorSNPs(sum_stat, ld_mat)
```
The second function takes the output of the first function, and
additionally a number of column names. E.g. this function call 
implies that we have column names `ref_eqtl`, `effect_eqtl`,
`beta_eqtl`, `se_eqtl`, etc. See the function man page for more
details. If the column names are of the form `eqtl_beta`, `eqtl_se`,
etc. such that the A/B study specification is first in the column
names, the user should then set `ab_last=FALSE`.

```{r eval=FALSE}
# (un-evaluated chunk)
data2 <- flipAllelesAndGather(data1$sum_stat, data1$ld_mat,
                              a="eqtl", b="gwas",
                              ref="ref", eff="effect",
                              beta="beta", se="se",
                              a2_plink="major_plink",
                              snp_id="SNP", sep="_")
```

In order to inspect whether allele flipping worked as expected, this
function will by default make a plot of the initial estimates, or you
can manually run `plotInitEstimates` on the output of this
function. In the next section an evaluation of this plot is
discussed.

The third helper function, `trimClusters`, can be run after
`extractForSlope` (this occurs after colocalization).

# QC of input pre-colocalization

The following simulated summary statistics are generated from a
multivariate normal distribution, similar to the model used in
eCAVIAR [@eCAVIAR]. In contrast, the simulations in the MRLocus
manuscript make use of
the [twas_sim](https://github.com/mancusolab/twas_sim) framework 
[@twassim].

```{r}
library(mrlocus)
set.seed(1)
data <- makeSimDataForMrlocus()
```

This simulated data consists of a list of lists. The elements are: 
`beta_hat_a`, `beta_hat_b`, `se_a`, `se_b`, `Sigma_a`, `Sigma_b`, and
`alleles`, which provide the estimated coefficients, standard errors
(SE), and LD matrices for the two studies, A and B (typically, eQTL and
GWAS), as well as a data.frame with allele information (SNP ID,
reference allele, and effect allele).
The LD matrices can be identical if distinct matrices for each study
are not available. Each element itself should be a list of
vectors/matrices/data.frames across nearly-LD-independent signal clusters.

We can make a simple plot of the `beta_hat`'s, estimated coefficients
from A and B. The coefficients are numbered by the SNP order within
each signal cluster.

```{r initests, fig.height=7}
plotInitEstimates(data)
```

**Evaluating this plot:** 
The input data for colocalization, e.g. after running
`collapseHighCorSNPs` and `flipAllelesAndGather` on real eQTL and GWAS
summary statistics within PLINK clumps, is typically not more than
20-30 SNPs per clusters, and with a certain pattern of effect sizes
within each cluster. You would not expect to have SNPs *within a
cluster* having very discordant effects size in the x- or y-direction,
and quality control of the allele coding is important for the
colocalization step to give reasonable results. It is fine however for
their to be discordant effect sizes in the y-direction *across
clusters*, this is important evidence against a consistent mediation
effect.

# Colocalization step with MRLocus

The first step in MRLocus is to perform colocalization of the
studies. To do so, we loop over the signal clusters, and store the
fitted models in a list called `fit`.

```{r}
coloc_fit <- list()
nclust <- length(data$beta_hat_a)
lengths(data$beta_hat_a)
```

The following code chunk loops over signal clusters:

```{r}
options(mc.cores=2)
for (j in 1:nclust) {
  coloc_fit[[j]] <- with(data,
                         fitBetaColoc(
                           beta_hat_a = beta_hat_a[[j]],
                           beta_hat_b = beta_hat_b[[j]],
                           se_a = se_a[[j]],
                           se_b = se_b[[j]],
                           Sigma_a = Sigma_a[[j]],
                           Sigma_b = Sigma_b[[j]]
                         ))
}
```

We now create a list containing the posterior mean from the
colocalization step, and the original SE values. Note the SE values
are now referred to as `sd`. We can extract the list of
e.g. `beta_hat_a` using `lapply` and the `[[` extractor as seen below: 

```{r}
res <- list(beta_hat_a = lapply(coloc_fit, `[[`, "beta_hat_a"),
            beta_hat_b = lapply(coloc_fit, `[[`, "beta_hat_b"),
            sd_a = data$se_a,
            sd_b = data$se_b,
            alleles = data$alleles)
```

# Colocalization with eCAVIAR (optional alternative)

Alternatively, MRLocus can accept colocalization results from 
eCAVIAR [@eCAVIAR] as input to the slope fitting step. 
eCAVIAR is run with default options on each nearly-LD-independent
signal cluster separately, supplying the LD matrix,
and z-scores for study A and B, and `-c 1`, i.e. setting the
maximum number of causal SNPs to 1.

For eCAVIAR colocalization, the SNP with the largest colocalization
posterior probability (CLPP) is selected from each signal cluster,
among those SNPs which are valid instruments, i.e. with an absolute
value of z-score as large as the value used for PLINK clumping
[@plink]. eCAVIAR alleles are flipped such that the alternate
allele is the one corresponding to an increase in the measured
phenotype in A (e.g. gene expression).

# Slope fitting step

Following MRLocus colocalization, the following function,
`extractForSlope`, extracts one SNP per signal cluster with the
largest posterior mean for the coefficient for study A (e.g. eQTL). 
For eCAVIAR colocalization, a simple R script can be used to select
the SNP with largest CLPP per nearly-LD-independent signal cluster,
and to pass the estimated coefficients and SEs for A and B as vectors
in a named list to `fitSlope` (see man page for details on data
input, and the below code chunk for an example of the structure of the
input).

The selected SNPs and their posterior mean values are
plotted on the right panel in blue. We note that it is also possible
to perform EM-based clustering [@mclust] to extract more than one SNP
per signal cluster, through use of the `niter` argument, although this
was not evaluated in the MRLocus manuscript.

```{r extract}
res <- extractForSlope(res)
str(res) # input to fitSlope()
```

After running `extractForSlope`, we recommend to use the helper
function `trimClusters` to ensure that the instruments chosen for
slope modeling are nearly LD independent, by specifying an
`r2_threshold`.

Now we are ready to fit the slope. In our testing we have used 10,000
iterations, though fewer may be sufficient. This step is typically
much faster than the colocalization step.

```{r}
res <- fitSlope(res, iter=10000)
```

# Normalized allelic spread

We have found that a convenient statistic regarding the allelic
dispersion `sigma`, which is comparable across loci, is to normalize
by a typical value on the y-axis. We therefore normalize by the
mediated effect for the mean of instrument effect sizes. This can be
computed using the `normalizedAllelicSpread` function for a given fit
object:

```{r}
normalizedAllelicSpread(res)
```

# Examine MRLocus estimates

One of the elements in the results list `res` is the `stanfit` object,
which can be explored with functions from the `rstan` package.

The posterior mean for the gene-to-trait slope, alpha, is near the
simulated value of 0.5. The posterior mean for the dispersion around
the slope, sigma, is also estimated. Note that the estimate for sigma
has a large 80% credible interval, as we only have 4 signal clusters
over which to estimate dispersion.

```{r}
suppressPackageStartupMessages(library(rstan))
print(res$stanfit, pars=c("alpha","sigma"), probs=c(.1,.9), digits=3)
```

MRLocus provides a plot that summarizes the estimation of the
gene-to-trait effect. The estimated slope and an 80% credible interval
are indicated with a solid blue line and dashed blue lines,
respectively. The black points and error bars represent the posterior
mean for top SNPs from colocalization and the original SE.
An 80% band in light blue indicates a region that is
expected to contain 80% of the eQTL-GWAS pairs across signal
clusters, so reflecting the allelic heterogeneity. If the estimate of
$\sigma$ is of comparable size to $\alpha$ times typical eQTL effect
sizes (slope $\times$ location of black points on the x-axis), this
indicates a high degree of heterogeneity.

```{r mrlocus, fig.height=7}
plotMrlocus(res, main="MRLocus gene-to-trait effect estimate")
```

# Prior predictive check

Finally, we have a function which helps to assess the strength of the
prior for the slope (alpha) and the dispersion around the fitted line
(sigma). We generate a table showing the prior SD for these two
parameters and the posterior SD from the fitted model. We note that
some of the parameters of the prior (SD of `beta`, the instrument
effects, and SD of `alpha`, the slope) are informative by design (so
ratios > 0.1 are not a problem per se), but it is not desired that the
ratio of posterior to prior SD approach 1. 

We additionally generate two plots, the first plot generates parameters
from the prior and the model, showing the possible values for `alpha`,
`beta_a` and `beta_b` (the true effect sizes from the two studies). In
the second plot, `alpha` is fixed to the prior mean (by default, 0) to
better visually identify the simulated values for `sigma`. The blue
X's indicate the fitted estimates of `beta_a` and `beta_b` from the
colocalization step, and the blue lines indicate the posterior for
`alpha`.

```{r priorCheck}
priorCheck(res)
```

The default is for the function to return the small summary table,
but by setting `type=2`, it will alternatively return the prior
predictive draws themselves, with the number of draws specified by
`n` (default is 200).

# Session info

```{r}
sessionInfo()
```

# References