Title: | MRLocus - Mendelian Randomization per locus |
---|---|
Description: | Mendelian Randomization per locus, leveraging eQTL and GWAS summary statistics, for estimation of gene-to-trait effect size and dispersion. |
Authors: | Anqi Zhu [aut], Nana Matoba [aut], Jason Stein [aut], Michael Love [aut, cre] |
Maintainer: | Michael Love <[email protected]> |
License: | GPL (>=2) |
Version: | 0.0.26 |
Built: | 2024-12-25 04:44:36 UTC |
Source: | https://github.com/thelovelab/mrlocus |
Mendelian Randomization per locus, leveraging eQTL and GWAS summary statistics, for estimation of gene-to-trait effect size and dispersion.
The main functions (in order of typical usage) are:
collapseHighCorSNPs
- collapse high correlation SNPs
flipAllelesAndGather
- flip alleles and gather for colocalization
fitBetaColoc
- perform colocalization across signal clusters
extractForSlope
- extract one SNP per signal cluster for MR analysis
fitSlope
- perform MR analysis to estimate gene-to-trait effect
plotMrlocus
- plot estimates from MR analysis
priorCheck
- perform prior predictive checks
Anqi Zhu*, Nana Matoba*, Emmaleigh Wilson, Amanda L. Tapia, Yun Li, Joseph G. Ibrahim, Jason L. Stein, Michael I. Love. MRLocus: identifying causal genes mediating a trait through Bayesian estimation of allelic heterogeneity. (2020) bioRxiv https://doi.org/10.1101/2020.08.14.250720
Stan Development Team (2019). RStan: the R interface to Stan. R package version 2.19.2. https://mc-stan.org
A helper function to collapse sets of highly correlated
SNPs within signal clusters. This is recommended to run
before flipAllelesAndGather
, and before
fitBetaColoc
.
collapseHighCorSNPs( sum_stat, ld_mat, ld_mat2 = NULL, threshold = 0.95, score = NULL, plot = TRUE, snp_id = NULL )
collapseHighCorSNPs( sum_stat, ld_mat, ld_mat2 = NULL, threshold = 0.95, score = NULL, plot = TRUE, snp_id = NULL )
sum_stat |
list of summary statistic tables,
which is a list over signal clusters.
Each element of the list should be a data.frame
describing the eQTL and GWAS summary statistics.
The only column in |
ld_mat |
list of LD matrices across signal clusters |
ld_mat2 |
optional second list of LD matrices
(for different populations). it will be returned
alongside the first |
threshold |
threshold on absolute value of correlation for collapsing, e.g. will collapse if SNPs are more correlated (or anti-correlated) than this amount |
score |
name of a column of sum_stat data.frames with a score, such that collapsing will choose the highest score SNP per collapsed cluster. Otherwise, if set to NULL, the first SNP will be used |
plot |
logical, draw a before/after grid of plots |
snp_id |
name of SNP id column in |
list with subset ld_mat
and sum_stat
lists (and ld_mat2
if provided)
Extracts one or more SNPs from each signal cluster
based on the posterior estimate of the effect size
for A (largest effect size in the positive direction).
After running this function, it is recommended to use
trimClusters
to remove signal clusters
that are too highly correlated.
extractForSlope( res, niter = 0, plot = TRUE, label = "Effect size of", a = "eQTL", b = "GWAS" )
extractForSlope( res, niter = 0, plot = TRUE, label = "Effect size of", a = "eQTL", b = "GWAS" )
res |
list with the following named elements:
|
niter |
number of iterations of EM to run for mclust, if set to 0, only the maximum variant (in terms of A effect size) per signal cluster is output. Default is to not run clustering, but to take the SNP with the largest effect size in A (in the positive direction) |
plot |
logical, draw a before after of which variants will be included for slope estimation |
label |
what preceeds |
a |
name of A experiment |
b |
name of B experiment |
list of vectors of the first four arguments,
collapsed now across signal clusters, representing
variants with positive effect on A. So the null variants
have been removed (and any variants per cluster that
indicated a negative effect on A). If alleles
data.frames were included in the input, they will
also be passed through as a single data.frame with the
selected SNPs per signal cluster
Implements the MRLocus colocalization step, in which summary
statistics from A and B studies are used in a generative
model, with a horseshoe prior on the latent true effect sizes.
Posterior effect sizes and posterior SD are returned,
although in pratice we use the original SE in the subsequent
fitSlope
model.
See Supplementary Methods of the MRLocus manuscript.
fitBetaColoc(beta_hat_a, beta_hat_b, se_a, se_b, Sigma_a, Sigma_b, ...)
fitBetaColoc(beta_hat_a, beta_hat_b, se_a, se_b, Sigma_a, Sigma_b, ...)
beta_hat_a |
vector of estimated coefficients for A |
beta_hat_b |
" " for B |
se_a |
vector of standard errors for beta_hat_a |
se_b |
" " for beta_hat_b |
Sigma_a |
correlation matrix of SNPs for A |
Sigma_b |
" " for B (this could be different for different LD matrix) |
... |
further arguments passed to rstan::sampling |
a list with the following elements:
stanfit
object
posterior means for estimated coefficients for A and B
posterior standard deviations for A and B
scaling factors for A and B
Two important notes: (1) in MRLocus manuscript, original
SE are used instead of posterior SD in the slope fitting step,
(2) the posterior means and SD for estimated coefficients
are appropriately scaled, while the results from the
stanfit
object are not scaled. In order to scale the
results from the stanfit object, scale_a
and scale_b
should be divided out from both coefficients and SDs
(see Supplementary Methods).
Anqi Zhu*, Nana Matoba*, Emma P. Wilson, Amanda L. Tapia, Yun Li, Joseph G. Ibrahim, Jason L. Stein, Michael I. Love. MRLocus: identifying causal genes mediating a trait through Bayesian estimation of allelic heterogeneity. (2021) PLOS Genetics https://doi.org/10.1371/journal.pgen.1009455
Implements the MRLocus slope fitting step, in which the estimated
coefficients and their original SE are used to determine
the mediation slope (alpha
), and the dispersion of individual
signal clusters around the slope (sigma
). This function
follows the colocalization step fitBetaColoc
and extractForSlope
.
The output fitSlope
can be visualized with
plotMrlocus
. For details on the model,
see Supplementary Methods of the MRLocus manuscript.
See vignette for example of model interpretation.
fitSlope( res, sd_beta = NULL, mu_alpha = NULL, sd_alpha = NULL, sd_sigma = NULL, ... )
fitSlope( res, sd_beta = NULL, mu_alpha = NULL, sd_alpha = NULL, sd_sigma = NULL, ... )
res |
list with the following named elements:
|
sd_beta |
prior SD for beta A (default value will be derived from data) |
mu_alpha |
prior mean for |
sd_alpha |
prior SD for |
sd_sigma |
prior SD for |
... |
further arguments passed to |
Note that since v0.0.26, the default prior mean for alpha is 0, whereas in the paper it was derived from the data.
Note that if summary statistics for only one SNP are provided a warning will be printed (this is not a recommended use of MRLocus) and a parametric simulation is used to estimate the slope, instead of the Bayesian model.
list with the following elements: stanfit
object,
original estimated coefficients and standard deviations,
as well as the alleles
data.frame (if it was provided)
Anqi Zhu*, Nana Matoba*, Emma P. Wilson, Amanda L. Tapia, Yun Li, Joseph G. Ibrahim, Jason L. Stein, Michael I. Love. MRLocus: identifying causal genes mediating a trait through Bayesian estimation of allelic heterogeneity. (2021) PLOS Genetics https://doi.org/10.1371/journal.pgen.1009455
A helper function to flip alleles from eQTL and GWAS
datasets, such that they agree on the effect allele,
that the SNPs in a signal cluster are in postive
correlation with the index SNP (eQTL), and that the
effect allele is coded such that it is the expression
increasing allele.
This is recommended to run after
collapseHighCorSNPs
, and before
fitBetaColoc
.
flipAllelesAndGather( sum_stat, ld_mat, ld_mat2 = NULL, a, b, ref, eff, beta, se, a2_plink, a2_plink_mat2 = NULL, snp_id, sep, ab_last = TRUE, alleles_same = FALSE, plot = TRUE )
flipAllelesAndGather( sum_stat, ld_mat, ld_mat2 = NULL, a, b, ref, eff, beta, se, a2_plink, a2_plink_mat2 = NULL, snp_id, sep, ab_last = TRUE, alleles_same = FALSE, plot = TRUE )
sum_stat |
list of summary statistic tables.
A list over signal clusters, where each element is a
data.frame with summary statistics from eQTL and GWAS
datasets. The names of the columns are specified by
arguments below (e.g. |
ld_mat |
list of LD matrices |
ld_mat2 |
optional second list of LD matrices
(for different populations). it will be returned
alongside the first |
a |
name of A in columns of |
b |
name of B ("GWAS") |
ref |
name of reference allele |
eff |
name of effect allele |
beta |
name of estimated coefficient |
se |
name of standard error |
a2_plink |
name of the column representing the a2 allele (reference allele) according to plink. the default for plink v1.9 and earlier was to reset a2 to the major allele, unless an optional flag was used. in plink v2.0 and onward, one should check to see which allele is used as reference for calculating the LD matrix. If plink was not used, this argument should just point to the reference allele that was used for LD calculation |
a2_plink_mat2 |
name of the column representing
the a2 allele for the second LD matrix, |
snp_id |
name of SNP id |
sep |
character separator in column names that involve A/B |
ab_last |
logical, A/B descriptor is last in column names (e.g. "beta_eqtl", "se_eqtl")) |
alleles_same |
logical, A/B/LD matrix alleles are identical |
plot |
logical, draw a scatterplot of the flipped betas |
list with estimated coefficients, standard errors, LD matrix, and alleles data.frame
Make simple simulated summary data
makeSimDataForMrlocus( nsnp = c(7:10), idx = 5, alpha = 0.5, sigma = 0.05, betas = 1:4, se = 0.25, n_mult = 1 )
makeSimDataForMrlocus( nsnp = c(7:10), idx = 5, alpha = 0.5, sigma = 0.05, betas = 1:4, se = 0.25, n_mult = 1 )
nsnp |
number of SNPs per signal cluster |
idx |
the causal SNP (same per cluster for simplicity) |
alpha |
the true slope of B coefficients over A coefficients |
sigma |
the SD of true B coefficients around the conditional values given true A coefficients |
betas |
the true A coefficients |
se |
the standard errors for betas |
n_mult |
how many more samples the B study has |
a list of beta_hat_a
, beta_hat_b
,
se_a
, se_b
,
Sigma_a
, Sigma_b
(themselves lists), and alleles
(a list of data.frames each with
id
, ref
, eff
for the SNP id,
reference allele, and effect allele).
A useful statistic is to normalize the estimate of
sigma from estimateSlope
, the dispersion,
with respect to typical mediated effect sizes.
This helps in particular to compare sigma across
different GWAS traits, which may have different
scale.
normalizedAllelicSpread(fit)
normalizedAllelicSpread(fit)
fit |
a list output by |
We define the normalized allelic spread as an estimate of sigma
divided by something called the "mean mediated effect".
The mean mediated effect is defined as the mean
QTL effect size multiplied by the estimate of the slope.
So in the plotMrlocus
plot, the mean mediated
effect is found by taking the mean among the
spread of points on the x-axis, then going up to the
linear trend. This function outputs the normalized
allelic spread which is again: sigma / mean mediated
effect.
the normalized allelic spread, a numeric
Plot initial estimates over signal clusters
plotInitEstimates(x, label = "Effect size of", a = "eQTL", b = "GWAS")
plotInitEstimates(x, label = "Effect size of", a = "eQTL", b = "GWAS")
x |
list of signal clusters data with |
label |
what preceeds |
a |
name of A experiment |
b |
name of B experiment |
Plot estimates from MRLocus slope fitting step
plotMrlocus( res, q = c(0.1, 0.9), sigma_mult = 1.28, label = "Effect size of", a = "eQTL", b = "GWAS", xlim = NULL, ylim = NULL, legend = TRUE, digits = 3, col_slope = "blue", col_band = rgb(0, 0, 1, 0.1), col_dashed = rgb(0, 0, 1, 0.5), ... )
plotMrlocus( res, q = c(0.1, 0.9), sigma_mult = 1.28, label = "Effect size of", a = "eQTL", b = "GWAS", xlim = NULL, ylim = NULL, legend = TRUE, digits = 3, col_slope = "blue", col_band = rgb(0, 0, 1, 0.1), col_dashed = rgb(0, 0, 1, 0.5), ... )
res |
the output from |
q |
the quantiles of the posterior to use for drawing the uncertainty on the slope. The default is an 80 percent interval |
sigma_mult |
multiplier on estimate of sigma
for drawing the dispersion band
(e.g. |
label |
what preceeds |
a |
name of A experiment |
b |
name of B experiment |
xlim |
xlim (if NULL will be set automatically) |
ylim |
ylim (if NULL will be set automatically) |
legend |
logical, whether to show a legend |
digits |
number of digits to show in legend |
col_slope |
the color of the slope (alpha) |
col_band |
the color of the band |
col_dashed |
the color of the dashed lines |
... |
arguments passed to |
This function provides some basic checks on the strength of the prior in the MRLocus slope fitting Bayesian model. It is not desired that the prior overly influences the posterior inference.
priorCheck(res, n = 200, plot = TRUE, type = 1)
priorCheck(res, n = 200, plot = TRUE, type = 1)
res |
output of |
n |
integer, for the plot how many data points to simulate |
plot |
logical, whether to draw the plots |
type |
integer, return type. by default ( |
The posterior-over-prior SD ratio is calculated
and returned in a table, and two plots are made
that show parameters drawn from the estimated
priors (in MRLocus, two priors are estimated from the data -
the SD of beta, the instrument effects, and the SD of alpha, the slope).
Alternatively, the prior predictive draws themselves can
be returned instead of the table (by setting type=2
).
If the posterior-over-prior SD ratio is close to 1 for either alpha or sigma, this indicates undesirable influence of the prior on the posterior inference. For comparison, some consider a posterior-prior SD ratio of 0.1 or higher to be described as an 'informative prior' (from Stan wiki on prior choice recommendations). We note that an 'informative prior' alone is not problematic for MRLocus, and the prior estimation steps have been designed to be informative as to reasonable values for some of the prior parameters of alpha and sigma.
The plots show parameters generated
from the prior and the model. The simulated true values of
beta_a
and beta_b
are drawn as black
circles (summary statistics would then be drawn from
these according to the reported SEs, but this step
of the model is omitted in this plot).
The two plots differ in that the second plot uses fixed
alpha (fixed to the prior mean) instead of drawing it from the model
(so that the prior for sigma can better be visualized).
The fitted estimates of beta_a
and beta_b
from the colocalization step are shown as blue X's.
One exception where parameters are not drawn from the prior is:
beta_a
values are instead drawn as uniform
between 0 and 1.1x the maximum value of beta_hat_a
from the colocalization step (for ease of visualization).
a data.frame with information about prior and posterior SD for alpha and sigma, and two plots are generated (see Details)
This function identifies the clusters to remove
such that all remaining clusters have pairwise
r2 below a given threshold. It is assumed the r2
matrix is ordered such that the first row/column
corresponds to the most significant signal cluster,
and so on.
trimClusters either removes
clusters correlated with the most significant
cluster (first) and proceeds downwards,
or removes clusters with correlation to more
significant clusters and proceeds upwards.
Moving downward tends to preserve more clusters.
The function stops once
the pairwise r2 are all below the threshold.
It is recommended to run this function after
extractForSlope
.
trimClusters(r2, r2_threshold, direction = "down")
trimClusters(r2, r2_threshold, direction = "down")
r2 |
the matrix of r2 values |
r2_threshold |
the threshold on r2 |
direction |
to proceed from the first
cluster down ( |
a numeric vector (possibly length 0) of the signal clusters that should be trimmed/removed