---
title: "MR-SimSS: The algorithm"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{MR-SimSS: The algorithm}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
### Introduction
A typical Mendelian randomization (MR) analysis is concerned with **estimating the causal effect, denoted by $\beta$, of a modifiable exposure, $X$, on a health-related outcome, $Y$**. When variables which confound the exposure-outcome relationship have not been observed, i.e. unmeasured confounders represented by $U$ are present, an MR study uses $n$ genetic variants, $G_1, G_2, \dots, G_n$, in an attempt to obtain an unbiased causal effect estimate.

The image below displays a causal diagram representing the assumed relationship between genetic variant $G_j$, exposure $X$ and outcome $Y$ in an MR analysis. $U$ represents unobserved confounders, while $\gamma_j$ is the effect of the variant on the exposure and $\beta$ is the true exposure-outcome causal effect. For a particular individual, $G_j$ typically takes one of the values 0, 1 or 2, i.e. $G_j \in \{0,1,2\}$, depending on the minor allele count or genotype of the individual at genetic variant $j$.
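
The relationships described above can be made concrete with a small simulation. The sketch below is illustrative only and is not part of MR-SimSS: the sample size, minor allele frequency, effect sizes and linear data-generating model are all assumed values chosen for the example. It compares a naive regression of $Y$ on $X$, which is biased by $U$, with a simple single-variant ratio estimate, $\widehat{\Gamma_j}/\widehat{\gamma_j}$, a standard MR estimator used here purely for illustration.

```r
set.seed(1)

# Assumed values for illustration only
N       <- 10000  # number of participants
maf     <- 0.3    # minor allele frequency of variant j
beta    <- 0.5    # true exposure-outcome causal effect
gamma_j <- 0.4    # true effect of variant j on the exposure

# Assumed linear data-generating model matching the causal diagram
G_j <- rbinom(N, size = 2, prob = maf)     # minor allele count in {0, 1, 2}
U   <- rnorm(N)                            # unmeasured confounder
X   <- gamma_j * G_j + U + rnorm(N)        # exposure depends on G_j and U
Y   <- beta * X + U + rnorm(N)             # outcome depends on X and U

# Naive regression of Y on X ignores U and is therefore confounded
beta_naive <- unname(coef(lm(Y ~ X))["X"])

# Variant-exposure and variant-outcome association estimates
gamma_hat <- unname(coef(lm(X ~ G_j))["G_j"])
Gamma_hat <- unname(coef(lm(Y ~ G_j))["G_j"])

# Single-variant ratio estimate of the causal effect
beta_ratio <- Gamma_hat / gamma_hat

c(naive = beta_naive, ratio = beta_ratio, truth = beta)
```

Under these assumptions the naive estimate is pulled away from $\beta$ by the confounder, whereas the ratio estimate should lie close to the true value, because $G_j$ affects $Y$ only through $X$ and is independent of $U$.
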
**Symbol** | **What does it represent?** |
---|---|
$\beta$ | True causal effect of modifiable exposure on health-related outcome |
$\hat\beta$ | Estimated exposure-outcome causal effect |
$X$ | Exposure |
$Y$ | Outcome |
$U$ | Unmeasured variables which confound the exposure-outcome relationship |
$G_j$ | Genotype of genetic variant $j$, $G_j \in \{0,1,2\}$ |
$\gamma_j$ | True effect of genetic variant $j$ on the exposure |
$\hat\gamma_j$ | Estimated association between genetic variant $j$ and the exposure |
$\widehat{\gamma_{*_j}}$ | Estimated association between genetic variant $j$ and the exposure in fraction $*$ of the full dataset |
$\sigma_{X_j}$ | True standard error of the variant-exposure association estimate, $\widehat{\gamma_j}$, for variant $j$ |
$\widehat{\sigma_{X_j}}$ | Estimated standard error of the variant-exposure association estimate, $\widehat{\gamma_j}$, for variant $j$ |
$\widehat{\sigma_{X_{*,j}}}$ | Estimated standard error of the variant-exposure association estimate, $\widehat{\gamma_{*_j}}$, for variant $j$ in fraction $*$ of the full dataset |
$\widehat{\Gamma_j}$ | Estimated association between genetic variant $j$ and the outcome |
$\widehat{\Gamma_{*_j}}$ | Estimated association between genetic variant $j$ and the outcome in fraction $*$ of the full dataset |
$\sigma_{Y_j}$ | True standard error of the variant-outcome association estimate, $\widehat{\Gamma_j}$, for variant $j$ |
$\widehat{\sigma_{Y_j}}$ | Estimated standard error of the variant-outcome association estimate, $\widehat{\Gamma_j}$, for variant $j$ |
$\widehat{\sigma_{Y_{*,j}}}$ | Estimated standard error of the variant-outcome association estimate, $\widehat{\Gamma_{*_j}}$, for variant $j$ in fraction $*$ of the full dataset |
$n$ | Number of measured genetic variants |
$N$ | Number of participants in the full dataset, i.e. number of individuals with genotype information and exposure and/or outcome measurements |
$\{\mathbf{G}, \mathbf{X},\mathbf{Y}\}$ | Full individual-level dataset in which $\mathbf{G}$ represents the matrix of genotype information for each genetic variant and participant, i.e. $\mathbf{G} = \{\mathbf{G}_1, \dots, \mathbf{G}_n\}\in \mathbb{R}^{N\times n}$, where $\mathbf{G}_j = [G_{1j}, \dots, G_{Nj}]^T$ contains the genotype information for the $j$th variant, $\mathbf{X} = [X_1, \dots,X_N]^T$ is the vector of measured exposures, and $\mathbf{Y} = [Y_1, \dots, Y_N]^T$ is the vector of outcome values |
$\{\mathbf{G}_{*}, \mathbf{X}_{*},\mathbf{Y}_{*}\}$ | Individual-level dataset, similar to that above, but containing data only for those individuals in fraction $*$ of the full dataset |
$N_\text{overlap}$ | Number of overlapping individuals, i.e. number of participants who have measured values for both outcome and exposure |
$\pi$ | First fraction into which the full dataset is split; fraction $\pi$ is used to select genetic instruments |
$p$ | Second splitting fraction; the remaining fraction $1-\pi$ of the dataset is split into fractions $p$ and $1-p$, which are used to estimate the variant-exposure and variant-outcome associations |
$\rho$ | Correlation between the exposure and the outcome, i.e. $\rho = \text{cor}(X,Y)$ |
$z_{X_{\pi,j}}$ | $z$-statistic of variant $j$ in fraction $\pi$ of the dataset, i.e. $z_{X_{\pi,j}} = \frac{\widehat{\gamma_{\pi_j}}}{\widehat{\sigma_{X_{\pi,j}}}}$; used to select genetic instruments |
$n_\text{sig}$ | Number of genetic instruments which have been selected using the (simulated) summary statistics from the first fraction, $\pi$, of the dataset |
$N_\text{iter}$ | Number of iterations of sample splitting performed by MR-SimSS |
$\hat\beta^{(k)}$ | Causal effect estimate produced by MR-SimSS on iteration $k$ |
$\overline{\hat\beta}$ | Final estimate for the exposure-outcome causal effect supplied by MR-SimSS, i.e. average of the estimates generated at each iteration |
$\text{se}(\overline{\hat\beta})$ | Standard error of MR-SimSS causal effect estimate |
$\text{se}\left(\hat\beta^{(k)}\right)$ | Standard error of the causal effect estimate at iteration $k$, generally supplied by the chosen summary-level MR method |
$\lambda$ | Shorthand for $\frac{N_{\text{overlap}}\rho}{\sqrt{N_X N_Y}}$, where $N_X$ is the number of individuals in the full dataset with measured exposures, i.e. the exposure GWAS sample size, and $N_Y$ is the number of individuals in the full dataset with measured outcomes, i.e. the outcome GWAS sample size |
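
Several of the quantities defined in the table are simple functions of summary statistics, and a short sketch may help fix the notation. The code below is illustrative only: `compute_lambda` is a hypothetical helper rather than a function of the package, the summary statistics and per-iteration estimates are made-up numbers, and the genome-wide significance threshold of $5 \times 10^{-8}$ is an assumed selection rule.

```r
# Hypothetical helper: lambda = N_overlap * rho / sqrt(N_X * N_Y)
compute_lambda <- function(N_overlap, rho, N_X, N_Y) {
  N_overlap * rho / sqrt(N_X * N_Y)
}

# Made-up sample sizes and exposure-outcome correlation
lambda <- compute_lambda(N_overlap = 5000, rho = 0.3, N_X = 10000, N_Y = 10000)

# Made-up variant-exposure summary statistics in fraction pi for four variants
gamma_hat_pi <- c(0.050, 0.002, -0.041, 0.010)  # association estimates
sigma_hat_pi <- c(0.008, 0.009, 0.007, 0.009)   # their standard errors

# z-statistic of each variant in fraction pi
z_pi <- gamma_hat_pi / sigma_hat_pi

# Assumed selection rule: two-sided p-value below 5e-8
threshold <- 5e-8
p_values  <- 2 * pnorm(-abs(z_pi))
selected  <- which(p_values < threshold)  # indices of selected instruments
n_sig     <- length(selected)             # number of selected instruments

# Final estimate as the average of made-up per-iteration estimates
beta_hat_k   <- c(0.48, 0.52, 0.50)
beta_hat_bar <- mean(beta_hat_k)
```

With complete sample overlap and equal sample sizes, i.e. $N_\text{overlap} = N_X = N_Y$, the expression for $\lambda$ reduces to $\rho$; with no overlap it is 0.
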