--- title: "MR-SimSS: The algorithm" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{MR-SimSS: The algorithm} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ### Introduction A typical MR analysis is concerned with **estimating the causal effect, denoted by $\beta$, of a modifiable exposure, $X$ on a health-related outcome, $Y$**. When variables which confound the exposure-outcome relationship have not been observed, i.e. unmeasured confounders represented by $U$ are present, an MR study uses $n$ genetic variants, $G_1, G_2, \dots, G_n$, in an attempt to attain an unbiased causal effect estimate. The image below displays a causal diagram representing the assumed relationship between genetic variant $G_j$, exposure $X$ and outcome $Y$ in an MR analysis. $U$ represents unobserved confounders, while $\gamma_j$ is the effect of the variant on the exposure and $\beta$ is the true exposure-outcome causal effect. For a particular individual, $G_j$ typically takes values 0, 1, or 2, $G_j \in \{0,1,2\}$, depending on the minor allele count or genotype of the individual at genetic variant $j$.


[$\star$]{style="color: darkblue;"} **Note:** A table is included at the bottom of the page ([here](https://amandaforde.github.io/mr.simss/articles/MR-SimSS-algorithm.html#notation)) detailing all notation used in this vignette. Here, the focus is on **summary-level MR**, a form of MR in which the analysis is conducted with summary statistics that have been generated using two large non-overlapping, or partially overlapping, exposure and outcome data sets. These summary statistics typically take the form of variant-exposure association estimates, variant-outcome association estimates and their corresponding standard errors, i.e. $\left( \widehat{\gamma_j}, \widehat{\sigma_{X_j}}\right)$ and $\left( \widehat{\Gamma_j}, \widehat{\sigma_{Y_j}}\right)$ for each genetic variant $j = 1, \dots, n$. To be more specific, the exposure data is used to regress $X$ on each $G_j$, $j = 1, \dots, n$, independently, yielding regression coefficients, $\widehat{\gamma_j}$ and standard errors, $\widehat{\sigma_{X_j}}$ in the process. In a similar manner, the regression coefficients, $\widehat{\Gamma_j}$ and standard errors, $\widehat{\sigma_{Y_j}}$ are generated by regressing $Y$ on each $G_j$ separately in the outcome data set. [$\star$]{style="color: darkblue;"} **Potential sample overlap is permitted** between the two data sets and therefore, it is not assumed that $\widehat{\gamma_j}$ and $\widehat{\Gamma_j}$ are independent for each genetic variant $j$, i.e. it is possible that $\text{cov}\left( \widehat{\gamma_j}, \widehat{\Gamma_j} \right) \neq 0$. ### Intuition for MR-SimSS framework The intuition for MR-SimSS stems from a solution to *Winner’s Curse* that could be implemented when individual-level data is available. #### Individual-level: In this setting, we first suppose that we have individual-level data, i.e. genotype information, exposure measurements and/or outcome measurements, for a total of $N$ participants of the form $\{\textbf{G}, \textbf{X}, \textbf{Y}\}$. Here, $\textbf{G}$ represents the matrix of genotype information for each genetic variant and participant, i.e. $\textbf{G} = \{ \textbf{G}_1, \dots, \textbf{G}_{N} \} \in \mathbb{R}^{N \times n}$, where $\textbf{G}_{j} = \left[ G_{{1_j}}, \dots,G_{N_j}\right]^T$ contains the genotype information for the $j$th variant, and $\textbf{X} = \left[ X_1, \dots, X_{N} \right]^T$ is the vector of measured exposures. Similarly, $\textbf{Y} = \left[ Y_1, \dots, Y_{N} \right]^T$ is the vector of outcome values. It is possible that the vectors $\mathbf{X}$ and $\mathbf{Y}$ may have several missing values, as there may only be exposure or outcome values, but not both, available for certain individuals. We let $N_\text{overlap}$ denote the number of overlapping individuals, those who have measured values for both outcome and exposure. The proportion of sample overlap, $\frac{N_\text{overlap}}{N}$, can range from 0 to 1, i.e. no overlap to full overlap. [$\star$]{style="color: darkblue;"} **Eliminating** ***Winner's Curse***: If this full data set $\{\mathbf{G},\mathbf{X},\mathbf{Y}\}$ were available, in order to avoid *Winner’s Curse*, we could first **randomly split it into two subsets**, with approximately $\pi N$ and $(1−\pi)N$ individuals in each set, respectively. Following this splitting, an exposure GWAS could be performed using the individual-level data in the first fraction or subset with $\pi N$ individuals, $\{\mathbf{G}_\pi,\mathbf{X}_\pi,\mathbf{Y}_\pi \}$, yielding variant-exposure association estimates, $\widehat{\gamma_{\pi_j}}$for each variant $j = 1, \dots,n$. These estimated associations could then be used for the sole purpose of selecting genetic instruments. This would ensure that *Winner’s Curse* is avoided as the other independent subset, $\{\mathbf{G}_{1-\pi},\mathbf{X}_{1-\pi},\mathbf{Y}_{1-\pi}\}$, would be used to perform an exposure and outcome GWAS, generating variant-exposure, $\widehat{\gamma_{{1-\pi}_j}}$, and variant-outcome association estimates, $\widehat{\Gamma_{{1-\pi}_j}}$, for the MR study. This process is depicted in the figure below.


[$\star$]{style="color: darkblue;"} **Weak instrument bias**: Recall from the first vignette, ["Winner's Curse and weak instrument bias in MR"](https://amandaforde.github.io/mr.simss/articles/wc_wib.html), that if the variant-exposure and variant-outcome association estimates used in the MR analysis are derived from a single sample or two largely overlapping samples and if several weak instruments are incorporated, then bias towards the confounded observational association will be introduced into the MR causal effect estimate. When individual-level data is available, a potential solution to this as well as *Winner's Curse*, would be to consider **splitting the $1- \pi$ fraction into two fractions, $p$ and $1-p$**. As illustrated in the image below, the dataset $\{\mathbf{G}_{(1-\pi)(p)},\mathbf{X}_{(1-\pi)(p)},\mathbf{Y}_{(1-\pi)(p)}\}$ could be used to perform an exposure GWAS and supply variant-exposure association estimates, $\widehat{\gamma_{{(1-\pi)(p)}_j}}$, while an outcome GWAS could be conducted with $\{\mathbf{G}_{(1-\pi)(1-p)},\mathbf{X}_{(1-\pi)(1-p)},\mathbf{Y}_{(1-\pi)(1-p)}\}$, giving variant-outcome association estimates, $\widehat{\Gamma_{{(1-\pi)(1-p)}_j}}$. A suitable two-sample summary-level MR method could then be applied to the resulting summary statistics: $\{ \widehat{\gamma_{{(1-\pi)(p)}_j}}, \widehat{\Gamma_{{(1-\pi)(1-p)}_j}} \}$.


Unfortunately, this splitting process, such as the one illustrated above, results in a great loss in sample size, and hence power. However, a reduction in the variability of the resulting MR causal effect estimate can be acquired by **repeating the random splitting and performing GWASs in each subset many, many times**, and then, averaging over the causal effect estimates obtained on each iteration. However, this is **not a very feasible solution**, as in addition to requiring individual-level data, a major limitation of this approach is its severe computational demands, in terms of both run time and memory usage.
[$\star$]{style="color: darkblue;"} **MR-SimSS seeks to imitate this repeated sample splitting process, using only GWAS summary statistics.** ### How does MR-SimSS work? As MR-SimSS works with **summary-level data**, it assumes that an exposure and outcome GWAS have been performed, providing variant-exposure and variant-outcome association estimates, $\{\widehat{\gamma_{j}}, \widehat{\Gamma_j}\}$, in which there may be a number of overlapping individuals between the two GWASs, $N_\text{overlap}$.
  1. Constructed with the intention to imitate the above repeated sample splitting, **MR-SimSS first considers splitting the dataset into two fractions**, $\pi$ and $1-\pi$. Values are simulated for the association estimates in the first fraction, $\{\widehat{\gamma_{{\pi}_j}}, \widehat{\Gamma_{{\pi}_j}}\}$, conditional on those in the full dataset, $\{\widehat{\gamma_{j}}, \widehat{\Gamma_j}\}$, i.e. values are simulated for $\{\widehat{\gamma_{{\pi}_j}}, \widehat{\Gamma_{{\pi}_j}}\}$, using the **asymptotic conditional distribution** of $\begin{bmatrix} \widehat{\gamma_{\pi}} \\ \widehat{\Gamma_{\pi}} \end{bmatrix} \Bigm| \begin{bmatrix} \widehat{\gamma} \\ \widehat{\Gamma} \end{bmatrix}$. For genetic variant $j$, the asymptotic conditional distribution has been derived as: $$ \left( \begin{bmatrix} \widehat{\gamma_{\pi}} \\ \widehat{\Gamma_{\pi}} \end{bmatrix} \Bigm| \begin{bmatrix} \widehat{\gamma} \\ \widehat{\Gamma} \end{bmatrix} \right) \sim N \left( \begin{bmatrix} \widehat{\gamma} \\ \widehat{\Gamma} \end{bmatrix}, \left( \frac{1-\pi}{\pi} \begin{bmatrix} \sigma^2_X & \sigma_X \sigma_Y \frac{N_\text{overlap} \rho}{N_X N_Y}\\ \sigma_X \sigma_Y \frac{N_\text{overlap} \rho}{N_X N_Y} & \sigma^2_Y\end{bmatrix} \right)\right) $$ Here, $\rho$ represents the correlation between the exposure and the outcome, $\rho = \text{cor}(X,Y)$, which may differ from $\beta$, the true exposure-outcome causal effect, due to confounding. A detailed outline of the derivation of this asymptotic distribution can be found [here](https://amandaforde.github.io/mr.simss/articles/derive-MR-SimSS.html).
  2. The **estimated standard errors** of the simulated association estimates in the first fraction, $\{ \widehat{\sigma_{X_{\pi,j}}},\widehat{\sigma_{Y_{\pi,j}}}\}$ can be obtained by approximating then with $\sigma_{X_{\pi,j}} \sim \sqrt{\frac{1}{\pi}} \cdot \widehat{\sigma_{X_{j}}}$ and $\sigma_{Y_{\pi,j}} \sim \sqrt{\frac{1}{\pi}} \cdot \widehat{\sigma_{Y_{j}}}$, respectively.
  3. The **genetic instruments are selected** according to their $z$-statistics, $z_{X_{\pi,j}} = \frac{\widehat{\gamma_{\pi_j}}}{\widehat{\sigma_{X_{\pi,j}}}}$ in fraction $\pi$, i.e. genetic variants are chosen to be incorporated into the MR analysis which have corresponding $p$-values less than the genome-wide significance threshold of $5 \times 10^{-8}$.
  4. For each genetic variant, the association estimates that would have been obtained if GWASs were conducted in the second fraction, $1−\pi$, can be generated using $\widehat{\gamma_j} \approx \pi\widehat{\gamma_{\pi_j}} + (1-\pi)\widehat{\gamma_{{(1-\pi)}_j}}$ and $\widehat{\Gamma_j} \approx \pi\widehat{\Gamma_{\pi_j}} + (1-\pi)\widehat{\Gamma_{{(1-\pi)}_j}}$. Note that these relationships hold due to the **asymptotic linearity of maximum likelihood estimates** but therefore, are only approximate to order $\frac{1}{N}$. The corresponding standard errors in the $1-\pi$ fraction are similarly approximated by $\widehat{\sigma_{X_{1-\pi_j}}} \sim \sqrt{\frac{1}{1-\pi}} \cdot \widehat{\sigma_{X_{j}}}$ and $\widehat{\sigma_{Y_{1-\pi_j}}} \sim \sqrt{\frac{1}{1-\pi}} \cdot \widehat{\sigma_{Y_{j}}}$.
  5. Applying a **summary-level MR method** of choice, such as the IVW method, using the summary statistics $\{\widehat{\gamma_{{1-\pi}_j}}, \widehat{\sigma_{X_{1-\pi_j}}}, \widehat{\Gamma_{{1-\pi}_j}}, \widehat{\sigma_{Y_{1-\pi_j}}} \}$ of selected genetic variants, i.e. those that satisfy the condition $\left| z_{X_{\pi_j}} \right| > \Phi^{-1}\left(1- \frac{5 \times 10^{-8}}{2}\right)$, describes the **2-split version of MR-SimSS**. Steps 1-5 are illustrated in the figure below.


However, as mentioned above, if there exists a large degree of overlap between the sample sets used in the exposure and outcome GWASs, then using association estimates derived from the same fraction is likely to induce a notable amount of **weak instrument bias** in the MR estimate **in the direction of the confounded observational exposure-outcome association**. Therefore, a **3-split version of MR-SimSS** was proposed that would eliminate *Winner's Curse* bias, and only result in weak instrument bias towards the null, irrespective of the proportion of sample overlap.
  1. After carrying out steps 1-4 above, the 3-split version of MR-SimSS imitates the **splitting of individual-level data in the $1-\pi$ fraction** into two fractions, $p$ and $1− p$. Values are simulated for the estimated variant-exposure and variant-outcome associations, $\{\widehat{\gamma_{{(1-\pi)p}_j}}, \widehat{\Gamma_{{(1-\pi)p}_j}}\}$, respectively, conditional on the generated association estimates in the $1-\pi$ fraction, $\{\widehat{\gamma_{{(1-\pi)}_j}}, \widehat{\Gamma_{{(1-\pi)}_j}}\}$. The **asymptotic conditional distribution** of $\begin{bmatrix} \widehat{\gamma_{(1-\pi)p}} \\ \widehat{\Gamma_{(1-\pi)p}} \end{bmatrix} \Bigm| \begin{bmatrix} \widehat{\gamma_{1-\pi}} \\ \widehat{\Gamma_{1-\pi}} \end{bmatrix}$ takes a similar format to the conditional distribution included above - an expression for it can be found [here](https://amandaforde.github.io/mr.simss/articles/derive-MR-SimSS.html).
  2. Using the **asymptotic linearity of maximum likelihood estimates**, $\widehat{\Gamma_{{(1-\pi)(1-p)}_j}}$ is obtained using $\widehat{\Gamma_{{(1-\pi)}_j}} \approx p\widehat{\Gamma_{{(1-\pi)p}_j}}-(1-p)\widehat{\Gamma_{{(1-\pi)(1-p)}_j}}$, with the estimated standard errors $\widehat{\sigma_{X_{(1-\pi)p_j}}}$ and $\widehat{\sigma_{X_{(1-\pi)(1-p)_j}}}$ again approximated in a similar manner to previously.
  3. A **summary-level MR method** of choice, such as the IVW or MR-RAPS methods is applied to the summary statistics $\{\widehat{\gamma_{{(1-\pi)}p_j}}, \widehat{\sigma_{X_{(1-\pi)p_j}}}, \widehat{\Gamma_{{(1-\pi)(1-p)}_j}}, \widehat{\sigma_{Y_{(1-\pi)(1-p)_j}}} \}$ of genetic variants $j=1, \dots, n_{\text{sig}}$ that have been selected in the first $\pi$ fraction, i.e. those that satisfy the condition $\left| z_{X_{\pi_j}} \right| > \Phi^{-1}\left(1- \frac{5 \times 10^{-8}}{2}\right)$. This process is depicted in the figure below.


  1. In order to reduce the variability in the final MR-SimSS causal effect estimate and ensure its stability, we suggest **repeating this simulated sample splitting process many times**, at least $N_\text{iter} = 1,000$ iterations.
  2. The **final value for $\hat\beta$**, the estimated exposure-outcome causal effect, supplied by MR-SimSS is equivalent to the **average of the estimates generated by each iteration** of the process, i.e. $\overline{\hat\beta} = \frac{1}{N_{\text{iter}}} \sum_{k=1}^{N_{\text{iter}}} \hat\beta^{(k)}$. Here, the estimate of the causal effect obtained at iteration $k$ of the simulated sample splitting process is denoted by $\hat\beta^{(k)}$.
  3. A suitable value for $\text{se}(\overline{\hat\beta})$, the **standard error of the MR-SimSS causal effect estimate**, can be obtained using the following expression: $$\text{se}(\overline{\hat\beta}) = \sqrt{\frac{\sum_{k=1}^{N_{\text{iter}}} \left[ \left( \text{se}\left( \hat\beta^{(k)} \right)\right)^2 \right] - \sum_{k=1}^{N_{\text{iter}}} \left[ \hat\beta^{(k)} - \overline{\hat\beta}\right]^2}{N_\text{iter}}}$$ Note that $\text{se}\left(\hat\beta^{(k)}\right)$ is the standard error of the causal effect estimate at iteration $k$, which is typically provided by the summary-level MR method of choice. More details regarding the expression for $\text{se}(\overline{\hat\beta})$ can be found [here](https://amandaforde.github.io/mr.simss/articles/derive-MR-SimSS.html).

[$\star$]{style="color: darkblue;"} **Note:** The asymptotic conditional distribution equation, given in Step 1, suggests that in order to implement MR-SimSS, if the number of overlapping samples involved in the exposure and outcome GWASs, $N_\text{overlap}$, is known to differ from zero, $N_\text{overlap} \neq 0$, then this number $N_\text{overlap}$ as well as the correlation between the exposure and outcome, $\rho$, must be provided. In general, $N_\text{overlap}$ may be known, especially if the GWASs have been performed on a single fully overlapping sample, and an estimate for $\rho$ could be obtained from the literature. However, especially when only GWAS summary statistics are available, it can often be difficult to know the value of $N_\text{overlap}$ or $\rho$. Therefore, within its functionality, MR-SimSS also has an **alternative strategy which seeks to estimate a value for $\lambda = \frac{N_\text{overlap}\rho}{\sqrt{N_X N_Y}}$**, using only the available **summary-level data of the genome-wide set of genetic variants** i.e. $\{ \widehat{\gamma_j}, \widehat{\sigma_{X_j}}, \widehat{\Gamma_j}, \widehat{\sigma_{Y_j}}\}$. Additional details regarding this approach, which is based on conditional likelihood, can be found [here](https://amandaforde.github.io/mr.simss/articles/derive-MR-SimSS.html#estimating-the-correlation).
[$\star$]{style="color: darkblue;"} **Note:** Unfortunately, repeatedly simulating values for $\{\gamma_{\pi_j}, \Gamma_{\pi_j}\}$ and $\{\gamma_{{(1-\pi)p}_j}, \Gamma_{{(1-\pi)p}_j}\}$ for thousands of genetic variants can make MR-SimSS a very computationally intensive process, especially in comparison with other summary-level MR methods. By recognising that irrespective of the genetic architecture of the exposure, there will likely be many variants which will never be selected across the $N_\text{iter}$ iterations, MR-SimSS contains a **computationally efficient feature** which first establishes a **smaller subset of genetic variants** that will then be entered into the main algorithm. This subset of genetic variants is constructed by first ordering all variants according to their probabilities, $f\left( z_{X_{\pi,j}} > 5.45\right)$, from smallest to largest and then, obtaining the cumulative sum of these probabilities for each variant. All SNPs with a cumulative probability greater than 0.05 are included in the subset. This process ensures that for a single iteration of MR-SimSS, the number of genetic instruments if the full set of SNPs is used and the number of instruments if merely the subset is used will be equal with probability at least 95\%. Further details regarding can be found [here](https://amandaforde.github.io/mr.simss/articles/derive-MR-SimSS.html#reducing-the-set-of-variants).
### Assumptions made by MR-SimSS As with many statistical methods, the MR-SimSS framework makes **several assumptions**: ```{css, echo=FALSE} table, th, td { border: 1px solid black; font-size: 12.5px; } th, td { padding: 8px; } td:nth-child(odd){ text-align: center; } th { background-color: #38d1a4; } ```
### Notation
**Symbol** **What does it represent?**
$\beta$ True causal effect of modifiable exposure on health-related outcome
$\hat\beta$ Estimated exposure-outcome causal effect
$X$ Exposure
$Y$ Outcome
$U$ Unmeasured variables which confound the exposure-outcome relationship
$G_j$ Genotype of genetic variant $j$, $G_j \in \{0,1,2\}$
$\gamma_j$ True effect of genetic variant $j$ on the exposure
$\hat\gamma_j$ Estimated association between genetic variant $j$ and the exposure
$\widehat{\gamma_{*_j}}$ Estimated association between genetic variant $j$ and the exposure in fraction $*$ of the full dataset
$\sigma_{X_j}$ True standard error of the variant-exposure association estimate, $\widehat{\gamma_j}$, for variant $j$
$\widehat{\sigma_{X_j}}$ Estimated standard error of the variant-exposure association estimate, $\widehat{\gamma_j}$, for variant $j$
$\widehat{\sigma_{X_{*,j}}}$ Estimated standard error of the variant-exposure association estimate, $\widehat{\gamma_{*_j}}$, for variant $j$ in fraction $*$ of the full dataset
$\widehat{\Gamma_j}$ Estimated association between genetic variant $j$ and the outcome
$\widehat{\Gamma_{*_j}}$ Estimated association between genetic variant $j$ and the outcome in fraction $*$ of the full dataset
$\sigma_{Y_j}$ True standard error of the variant-outcome association estimate, $\widehat{\Gamma_j}$,for variant $j$
$\widehat{\sigma_{Y_j}}$ Estimated standard error of the variant-outcome association estimate, $\widehat{\Gamma_j}$,for variant $j$
$\widehat{\sigma_{Y_{*,j}}}$ Estimated standard error of the variant-outcome association estimate, $\widehat{\Gamma_j}$,for variant $j$ in fraction $*$ of the full dataset
$n$ Number of measured genetic variants
$N$ Number of participants in the full data set, i.e. number of individuals with genotype information, exposure and/or outcome measurements
$\{\mathbf{G}, \mathbf{X},\mathbf{Y}\}$ Full individual-level dataset in which $\mathbf{G}$ represents the matrix of genotype information for each genetic variant and participant, i.e. $\mathbf{G} = \{\mathbf{G}_1, \dots, \mathbf{G}_N\}\in \mathbb{R}^{N\times n}$, where $\mathbf{G}_j = [G_{1j}, \dots, G_{Nj}]^T$ contains the genotype information for the $j$th variant, $\mathbf{X} = [X_1, \dots,X_N]^T$ is the vector of measured exposures, and $\mathbf{Y} = [Y_1, \dots, Y_N]^T$ is the vector of outcome values
$\{\mathbf{G}_{*}, \mathbf{X}_{*},\mathbf{Y}_{*}\}$ Individual-level data set, similar to that above, but only containing data for those individuals in fraction $*$ of the full dataset
$N_\text{overlap}$ Number of overlapping individuals, i.e. number of participants who have measured values for both outcome and exposure
$\pi$ First fraction in which the full dataset is split into – fraction $\pi$ is used to select genetic instruments
$p$ Second fraction in which the full dataset is split into – fraction $1-\pi$ is split into fractions $p$ and $1-p$ to estimate variant-exposure and variant-outcome associations
$\rho$ Correlation between the exposure and the outcome, i.e. $\rho = \text{cor}(X,Y)$
$z_{X_{\pi,j}}$ $z$-statistic of variant $j$ in fraction $\pi$ of the dataset, i.e. $z_{X_{\pi,j}} = \frac{\widehat{\gamma_{\pi_j}}}{\widehat{\sigma_{X_{\pi,j}}}}$ - used to select genetic instruments
$n_\text{sig}$ Number of genetic instruments which have been selected using the (simulated) summary statistics from the first fraction, $\pi$, of the dataset
$N_\text{iter}$ Number of iterations of sample splitting performed by MR-SimSS
$\hat\beta^{(k)}$ Causal effect estimate produced by MR-SimSS on iteration $k$
$\overline{\hat\beta}$ Final estimate for the exposure-outcome causal effect supplied by MR-SimSS, i.e. average of the estimates generated at each iteration
$\text{se}(\overline{\hat\beta})$ Standard error of MR-SimSS causal effect estimate
$\text{se}\left(\hat\beta^{(k)}\right)$ Standard error of causal effect estimate at iteration $k$, generally supplied by chosen summary-level MR method
$\lambda$ Symbol representing $\frac{N_{\text{overlap}}\rho}{\sqrt{N_X N_Y}}$ in which $N_X$ is the number of individuals in the full dataset with measured exposures, i.e. exposure GWAS sample size, and $N_Y$ is the number of individuals in the full dataset with measured outcomes, i.e. outcome GWAS sample size