---
title: "Mendelian randomization"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Mendelian randomization}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval=FALSE
)
```

To use the [IEU GWAS database](https://gwas.mrcieu.ac.uk/) for MR analysis, see the [TwoSampleMR](https://mrcieu.github.io/TwoSampleMR/) R package.

Here we'll demonstrate how to achieve the same data extractions using the [GWAS VCF](https://github.com/MRCIEU/gwas_vcf_spec) files.

We'll use the example of LDL cholesterol [ieu-a-300](https://gwas.mrcieu.ac.uk/datasets/ieu-a-300/) and coronary heart disease [ieu-a-7](https://gwas.mrcieu.ac.uk/datasets/ieu-a-7/). 

Load libraries:

```{r}
suppressPackageStartupMessages(suppressWarnings({
	library(TwoSampleMR)
	library(gwasglue)
	library(gwasvcf)
	library(ieugwasr)
	library(dplyr)
}))
```

## Using TwoSampleMR

This is a simple procedure for MR using the TwoSampleMR package:

```{r}
# Extract the instruments for LDL
expd1 <- TwoSampleMR::extract_instruments("ieu-a-300")

# Extract those SNP effects for CHD
outd1 <- TwoSampleMR::extract_outcome_data(expd1$SNP, "ieu-a-7", proxies=FALSE)

# Harmonise the exposure and outcome data
dat1 <- TwoSampleMR::harmonise_data(expd1, outd1)

# Perform MR
TwoSampleMR::mr(dat1)
```

Note that this extraction process can be simplified with:

```{r, eval=FALSE}
dat1 <- make_dat("ieu-a-300", "ieu-a-7")
```

## Using GWAS VCF files

Let's do the same with the vcf files (and the indexes). Download from here:

```bash
wget https://gwas.mrcieu.ac.uk/files/ieu-a-300/ieu-a-300.vcf.gz
wget https://gwas.mrcieu.ac.uk/files/ieu-a-300/ieu-a-300.vcf.gz.tbi
wget https://gwas.mrcieu.ac.uk/files/ieu-a-7/ieu-a-7.vcf.gz
wget https://gwas.mrcieu.ac.uk/files/ieu-a-7/ieu-a-7.vcf.gz.tbi
```

First get the tophits for LDL cholesterol

```{r}
gwasvcf::set_bcftools()
expd2 <- gwasvcf::query_gwas("ieu-a-300.vcf.gz", chrompos=paste0(expd1$chr.exposure, ":", expd1$pos.exposure))
```

Convert to TwoSampleMR format:

```{r}
expd2 <- gwasglue::gwasvcf_to_TwoSampleMR(expd2, type="exposure")
```

Extract those SNPs from the outcome vcf file and convert to TwoSampleMR format

```{r}
outd2 <- gwasvcf::query_gwas("ieu-a-7.vcf.gz", chrompos = paste0(expd1$chr.exposure, ":", expd1$pos.exposure))
outd2 <- gwasglue::gwasvcf_to_TwoSampleMR(outd2, "outcome")
```

Proceed with harmonising and performing MR

```{r}
dat2 <- TwoSampleMR::harmonise_data(expd2, outd2)
TwoSampleMR::mr(dat2)
```


## Other options

### Clumping vcf files

If we want to extract top hits based on a threshold and clump locally. First download the LD reference dataset:

```bash
wget http://fileserve.mrcieu.ac.uk/ld/data_maf0.01_rs_ref.tgz
tar xzvf data_maf0.01_rs_ref.tgz
```

Now extract the top hits based on a p-value threshold

```{r}
gwasvcf::set_bcftools()
expd3 <- gwasvcf::query_gwas("ieu-a-300.vcf.gz", pval=5e-8)
expd3
```

Convert to TwoSampleMR format:

```{r}
expd3 <- gwasglue::gwasvcf_to_TwoSampleMR(expd3, type="exposure")
```

Get a list of SNPs to retain after clumping and subset the data

```{r}
retain_snps <- expd3 %>% dplyr::select(rsid=SNP, pval=pval.exposure) %>%
	ieugwasr::ld_clump(., plink_bin=genetics.binaRies::get_plink_binary(), bfile="data_maf0.01_rs_ref") %>%
	{.$rsid}
expd3 <- subset(expd3, SNP %in% retain_snps)
```

### Extracting outcome data with LD proxies

This only works if you extract on rsids at the moment, and is quite slow. But here it is:

```{r}
gwasvcf::set_plink()
outd2 <- gwasvcf::query_gwas("ieu-a-7.vcf.gz", rsid = expd3$SNP, proxies="yes", bfile="data_maf0.01_rs_ref")
outd2 <- gwasglue::gwasvcf_to_TwoSampleMR(outd2, "outcome")
```


## Further MR methods

A number of other MR methods are available. The current framework for using them is to:

1. Convert your data to TwoSampleMR format
2. Use the TwoSampleMR package to convert to other formats

Examples of formats that you can convert to within TwoSampleMR:

- [MR-PRESSO](https://github.com/rondolab/MR-PRESSO)
- [RadialMR](https://github.com/WSpiller/RadialMR)
- [MRMix](https://github.com/gqi/MRMix/)
- [MendelianRandomization](https://cran.r-project.org/web/packages/MendelianRandomization/index.html)
- [MR-RAPS](https://cran.r-project.org/web/packages/mr.raps/)


## Bluecrystal4 users

All data in OpenGWAS is stored on bc4 in the form of GWAS VCF files. You can create harmonised datasets easily on bc4 with these files.

Determine the locations of the GWAS VCF files and a number of other reference datasets and binaries:

```{r, eval=FALSE}
set_bcftools()
set_plink()
set_bc4_files()
```

Now simply run:

```{r, eval=FALSE}
dat <- make_TwoSampleMR_dat("ieu-a-300", "ieu-a-7")
```

This can be run in parallel for large combinations of exposures and outcomes, e.g.:

```{r, eval=FALSE}
dat <- make_TwoSampleMR_dat(
	id1=c("ieu-a-300", "ieu-a-302", "ieu-a-299"), 
	id2=c("ieu-a-7", "ieu-a-2"),
	nthreads=6
)
```

This will lookup all instruments in the exposures (id1) in both exposures and outcomes, and harmonise all exposure-outcome pairs, parallelised across 6 threads.

Note: please make sure to run these analyses in batch mode by submitting to the slurm scheduler, not interactively on the head nodes!