Tutorial 3: Regional genotype-phenotype map

remotes::install_github('ritarasteiro/hyprcoloc', build_opts = c('--resave-data', '--no-manual'), build_vignettes = TRUE) 
remotes::install_github('ritarasteiro/susieR')

library(gwasglue2)
library(ieugwasr)
library(susieR)     # fork that takes Summaryset
library(hyprcoloc)  # fork that takes Dataset
devtools::load_all("../") # this was added just for development

The use case for this tutorial will be generating a regional genotype-phenotype map for traits with significant associations in a known LD block on chromosome 22.

First, we will set the path to where the reference population plink bed files are located to build the LD matrix

bed_ref <- "../data/ld/EUR"

and define a genomic range for the analysis to be performed. This is a known LD block on chromosome 22 (build hg19)

genomicrange <- "22:43714200-44995307"

We need to scan OpenGWAS for traits that have a significant associations in this region

id_list <- ieugwasr::phewas(pval = 5e-6, variants=genomicrange, batch="ukb-a")$id %>%
  unique()

and then obtain the metadata for each trait

# get metadata and create metadata objects
metadata <- lapply(seq_along(id_list), function(i){
  m <- create_metadata(ieugwasr::gwasinfo(id_list[i])) 
})

Finally, we create the summarysets for each trait and the dataset, which comprise all the summarysets. Then we harmonise trait against trait followed by each trait against a LD matrix. Note that the dataset is being created through a chain of multiple operations using pipes. Unlike the example in Tutorial 1, the sumarysets are not saved separately.

# create gwasglue2 DataSet object 
  dataset <- lapply(seq_along(id_list), function(i){
    # create summarysets
    dt <- create_summaryset(ieugwasr::associations(variants = genomicrange, id = id_list[i]), metadata=metadata[[i]])
  }) %>%
    # create dataset and harmonise trait againts trait
    create_dataset(., harmonise = TRUE, tolerance = 0.08, action = 1) %>% 
    # harmonise dataset against LD matrix
    harmonise_ld(., bfile = bed_ref , plink_bin = "plink")

Next, for each trait we need to perform finemapping. This will involve using SusieR to identify credible sets in the region for the trait, and then generate the conditional summary datasets for each credible set.

Explanation:

Credible set: If a trait is inferred to have e.g. 3 causal variants in the genomic region then there will be 3 credible sets. Each credible set is essentially a cluster of SNPs close by, it will include 1 or more SNPs that are candidates to be the causal variant.

Conditional summary datasets: Once credible sets are identified, we can try to estimate what the summary data results would have looked like for each credible set after accounting for all the other credible sets. So in this example there would be 3 conditional summary datasets, each with a single peak in a different location.

We use a modified version of susieR (use the ritarasteiro/susieR fork) that reads a gwasglue2 dataset to generate the credible sets for each summaryset within the dataset (e.g. some of them will have only 1, some will have more than 1. But all of them will have at least 1 because they have been selected to have an association in the region)

TODO: Explain how we marginalise the summarysets using susieR Approximate Bayes Factors.

After summary sets are harmonised, marginalise each summary set independently and create a new dataset with all marginalised summary sets merged

# do finemapping with susie
ntraits <- getLength(dataset)
dataset_marginalised <- lapply(1:ntraits, function(trait)
  {
    # Takes in 1 SS
    # Outputs 1 DS (with at least 1 SS)
    ds <- susie_rss(R = getLDMatrix(dataset), summaryset = getSummarySet(dataset, trait))
})

Now we are going to merge the datasets. Note that the merge_datasets() function is just going to merge the marginalised datasets.

dataset_marginalised <- merge_datasets(dataset_marginalised)

Try hyprcoloc with raw datasets. (Use fork of hypercoloc that allows a dataset object to be provided)


res_dataset <- hyprcoloc(dataset = dataset) 
#hyprcoloc is using gwasglue2 DataSet class object as input
print(res_dataset)

Try hyprcoloc with marginalised datasets


res_dataset_marginalised <- hyprcoloc(dataset = dataset_marginalised)
print(res_dataset_marginalised)

NOTE: In this example there was no colocalisation for the traits in IEU batch "ukb-a"..