--- title: "Using gpmapr" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using gpmapr} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = TRUE ) ``` ```{r setup} library(gpmapr) ``` # Introduction The gpmapr package provides a set of functions to interact with the GPMap API. # Getting started The first port of entry is the `search_gpmap` function. This function allows you to search for a trait, gene, or variant. ```{r search-trait} haemoglobin <- search_gpmap("haemoglobin") knitr::kable(haemoglobin[, c("name", "sample_size", "num_coloc_groups", "call")]) ``` You can see information about the different traits that match your search, how to subsequently get more information (under `call`), and metadata about the trait (under `info`). ### Understanding `info` * Extractions: Each trait has a number of extractions, related to the number of finemapped ld regions that had a p-value < 1.5e-4. * Colocalisation Groups: The number of colocalisation groups associated with the trait / gene / variant. * Colocalising Traits: The number of traits that are colocalising with the trait / gene / variant. * Rare Result Groups: Each trait has a number of rare result groups, related to the number of groups of rare result groups that had a p-value < 1.5e-4. You can also search for a gene (by ENSG ID or gene name) or variant (by CHR:BP or rsID). ```{r other-searches} knitr::kable(search_gpmap("SLC44A")[, c("name", "sample_size", "num_study_extractions", "num_coloc_groups", "call")]) knitr::kable(search_gpmap("22:37042914")) ``` ### Variant search returns proxy variants The variant search will return the original variant and any proxy variants that are in LD with the original variant, which also have colocalisation results. Each variant is marked as either `original_variant` or `proxy_variant`. In this example, the original variant has no colocalisation results, but the proxy variant does. The proxy variant is also marked as having a r^2 of 0.96 ```{r proxy-variants} proxy_variants <- search_gpmap("rs17078078") knitr::kable(proxy_variants) ``` # Diving in You can get more information about a trait by subsequently calling the `trait()`, `gene()`, or `variant()` functions. This will provide you with the colocalisation results, rare disease results, and study extractions. Each trait, gene, or variant call have optional parameters: * `include_associations`: Whether to include associations (BETA, SE, P) of the variants in which it's colocalised in the output. * `include_coloc_pairs`: Whether to include coloc pairs in the output (which include h3 and h4), as opposed to just the coloc groups. * `h4_threshold`: The h4 threshold for coloc pairs. ## Trait example ```{r trait-example} hemoglobin_trait <- trait(931) names(hemoglobin_trait) nrow(hemoglobin_trait$coloc_groups) nrow(hemoglobin_trait$study_extractions) ``` This returns a list of elements, which you can access by name. More detailed explanations of the below are described in the API documentation. * `trait`: A list containing metadata about the trait, including common and rare studies associated with the trait * `coloc_groups`: a dataframe containing information about which studies have coloc results for this trait. * `study_extractions`: a list of dataframes containing the study extractions for this trait. * `rare_results`: (if they exist) a list of dataframes containing the rare results for this trait * `associations`: (optional) a dataframe containing the associations for the variants in which the trait is colocalised * `coloc_pairs`: (optional) a dataframe containing all pairwise coloc results for this trait. ### How `study_extractions`, `coloc_groups`, and `rare_results` relate All three layers refer to **study extractions** (GWAS hits in finemapped LD regions), keyed by `id` in `study_extractions` and by `study_extraction_id` in `coloc_groups` and `rare_results`. Think of **`study_extractions`** as the catalogue of extractions linked to the trait or gene in the map: each row is a study signal at a locus. Not every listed extraction necessarily has a full colocalisation story or a rare-disease result row. - **`coloc_groups`** holds extractions that sit inside a **colocalisation group** (shared coloc evidence with other studies/traits at that locus). If a study extraction appears here, it has coloc-structured results for that entity. - **`rare_results`** holds extractions that belong to a **rare-result group** (rare-disease-style finemapping in the API). The same biological signal can, in principle, be reflected in both rare and common coloc paths, so overlap between this set of extraction ids and those in `coloc_groups` is possible. - **Neither table:** Some extractions appear only under **`study_extractions`**. They are still tied to the trait or gene in the genotype–phenotype map (for example via **locus tagging, proximity, or other association logic** in the pipeline), but they do not currently carry a coloc-group row or a rare-result group row. Treat those as “present on the map at this locus” rather than “absent by mistake.” The same idea applies when you call **`gene()`** (or other entity-specific endpoints): `study_extractions` is the wider catalogue, while `coloc_groups` and `rare_results` subdivide it where those analyses exist. The table below illustrates counts for the hemoglobin trait example (`trait(931)`). ```{r trait-extraction-membership} extraction_ids <- function(x) { if (is.null(x) || !length(x)) { return(integer(0)) } id_col <- if ("id" %in% names(x)) "id" else NULL if (!is.null(id_col)) { return(unique(stats::na.omit(x[[id_col]]))) } unique(stats::na.omit(unlist(lapply(x, function(df) { if (!is.data.frame(df)) { return(NULL) } if ("id" %in% names(df)) { return(df[["id"]]) } NULL })))) } cg_ids <- unique(stats::na.omit(hemoglobin_trait$coloc_groups$study_extraction_id)) rr <- hemoglobin_trait$rare_results rr_ids <- if (is.null(rr) || !length(rr)) { integer(0) } else { unique(stats::na.omit(unlist(lapply(rr, function(d) { if (!is.data.frame(d) || !"study_extraction_id" %in% names(d)) { return(NULL) } d[["study_extraction_id"]] })))) } se_ids <- extraction_ids(hemoglobin_trait$study_extractions) only_study_extractions <- setdiff(se_ids, union(cg_ids, rr_ids)) knitr::kable(data.frame( source = c( "Distinct study_extraction_id in coloc_groups", "Distinct study_extraction_id in rare_results", "In study_extractions only (not in coloc_groups nor rare_results)", "Total distinct extraction id in study_extractions" ), n_extractions = c( length(cg_ids), length(rr_ids), length(only_study_extractions), length(se_ids) ) )) ``` # Gene example ```{r gene-example} slc44a1 <- gene("SLC44A1") names(slc44a1) nrow(slc44a1$coloc_groups) knitr::kable(head(slc44a1$coloc_groups[, c("coloc_group_id", "trait_name", "data_type", "tissue")])) ``` Note that by default, the gene endpoint includes trans genetic effects. You can turn this off with `gene("SLC44A1", include_trans = FALSE)`, or filter `coloc_groups` after the fact (for example `cis_trans == "cis"` or a `min_p` cutoff). After you subset rows, some `coloc_group_id` values may be left with only a single row. Those are no longer meaningful colocalisation **groups**, so drop them by keeping only groups with more than one row: ```{r gene-filter-example} slc44a1_no_trans <- slc44a1 slc44a1_no_trans$coloc_groups <- dplyr::filter( slc44a1_no_trans$coloc_groups, cis_trans == "cis" ) # or e.g. dplyr::filter(..., min_p <= 5e-8) slc44a1_no_trans$coloc_groups <- slc44a1_no_trans$coloc_groups |> dplyr::group_by(coloc_group_id) |> dplyr::filter(dplyr::n() > 1L) |> dplyr::ungroup() nrow(slc44a1_no_trans$coloc_groups) knitr::kable(head(slc44a1_no_trans$coloc_groups[, c("coloc_group_id", "trait_name", "data_type", "tissue")])) ``` ## Coloc pairs example Although coloc groups provide a helpful overview of the colocalisation results for a gene, you may want to look at the pairwise relationships between the studies. You can do this by setting `include_coloc_pairs = TRUE` in the call. As there may be many coloc pairs, the results are not as fleshed out as the coloc groups. You may have to filter the results to get the information you need, or join with the study extractions to get the full picture. ```{r coloc-pairs-example} slc44a1_coloc_pairs <- gene("SLC44A1", include_coloc_pairs = TRUE) nrow(slc44a1_coloc_pairs$coloc_pairs) knitr::kable(head(slc44a1_coloc_pairs$coloc_pairs[, c("study_extraction_a_id", "study_extraction_b_id", "h4")])) study_extractions_tbl <- dplyr::bind_rows(slc44a1_no_trans$study_extractions) study_extractions_subset <- dplyr::select(study_extractions_tbl, id, trait_name, min_p) hydrated_coloc_pairs <- slc44a1_coloc_pairs$coloc_pairs |> dplyr::left_join(slc44a1_coloc_pairs$study_extractions, by = c("study_extraction_a_id" = "id")) |> dplyr::rename(trait_a = trait_name) |> dplyr::left_join(slc44a1_coloc_pairs$study_extractions, by = c("study_extraction_b_id" = "id")) |> dplyr::rename(trait_b = trait_name) knitr::kable(head(hydrated_coloc_pairs[, c("trait_a", "trait_b", "h4")])) ``` # Target–indication prioritisation workflow: Alzheimer's disease This section demonstrates a systematic workflow for using the GPMap to prioritise gene targets for a disease of interest. We use **Alzheimer's disease (AD)** as an example. The logic proceeds through four steps: 1. **Get a list of genes mapped to AD** via colocalisation. 2. **Rank genes by pleiotropy score** — low-pleiotropy genes are more specific to the indication. 3. **Assess variant-level specificity** — if a variant associates with many cis-genes, it provides weaker evidence for any single target. 4. **Check consistency across multiple variants and rare results** — concordant signals from independent loci and rare variant evidence strengthen the target–indication link. ## Step 1: Genes mapped to Alzheimer's disease We start by searching for Alzheimer's traits and then pulling their colocalisation groups to identify the set of genes that have colocalisation evidence with AD. ```{r ad-search} ad_search <- search_gpmap("alzheimer") ad_trait_ids <- ad_search$type_id ad_coloc <- dplyr::bind_rows(lapply(ad_trait_ids, function(tid) { tryCatch(trait(tid)$coloc_groups, error = function(e) NULL) })) |> dplyr::filter(!grepl("methyl", .data$data_type, ignore.case = TRUE)) ad_genes <- ad_coloc |> dplyr::filter(!is.na(.data$gene_id)) |> dplyr::distinct(.data$gene_id, .data$gene) nrow(ad_genes) ``` ## Step 2: Rank by gene-level pleiotropy The GPMap provides a pre-computed pleiotropy score for every gene: `distinct_trait_categories` counts how many different phenotypic categories a gene's coloc groups span. A **low** value means the gene's associations are concentrated in a narrow phenotypic space — desirable for a drug target because it predicts fewer off-target effects. We can retrieve pleiotropy scores for all genes at once using `get_all_gene_pleiotropies()` and join them to our AD gene list. ```{r ad-pleiotropy} all_pleiotropy <- get_all_gene_pleiotropies() ad_genes_ranked <- ad_genes |> dplyr::inner_join( all_pleiotropy[, c("gene_id", "distinct_trait_categories", "distinct_protein_coding_genes")], by = "gene_id" ) |> dplyr::arrange(.data$distinct_trait_categories) ``` There are a total of `r nrow(ad_genes_ranked)` genes mapped to Alzheimer's disease. We can see some of the genes that are mapped to AD, and the number of colocalisation groups that each gene has. Genes at the top of this list — those with very few distinct trait categories — are the most "specific" to Alzheimer's disease. A gene appearing in many trait categories (high pleiotropy) is more likely to produce on-target side effects if therapeutically modulated. ## Step 3: Variant-level specificity and consistency across variants and rare results Each colocalisation group is anchored to a **variant** (an LD block). Within a single coloc group, the same variant may map to expression or protein QTLs of **multiple cis-genes**. When a variant associates with many cis-genes or trait categories, it provides weaker evidence that the gene would be a good drug target without potential off-target effects. The strongest evidence for a target-indication pair comes when **multiple independent variants** (different LD blocks) and **rare variant results** all point to the same gene. This replication across loci greatly reduces the chance that any single colocalisation is spurious. Let's pick some of the top-ranked genes from Step 2 and check how many independent coloc groups and rare result groups support them. ```{r ad-consistency} top_genes <- head(ad_genes_ranked, 40) gene_evidence <- lapply(top_genes$gene, function(g) { gi <- tryCatch(gene(g, include_associations = TRUE), error = function(e) NULL) if (is.null(gi)) return(NULL) cg <- gi$coloc_groups rr <- gi$rare_results ad_cg <- if (!is.null(cg) && nrow(cg) > 0) { cg[cg$trait_id %in% ad_trait_ids, ] } else { data.frame() } ad_rr <- if (!is.null(rr) && nrow(rr) > 0) { rr[rr$trait_id %in% ad_trait_ids, ] } else { data.frame() } n_tissues <- if (!is.null(cg) && nrow(cg) > 0 && "tissue" %in% names(cg)) { dplyr::n_distinct(cg$tissue[!is.na(cg$tissue)]) } else { 0L } data.frame( gene = g, n_coloc_groups = dplyr::n_distinct(ad_cg$coloc_group_id), n_tissues = n_tissues, n_rare_groups = if (nrow(ad_rr) > 0) dplyr::n_distinct(ad_rr$rare_result_group_id) else 0L ) }) |> dplyr::bind_rows() ad_summary <- ad_genes_ranked |> dplyr::filter(.data$gene %in% gene_evidence$gene) |> dplyr::left_join(gene_evidence, by = "gene") |> dplyr::select("gene", "distinct_trait_categories", "distinct_protein_coding_genes", "n_tissues", "n_coloc_groups", "n_rare_groups") |> dplyr::arrange(.data$distinct_trait_categories, .data$gene) ad_summary_long <- ad_summary |> tidyr::pivot_longer( cols = c("distinct_trait_categories", "distinct_protein_coding_genes", "n_tissues", "n_coloc_groups", "n_rare_groups"), names_to = "metric", values_to = "value" ) |> dplyr::mutate( metric = factor(.data$metric, levels = c("distinct_trait_categories", "distinct_protein_coding_genes", "n_tissues", "n_coloc_groups", "n_rare_groups"), labels = c("Trait categories", "Protein-coding genes", "Distinct tissues", "Colocalization groups", "Rare result groups") ), gene = factor(.data$gene, levels = rev(ad_summary$gene)) ) ``` ```{r ad-summary-plot, fig.width = 7, fig.height = 14} ad_summary_plot <- ggplot2::ggplot(ad_summary_long, ggplot2::aes(x = value, y = gene, fill = metric)) + ggplot2::geom_col(show.legend = FALSE, width = 0.5) + ggplot2::facet_wrap(~ metric, scales = "free_x", nrow = 1) + ggplot2::labs(x = NULL, y = NULL, title = "Alzheimer's disease gene targets: pleiotropy and evidence summary") + ggplot2::theme_bw(base_size = 9) + ggplot2::theme( strip.text = ggplot2::element_text(face = "bold"), panel.grid.minor = ggplot2::element_blank(), panel.grid.major.y = ggplot2::element_blank(), axis.text.y = ggplot2::element_text(size = 5) ) ad_summary_plot ``` Genes with **multiple coloc groups** and supporting **rare result groups** have the most robust genetic evidence linking them to Alzheimer's disease. Combined with low pleiotropy (few trait categories), these represent the highest-priority targets for further investigation. ## Diving deeper: TREM2 From the table above, **TREM2** stands out as a particularly interesting target for Alzheimer's disease. It has very low pleiotropy (few trait categories), meaning its genetic associations are highly specific to AD rather than spread across many unrelated phenotypes. At the same time, it is supported by multiple coloc groups and rare result groups — independent lines of genetic evidence converging on the same gene. Let's examine TREM2 in more detail by looking at the direction of effect across all its AD-related colocalisation and rare variant signals. ```{r trem2-deep-dive} trem2 <- gene("TREM2", include_associations = TRUE) trem2$gene$distinct_trait_categories trem2$gene$distinct_protein_coding_genes ``` ```{r trem2-ad-groups} trem2_ad_coloc_group_ids <- unique( trem2$coloc_groups[trem2$coloc_groups$trait_id %in% ad_trait_ids, ]$coloc_group_id ) trem2_ad_rare_group_ids <- if (!is.null(trem2$rare_results) && nrow(trem2$rare_results) > 0) { unique(trem2$rare_results[trem2$rare_results$trait_id %in% ad_trait_ids, ]$rare_result_group_id) } else { character(0) } trem2_ad_groups <- trem2$coloc_groups[ trem2$coloc_groups$coloc_group_id %in% trem2_ad_coloc_group_ids, ] trem2_ad_rare <- if (length(trem2_ad_rare_group_ids) == 0L) { trem2$rare_results[0, ] } else { trem2$rare_results[ trem2$rare_results$rare_result_group_id %in% trem2_ad_rare_group_ids, ] } ``` The forest plot below shows the effect sizes (beta with 95% CI) for all colocalisation and rare result entries related to Alzheimer's disease at TREM2. Panels are labelled with the variant and its MAF (minor allele frequency), and colour distinguishes **molecular phenotypes** (eQTL, pQTL, sQTL, etc.) from **traits** (disease GWAS). ```{r trem2-forest-helpers, include = FALSE} truncate_label <- function(x, max_chars = 50) { ifelse(nchar(x) > max_chars, paste0(substr(x, 1, max_chars - 3), "..."), x) } forest_row_label <- function(trait_name, data_type, tissue, max_chars = 50) { tn <- if (missing(tissue) || is.null(tissue)) { rep(NA_character_, length(trait_name)) } else { tissue } raw <- paste( trait_name, data_type, tidyr::replace_na(as.character(tn), ""), sep = " \u2014 " ) truncate_label(raw, max_chars) } data_type_labels <- c( splice_variant = "Splice Variant", transcript = "Transcript", gene_expression = "Gene Expression", protein = "Protein", methylation = "Methylation", metabolome = "Metabolome", cell_trait = "Cell Trait", plasma_protein = "Plasma Protein", phenotype = "Phenotype" ) label_data_type <- function(data_type) { labelled <- data_type_labels[data_type] ifelse(is.na(labelled), data_type, labelled) } ``` ```{r trem2-forest, fig.width = 6, fig.height = 7} trem2_all_rows <- dplyr::bind_rows( trem2_ad_groups, if (!is.null(trem2_ad_rare) && nrow(trem2_ad_rare) > 0L) trem2_ad_rare else NULL ) variant_maf <- trem2_all_rows |> dplyr::filter(!is.na(.data$eaf)) |> dplyr::mutate(maf = pmin(.data$eaf, 1 - .data$eaf)) |> dplyr::group_by(.data$variant_id) |> dplyr::summarise(maf = min(.data$maf, na.rm = TRUE), .groups = "drop") make_panel_label <- function(display_snp, variant_id, maf_lookup) { snp <- dplyr::coalesce(as.character(display_snp), paste0("variant ", as.character(variant_id))) maf <- maf_lookup$maf[match(variant_id, maf_lookup$variant_id)] maf_str <- ifelse(is.na(maf), "", paste0(" (MAF=", formatC(maf, format = "f", digits = 3), ")")) paste0(snp, maf_str) } trem2_forest_coloc <- trem2_ad_groups |> dplyr::filter(!is.na(.data$beta), !is.na(.data$se), .data$se > 0) |> dplyr::mutate( lo = .data$beta - 1.96 * .data$se, hi = .data$beta + 1.96 * .data$se, row_label = forest_row_label(.data$trait_name, .data$data_type, .data$tissue), panel_id = make_panel_label(.data$display_snp, .data$variant_id, variant_maf), data_type_label = label_data_type(.data$data_type) ) |> dplyr::group_by(.data$coloc_group_id) |> dplyr::arrange(.data$beta, .by_group = TRUE) |> dplyr::mutate(row_label = factor(.data$row_label, levels = unique(.data$row_label))) |> dplyr::ungroup() trem2_forest_rare <- if (is.null(trem2_ad_rare) || nrow(trem2_ad_rare) == 0L) { NULL } else { rr <- trem2_ad_rare if (!"tissue" %in% names(rr)) rr$tissue <- NA_character_ if (!"eaf" %in% names(rr)) rr$eaf <- NA_real_ rr |> dplyr::filter(!is.na(.data$beta), !is.na(.data$se), .data$se > 0) |> dplyr::mutate( lo = .data$beta - 1.96 * .data$se, hi = .data$beta + 1.96 * .data$se, row_label = forest_row_label(.data$trait_name, .data$data_type, .data$tissue), panel_id = make_panel_label(.data$display_snp, .data$variant_id, variant_maf), data_type_label = label_data_type(.data$data_type) ) |> dplyr::group_by(.data$panel_id) |> dplyr::arrange(.data$beta, .by_group = TRUE) |> dplyr::mutate(row_label = factor(.data$row_label, levels = unique(.data$row_label))) |> dplyr::ungroup() } trem2_forest <- dplyr::bind_rows(trem2_forest_coloc, trem2_forest_rare) |> dplyr::mutate(panel_id = factor(.data$panel_id, levels = unique(.data$panel_id))) cbf_palette <- c( "Gene Expression" = "#0072B2", "Splice Variant" = "#E69F00", "Protein" = "#009E73", "Plasma Protein" = "#56B4E9", "Transcript" = "#CC79A7", "Methylation" = "#F0E442", "Metabolome" = "#D55E00", "Cell Trait" = "#999999", "Phenotype" = "#000000" ) ggplot2::ggplot(trem2_forest, ggplot2::aes(x = beta, y = row_label, colour = data_type_label)) + ggplot2::geom_vline(xintercept = 0, linetype = 2, colour = "grey55") + ggplot2::geom_errorbarh( ggplot2::aes(xmin = lo, xmax = hi), height = 0, linewidth = 0.35 ) + ggplot2::geom_point(size = 2) + ggplot2::facet_wrap(~ panel_id, scales = "free_y", ncol = 1) + ggplot2::scale_colour_manual(values = cbf_palette) + ggplot2::labs( x = "Effect (beta) with 95% CI", y = NULL, colour = "Data type", title = "TREM2 \u2014 Alzheimer's disease" ) + ggplot2::theme_bw() + ggplot2::theme( strip.text = ggplot2::element_text(face = "bold"), panel.grid.minor = ggplot2::element_blank(), legend.position = "bottom", legend.direction = "vertical" ) + ggplot2::guides(colour = ggplot2::guide_legend(ncol = 1)) ``` TREM2 shows colocalisation evidence from multiple independent variants, supported by rare result groups. The direction of effect appears complex — in some instances increased TREM2 expression associates with increased AD risk, while in others the direction is reversed — reflecting the biological complexity of TREM2 signalling in neuroinflammation. Nonetheless, the convergence of evidence from multiple loci and data types makes it a compelling target, and it is indeed currently under active investigation as a therapeutic target for Alzheimer's disease. ## Putting it together This four-step workflow uses the GPMap to move from a broad disease search to a prioritised shortlist of gene targets: | Step | Question | GPMap function | |------|----------|----------------| | 1 | Which genes colocalise with my disease? | `search_gpmap()` → `trait()` | | 2 | Which are most specific (least pleiotropic)? | `get_all_gene_pleiotropies()` | | 3 | Do variants map cleanly to one cis-gene? | `coloc_groups$gene_id` per `coloc_group_id` | | 4 | Is the evidence replicated across loci? | `gene()` → count coloc groups, variants, rare results | A gene that scores well on all four criteria — colocalises with AD, has low pleiotropy, is supported by high-specificity variants, and is replicated across multiple loci including rare variants — is a strong candidate for target validation and drug development. # Accessing summary statistics You can only access the summary statistics via `variant()`, by including `include_summary_stats = TRUE` in the call. This is to limit the amount of data that can be returned from the API for cost reasons. If you attempt to download too many summary statistics, you may be blocked from the API. * `variant(1669064, include_summary_stats = TRUE)` Instead, if you are interested in accessing all the summary statistics, or the whole database of results, you can download all databases and summary statistics the [Google Cloud Storage Bucket](https://console.cloud.google.com/storage/browser/genotype-phenotype-map), and can find an [explanation of the files here](https://github.com/MRCIEU/genotype-phenotype-map/wiki/Public-Data-Bucket). Note that the bucket is 'requester pays', so you will need to have a Google Cloud account and be logged in to download the files.