Title: | Slope-Hunter: A ROBUST METHOD FOR COLLIDER BIAS CORRECTION IN CONDITIONAL GENOME-WIDE ASSOCIATION STUDIES |
---|---|
Description: | Studying genetic associations with prognosis (e.g. survival, disability, subsequent disease events) is problematic due to selection bias - also termed index event bias or collider bias - whereby selection on disease status can induce associations between causes of incidence with prognosis. The Slope-Hunter approach adjusts genetic associations for this bias assuming that the contribution of the set of genetic variants affecting incidence only to the heritability of incidence is at least as large as the contribution of those affecting both incidence and prognosis. |
Authors: | Osama Mahmoud [aut, cre, cph] |
Maintainer: | Osama Mahmoud <[email protected]> |
License: | GPL-2 | GPL-3 |
Version: | 1.1.0 |
Built: | 2024-12-26 05:13:19 UTC |
Source: | https://github.com/Osmahmoud/SlopeHunter |
A simulated dataset for 10,000 independent variables (e.g. SNPs) consisting of regression coefficients on incidence and prognosis, with their standard errors. Among all the SNPs, 5% (500 variables) have effects on incidence only, 5% (500 variables) on prognosis only, and 5% have correlated effects on both with a correlation coeficient of '-0.5'. The estimates are obtained from linear regression in a simulated dataset of 20,000 individuals.
data_example
data_example
A data frame with 10,000 rows and 5 variables:
Regression coefficient on incidence
Standard error of xbeta
Regression coefficient on prognosis
Standard error of ybeta
P-value of the association with prognosis
# Load the \code{SlopeHunter} package require(SlopeHunter) # Load the input data set data(data_example, package = "SlopeHunter") head(data_example) # Implement the Slope-Hunter method Sh.Model <- hunt(dat = data_example, xbeta_col="xbeta", xse_col="xse", ybeta_col="ybeta", yse_col="yse", yp_col="yp", xp_thresh = 0.001, Bootstrapping = TRUE, show_adjustments = TRUE, seed=2021) # [1] "Estimated slope: -0.274120383700514" # [1] "SE of the slope: 0.0229566376478153" # [1] "95% CI: -0.319115393490232, -0.229125373910796" # Display the estimated slope (adjustment factor) Sh.Model$b # [1] -0.2741204 # Extract information about cluster memberships of SNPs included in the analysis Adj <- Sh.Model$Fit # Show the first 6 values of the unadjusted estimated effects on prognosis head(data_example$ybeta) # [1] -0.0092889266 0.0005575032 0.0112203795 -0.0095533069 0.0082635203 0.0026550045 # Show results of the first 6 corrected variants: head(Sh.Model$est) # xbeta xse ybeta yse yp xp SNP ybeta_adj yse_adj yp_adj # 1 -0.007 0.007 -0.009 0.006 0.136 0.300 snp1 -0.011 0.006 0.083 # 2 0.014 0.007 0.000 0.006 0.928 0.042 snp2 0.004 0.006 0.492 # 3 -0.011 0.007 0.011 0.006 0.072 0.097 snp3 0.008 0.006 0.220 # 4 0.004 0.007 -0.009 0.006 0.125 0.493 snp4 -0.008 0.006 0.208 # 5 -0.025 0.007 0.008 0.006 0.185 0.000 snp5 0.001 0.006 0.851 # 6 0.013 0.007 0.002 0.006 0.670 0.054 snp6 0.006 0.006 0.329 # Generate an interactive plot for the estimated clusters (hover on the data points to view info) require(ggplot2) require(plotly) ggplotly(Sh.Model$plot)
# Load the \code{SlopeHunter} package require(SlopeHunter) # Load the input data set data(data_example, package = "SlopeHunter") head(data_example) # Implement the Slope-Hunter method Sh.Model <- hunt(dat = data_example, xbeta_col="xbeta", xse_col="xse", ybeta_col="ybeta", yse_col="yse", yp_col="yp", xp_thresh = 0.001, Bootstrapping = TRUE, show_adjustments = TRUE, seed=2021) # [1] "Estimated slope: -0.274120383700514" # [1] "SE of the slope: 0.0229566376478153" # [1] "95% CI: -0.319115393490232, -0.229125373910796" # Display the estimated slope (adjustment factor) Sh.Model$b # [1] -0.2741204 # Extract information about cluster memberships of SNPs included in the analysis Adj <- Sh.Model$Fit # Show the first 6 values of the unadjusted estimated effects on prognosis head(data_example$ybeta) # [1] -0.0092889266 0.0005575032 0.0112203795 -0.0095533069 0.0082635203 0.0026550045 # Show results of the first 6 corrected variants: head(Sh.Model$est) # xbeta xse ybeta yse yp xp SNP ybeta_adj yse_adj yp_adj # 1 -0.007 0.007 -0.009 0.006 0.136 0.300 snp1 -0.011 0.006 0.083 # 2 0.014 0.007 0.000 0.006 0.928 0.042 snp2 0.004 0.006 0.492 # 3 -0.011 0.007 0.011 0.006 0.072 0.097 snp3 0.008 0.006 0.220 # 4 0.004 0.007 -0.009 0.006 0.125 0.493 snp4 -0.008 0.006 0.208 # 5 -0.025 0.007 0.008 0.006 0.185 0.000 snp5 0.001 0.006 0.851 # 6 0.013 0.007 0.002 0.006 0.670 0.054 snp6 0.006 0.006 0.329 # Generate an interactive plot for the estimated clusters (hover on the data points to view info) require(ggplot2) require(plotly) ggplotly(Sh.Model$plot)
Check and download PLINK 1.90 executable suitable for the operating system, and return its path Inspired by https://github.com/MRCIEU/genetics.binaRies
download_plink()
download_plink()
Reads in and format input data. It checks and organises columns for Slope-Hunter analyses. Infers p-values when possible from beta and se.
format_data( dat, type = "incidence", snps = NULL, snp_col = "SNP", beta_col = "BETA", se_col = "SE", pval_col = "PVAL", eaf_col = "EAF", effect_allele_col = "EA", other_allele_col = "OA", gene_col = "GENE", chr_col = "CHR", pos_col = "POS", min_pval = 1e-200, log_pval = FALSE )
format_data( dat, type = "incidence", snps = NULL, snp_col = "SNP", beta_col = "BETA", se_col = "SE", pval_col = "PVAL", eaf_col = "EAF", effect_allele_col = "EA", other_allele_col = "OA", gene_col = "GENE", chr_col = "CHR", pos_col = "POS", min_pval = 1e-200, log_pval = FALSE )
dat |
Data frame. Must have header with at least the |
type |
Is this the incidence or the prognosis data that is being read in? The default is |
snps |
SNPs to extract. If NULL, then it keeps all. The default is |
snp_col |
Required name of column with SNP rs IDs. The default is |
beta_col |
Required name of column with effect sizes. The default is |
se_col |
Required name of column with standard errors. The default is |
pval_col |
Name of column with p-value (optional). The default is |
eaf_col |
Name of column with effect allele frequency (optional). The default is |
effect_allele_col |
Required for harmonisation. Name of column with effect allele. Must be "A", "C", "T" or "G". The default is |
other_allele_col |
Required for harmonisation. Name of column with non-effect allele. Must be "A", "C", "T" or "G". The default is |
gene_col |
Optional column for gene name. The default is |
chr_col |
Optional column for chromosome number. The default is |
pos_col |
Optional column for SNP position. The default is |
min_pval |
Minimum allowed p-value. The default is |
log_pval |
The p-value is -log10(P). The default is |
data frame
Harmonise the alleles and effects between the incidence and prognosis (inspired by https://github.com/MRCIEU/TwoSampleMR/blob/master/R/harmonise.R)
harmonise_effects( incidence_dat, prognosis_dat, incidence_formatted = TRUE, prognosis_formatted = TRUE, by.pos = FALSE, pos_cols = c("POS.incidence", "POS.prognosis"), snp_cols = c("SNP", "SNP"), beta_cols = c("BETA.incidence", "BETA.prognosis"), se_cols = c("SE.incidence", "SE.prognosis"), EA_cols = c("EA.incidence", "EA.prognosis"), OA_cols = c("OA.incidence", "OA.prognosis"), chr_cols = c("CHR.incidence", "CHR.prognosis"), gene_col = c("GENE.incidence", "GENE.prognosis") )
harmonise_effects( incidence_dat, prognosis_dat, incidence_formatted = TRUE, prognosis_formatted = TRUE, by.pos = FALSE, pos_cols = c("POS.incidence", "POS.prognosis"), snp_cols = c("SNP", "SNP"), beta_cols = c("BETA.incidence", "BETA.prognosis"), se_cols = c("SE.incidence", "SE.prognosis"), EA_cols = c("EA.incidence", "EA.prognosis"), OA_cols = c("OA.incidence", "OA.prognosis"), chr_cols = c("CHR.incidence", "CHR.prognosis"), gene_col = c("GENE.incidence", "GENE.prognosis") )
incidence_dat |
data.table for incidence data. It is recommended to be an output from |
prognosis_dat |
data.table for prognosis data. It is recommended to be an output from |
incidence_formatted |
Logical indicationg whether |
prognosis_formatted |
Logical indicationg whether |
by.pos |
Logical, if |
pos_cols |
A vector of length 2 specifying the name of the genetic position columns in the incidence and prognosis datasets respectively. |
snp_cols |
A vector of length 2 specifying the name of the snp columns in the incidence and prognosis datasets respectively. This is the column on which the data will be merged if |
beta_cols |
A vector of length 2 specifying the name of the beta columns in the incidence and prognosis datasets respectively. |
se_cols |
A vector of length 2 specifying the name of the se columns in the incidence and prognosis datasets respectively. |
EA_cols |
A vector of length 2 specifying the name of the effect allele columns in the incidence and prognosis datasets respectively. |
OA_cols |
A vector of length 2 specifying the name of the non-effect allele columns in the incidence and prognosis datasets respectively. |
chr_cols |
A vector of length 2 specifying the name of the chromosome columns in the incidence and prognosis datasets respectively. |
gene_col |
A vector of length 2 specifying the name of the gene columns in the incidence and prognosis datasets respectively. |
In order to perform Slope-Hunter analysis the effect of a SNP on an incidence and prognosis traits must be harmonised to be relative to the same allele.
This function will try to harmonise the incidence and prognosis data sets on the specified columns. Where necessary, correct strand for non-palindromic SNPs (i.e. flip the sign of effects so that the effect allele is the same in both datasets), and drop all palindromic SNPs from the analysis (i.e. with the allele A/T or G/C). The alleles that do not match between data sets (e.g T/C in one data set and A/C in the other) will also be dropped.
A data.frame with harmonised effects and alleles
Estimate collider bias
hunt( dat, snp_col = "SNP", xbeta_col = "BETA.incidence", xse_col = "SE.incidence", xp_col = "Pval.incidence", ybeta_col = "BETA.prognosis", yse_col = "SE.prognosis", yp_col = "Pval.prognosis", xp_thresh = 0.001, init_pi = 0.6, init_sigmaIP = 1e-05, Bootstrapping = TRUE, M = 100, seed = 777, Plot = TRUE, show_adjustments = FALSE )
hunt( dat, snp_col = "SNP", xbeta_col = "BETA.incidence", xse_col = "SE.incidence", xp_col = "Pval.incidence", ybeta_col = "BETA.prognosis", yse_col = "SE.prognosis", yp_col = "Pval.prognosis", xp_thresh = 0.001, init_pi = 0.6, init_sigmaIP = 1e-05, Bootstrapping = TRUE, M = 100, seed = 777, Plot = TRUE, show_adjustments = FALSE )
dat |
Data frame. Must have header with at least the |
snp_col |
Name of column with SNP IDs. |
xbeta_col |
Required name of column with effects on the incidence trait. |
xse_col |
Required name of column with standard errors of |
xp_col |
Name of column with p-value for |
ybeta_col |
Required name of column with unadjusted effects on the prognosis trait. |
yse_col |
Required name of column with standard errors of |
yp_col |
Name of column with p-value for |
xp_thresh |
p-value threshold for SNP-incidence associations. Effects with p-values larger than |
init_pi |
initial value for the weight of the mixture component that represents the cluster of SNPs affecting x only. |
init_sigmaIP |
initial value for the covariance between x and y. |
Bootstrapping |
Logical, if TRUE estimate the standard error of the adjustment factor using the Bootstrap method. |
M |
Number of bootstrap samples drawn to estimate the standard error of the adjustment factor. |
seed |
Random number seed used for drawing the bootstrap samples. |
Plot |
Logical, if TRUE (the default), calling the function should plot the final clusters. |
show_adjustments |
Logical indicating whether to show adjusted effects of the given SNPs in the outputs. |
List of the following:
est: estimated adjusted associations, their standard errors and p-values (only if show_adjustments
is TRUE).
b: The estimated slope (adjustment factor).
bse: Standard error of the estimated slope.
b_CI: 95\
pi: Estimated probability of the mixture component of SNPs affecting only incidence.
entropy: The entropy of the estimated clusters.
plot: Generated plot of the SlopeHunter fitted model.
Fit: a Data frame summarising the fitted model-based clustering with the following columns:
cluster: cluster of the variants defined as follows:
Hunted
= assigned to the cluster of SNPs affecting only incidence.
Pleiotropic
= assigned to the cluster affecting both incidence and prognosis - i.e. variants that affect incidence and have direct effect on prognosis.
pt and p0: membership probabilities of the variants for the hunted and pleiotropic clusters respectively.
associations of variants with x and y, their standard errors and p-values.
iter: Number of the EM algorithm's iterations.
Bts.est: Details on the bootstrap estimate of the standard error of the adjustment factor, if Bootstrapping
is TRUE.
clump function using local plink binary and ld reference dataset This function is modified from: https://github.com/MRCIEU/ieugwasr/blob/master/R/ld_clump.R
ld_local(dat, clump_kb = 250, clump_r2 = 0.1, clump_p1 = 1, bfile)
ld_local(dat, clump_kb = 250, clump_r2 = 0.1, clump_p1 = 1, bfile)
dat |
Dataframe. Must have a variant name column ( |
clump_kb |
Clumping window, default is |
clump_r2 |
Clumping r-squared threshold, default is |
clump_p1 |
Clumping sig level for index SNPs, default is |
bfile |
Path to the bed/bim/fam LD reference (e.g. "1kg.v3/EUR" for local 1000 EUR ref. population file). |
Uses PLINK clumping method ('–clump' command), where a greedy search algorithm is implemented to randomly select a variant (or the variant with the lowest p-value, if a user wish to), referred to as the index SNP, and remove all variants within a certain kb distance in linkage disequilibrium with the index SNP, based on an r-squared threshold from the 1000 Genomes reference panel phase 3 data. Then repeats until no variants are left.
LD_prune( dat, clump_kb = 250, clump_r2 = 0.1, Random = TRUE, clump_p1 = 1, local = FALSE, ref_pop = "EUR", ref_bfile, seed = 77777 )
LD_prune( dat, clump_kb = 250, clump_r2 = 0.1, Random = TRUE, clump_p1 = 1, local = FALSE, ref_pop = "EUR", ref_bfile, seed = 77777 )
dat |
Output from |
clump_kb |
Clumping window, default is |
clump_r2 |
Clumping r-squared threshold, default is |
Random |
Logical, if |
clump_p1 |
Clumping sig level for index SNPs, default is |
local |
Logical, if |
ref_pop |
Super-population to use as reference panel at the API (when |
ref_bfile |
Path to the bed/bim/fam LD reference (e.g. "1kg.v3/EUR" for local 1000 EUR ref. population file). If |
seed |
Random number seed for random pruning |
Data frame
Plotting model for Slope-Hunter clustering
## S3 method for class 'SH' plot( x, what = c("clusters", "classification", "uncertainty", "density"), xlab = NULL, ylab = NULL, addEllipses = TRUE, main = FALSE, ... )
## S3 method for class 'SH' plot( x, what = c("clusters", "classification", "uncertainty", "density"), xlab = NULL, ylab = NULL, addEllipses = TRUE, main = FALSE, ... )
x |
Output from |
what |
A string specifying the type of graph requested. Available choices are: "clusters": showing clusters. The plot can display membership probabilities of each variable (e.g. SNP) to the target cluster (G1) by hovering over the points. "classification": A plot showing point assigned to each cluster (class). "uncertainty": A plot of classification uncertainty. "density": A plot of estimated density. |
xlab |
Optional label for the x-axis in case of "classification", "uncertainty", or "density" plots. |
ylab |
Optional label for the y-axis in case of "classification", "uncertainty", or "density" plots. |
addEllipses |
A logical indicating whether or not to add ellipses with axes corresponding to the within-cluster covariances in case of "classification" or "uncertainty" plots. |
main |
A logical or NULL indicating whether or not to add a title to the plot identifying the type of plot drawn in case of "classification", "uncertainty", or "density" plots. |
... |
Other graphics parameters. |
Reads in incidence data. Checks and organises columns for use with the Slope-Hunter analyses. Infers p-values when possible from beta and se.
read_incidence( filename, snp_col = "SNP", beta_col = "BETA", se_col = "SE", pval_col = "PVAL", eaf_col = "EAF", effect_allele_col = "EA", other_allele_col = "OA", gene_col = "GENE", chr_col = "CHR", pos_col = "POS", min_pval = 1e-200, log_pval = FALSE )
read_incidence( filename, snp_col = "SNP", beta_col = "BETA", se_col = "SE", pval_col = "PVAL", eaf_col = "EAF", effect_allele_col = "EA", other_allele_col = "OA", gene_col = "GENE", chr_col = "CHR", pos_col = "POS", min_pval = 1e-200, log_pval = FALSE )
filename |
Filename (formatted as .gz, .csv or .txt). Must have header with at least the |
snp_col |
Required name of column with SNP rs IDs. The default is |
beta_col |
Required name of column with effect sizes. The default is |
se_col |
Required name of column with standard errors. The default is |
pval_col |
Name of column with p-value (optional). The default is |
eaf_col |
Name of column with effect allele frequency (optional). The default is |
effect_allele_col |
Required for harmonisation. Name of column with effect allele. Must be "A", "C", "T" or "G". The default is |
other_allele_col |
Required for harmonisation. Name of column with non-effect allele. Must be "A", "C", "T" or "G". The default is |
gene_col |
Optional column for gene name. The default is |
chr_col |
Optional column for chromosome number. The default is |
pos_col |
Optional column for SNP position. The default is |
min_pval |
Minimum allowed p-value. The default is |
log_pval |
The p-value is -log10(P). The default is |
data frame
Reads in prognosis data. Checks and organises columns for use with the Slope-Hunter analyses. Infers p-values when possible from beta and se.
read_prognosis( filename, snp_col = "SNP", beta_col = "BETA", se_col = "SE", pval_col = "PVAL", eaf_col = "EAF", effect_allele_col = "EA", other_allele_col = "OA", gene_col = "GENE", chr_col = "CHR", pos_col = "POS", min_pval = 1e-200, log_pval = FALSE )
read_prognosis( filename, snp_col = "SNP", beta_col = "BETA", se_col = "SE", pval_col = "PVAL", eaf_col = "EAF", effect_allele_col = "EA", other_allele_col = "OA", gene_col = "GENE", chr_col = "CHR", pos_col = "POS", min_pval = 1e-200, log_pval = FALSE )
filename |
Filename. Must have header with at least the |
snp_col |
Required name of column with SNP rs IDs. The default is |
beta_col |
Required name of column with effect sizes. The default is |
se_col |
Required name of column with standard errors. The default is |
pval_col |
Name of column with p-value (optional). The default is |
eaf_col |
Name of column with effect allele frequency (optional). The default is |
effect_allele_col |
Required for harmonisation. Name of column with effect allele. Must be "A", "C", "T" or "G". The default is |
other_allele_col |
Required for harmonisation. Name of column with non-effect allele. Must be "A", "C", "T" or "G". The default is |
gene_col |
Optional column for gene name. The default is |
chr_col |
Optional column for chromosome number. The default is |
pos_col |
Optional column for SNP position. The default is |
min_pval |
Minimum allowed p-value. The default is |
log_pval |
The p-value is -log10(P). The default is |
data frame
Correct index event bias for new data
SHadj( x, dat, snp_col = "SNP", xbeta_col = "BETA.incidence", xse_col = "SE.incidence", ybeta_col = "BETA.prognosis", yse_col = "SE.prognosis" )
SHadj( x, dat, snp_col = "SNP", xbeta_col = "BETA.incidence", xse_col = "SE.incidence", ybeta_col = "BETA.prognosis", yse_col = "SE.prognosis" )
x |
an pbject of the class SH obtained from the |
dat |
A data.frame with harmonised effects and alleles, formatted using the |
snp_col |
Name of column with SNP IDs. |
xbeta_col |
Required name of column with effects on the incidence trait. |
xse_col |
Required name of column with standard errors of |
ybeta_col |
Required name of column with unadjusted effects on the prognosis trait. |
yse_col |
Required name of column with standard errors of |
data.frame with adjusted estimates
Implement the EM algorithm for the SlopeHunter model-based clustering
shclust(gwas, pi0, sxy1)
shclust(gwas, pi0, sxy1)
gwas |
a data frame with columns: xbeta; xse; ybeta; yse. |
pi0 |
initial value for the weight of the mixture component that represents the cluster of SNPs affecting x only. |
sxy1 |
initial value for the covariance between x and y. |
EM fit for SlopeHunter estimator