Package 'mrclust'

Title: Identifying Clustered Heterogeneity in Mendelian Randomization Analyses
Description: Performs likelihood based clustering on univariate observations with known uncertainty (via standard error data), whilst accounting for possible null and junk components in the sample.
Authors: Christopher Neal Foley
Maintainer: The package maintainer <[email protected]>
License: GPL-3
Version: 0.1.0
Built: 2024-09-26 04:45:09 UTC
Source: https://github.com/cnfoley/mrclust

Help Index


Genetic association data from diastolic blood pressure (DBP) and coronary artery disease (CAD) GWAS.

Description

A dataset containing chromosome position, rsID and allele information, together with estimates of the regression coefficients and their associated standard errors from the DBP and CAD GWAS.

Usage

DBP_CAD

Format

A data frame with 119 rows and 8 variables:

chr.pos

chromosome position

rsid

RSID

bx

estimated regression coefficient with risk-factor, DBP

bxse

standard error of estimated regression coefficient with risk-factor, DBP

by

estimated regression coefficient with outcome, CAD

byse

standard error of estimated regression coefficient with outcome, CAD

a1

a1 allele

a2

a2 allele

Source

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6284793/

http://www.phenoscanner.medschl.cam.ac.uk


MR-Clust mixture model fitting

Description

Assessment of clustered heterogeneity in Mendelian randomization analyses using expectation-maximisation (EM) based model fitting of the MR-Clust mixture model. Function output includes both data tables and a visualisation of the assignment of variants to clusters.

Usage

mr_clust_em(
  theta,
  theta_se,
  bx,
  by,
  bxse,
  byse,
  obs_names = NULL,
  max_iter = 5000,
  tol = 1e-05,
  junk_sd = NULL,
  junk_mean = 0,
  stop_bic_iter = 5,
  min_clust_search = 10,
  results_list = list("all", "best"),
  cluster_membership = list(by_prob = 0.1, bound = 0),
  plot_results = list("best", min_pr = 0.5),
  trait_search = FALSE,
  trait_pvalue = 1e-05,
  proxy_r2 = 0.8,
  catalogue = "GWAS",
  proxies = "None",
  build = 37
)

Arguments

theta

numeric vector of length the number of variants, the i-th element is a ratio-estimate for the i-th genetic variant.

theta_se

numeric vector of length the number of variants, the i-th element is the standard error of the ratio-estimate for the i-th genetic variant.

bx

numeric vector of length the number of variants, the i-th element is the estimated regression coefficient - i.e. beta-x value - relating the i-th genetic variant to the risk-factor.

by

numeric vector of length the number of variants, the i-th element is the estimated regression coefficient - i.e. beta-y value - relating the i-th genetic variant to the outcome.

bxse

numeric vector of length the number of variants, the i-th element is the standard error of the estimated regression coefficient relating the i-th genetic variant to the risk-factor.

byse

numeric vector of length the number of variants, the i-th element is the standard error of the estimated regression coefficient relating the i-th genetic variant to the outcome.

obs_names

character vector of length the number of variants, the i-th element is the name of the i-th genetic variant - e.g. the rsID.

max_iter

numeric integer denoting the maximum number of iterations to take before stopping the EM-algorithm's search for a maximum in the log-likelihood.

tol

numeric scalar denoting the maximum absolute difference between two successive computations of the log-likelihood at which we accept that a maximum in the log-likelihood has been reached.

junk_sd

numeric scalar denoting the scale parameter of the generalised t-distribution.

junk_mean

numeric scalar denoting the mean of the generalised t-distribution. By default mean is set to zero.

stop_bic_iter

numeric integer I. For computational efficiency, particularly when analysing large numbers of variants, the EM-algorithm can be stopped if the BIC is monotonically increasing over the previous I increases in the number of clusters K. By default, evidence supporting at least 10 clusters in the data is computed; so, for example, if the BIC from the models which assume 6, 7, ..., 10 clusters is monotonically increasing in K, then the EM-algorithm is stopped and the model whose K minimises the BIC is returned.

min_clust_search

numeric integer which denotes the minimum number of clusters searched for in the data - default computes evidence supporting up to K=10 clusters which might explain any clustered heterogeneity in the data.

results_list

character list allowing users to choose whether to return a table with the variants assigned to: "all" of the clusters; a single "best" cluster or; both. By default we return both, i.e. results_list = list("all", "best").

cluster_membership

numeric list which allows users to output a list that, for each cluster, returns the variants assigned to the cluster, stratified by the probability of belonging to the cluster. By default, cluster_membership = list(by_prob = 0.1, bound = 0), so that MR-Clust returns a list which, for each cluster, outputs the variants assigned to the cluster with probability in (0.9,1); (0.8,0.9); ... and finally (0,0.1), i.e. in probability increments of 0.1 from 1 down to a lower bound of 0.

plot_results

numeric list which allows users to plot the output of MRClust. By default, plot_results = list("best", min_pr = 0.5); so that the best clustering is plotted with variants assigned to a cluster with probability above 0.5.

trait_search

logical, for each of the non-null and non-junk clusters search phenoscanner for traits associated with the variants.

trait_pvalue

numeric scalar for use with trait_search, representing the maximum p-value with which at least one variant in the cluster must be associated with a trait for that trait to be returned in the phenoscanner search. The default shown in the usage above is 1e-05, which is less stringent than the conventional genome-wide significance threshold of 5*10^-8.

proxy_r2

numeric scalar for use with trait search, allowing variants whose r2>=proxy_r2 to be included in the trait search. Default r2=0.8.

catalogue

character, for use with trait search. From Phenoscanner (http://www.phenoscanner.medschl.cam.ac.uk/information/) "the catalogue to be searched (options: None, GWAS, eQTL, pQTL, mQTL, methQTL)". Default setting is catalogue = "GWAS".

proxies

character, for use with trait search. From Phenoscanner (http://www.phenoscanner.medschl.cam.ac.uk/information/) "the proxies database to be searched (options: None, AFR, AMR, EAS, EUR, SAS)". Default setting is proxies = "None".

build

integer, for use with trait search. From Phenoscanner (http://www.phenoscanner.medschl.cam.ac.uk/information/) "Human genome build numbers (options: 37, 38; default: 37)". Default setting is build = 37.
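
The stratification described for cluster_membership can be illustrated with base R alone. The following is a minimal sketch, not the package's implementation: the toy probabilities, the cluster labels and the helper name stratify_by_prob are all hypothetical, and it only shows how allocation probabilities fall into bins of width by_prob above a lower bound.

```r
# Hypothetical helper (not part of mrclust): bin allocation probabilities
# into intervals of width `by_prob`, starting from the lower bound `bound`.
stratify_by_prob <- function(probability, by_prob = 0.1, bound = 0) {
  breaks <- seq(bound, 1, by = by_prob)
  # include.lowest = TRUE keeps a probability exactly equal to `bound`
  cut(probability, breaks = breaks, include.lowest = TRUE)
}

# Made-up allocation probabilities and cluster labels for four variants.
probability <- c(0.95, 0.85, 0.05, 0.97)
cluster <- c(1, 1, 1, 2)
variant <- c("rs1", "rs2", "rs3", "rs4")

bins <- stratify_by_prob(probability)

# For each cluster, list the variants falling into each probability bin.
membership <- lapply(split(data.frame(variant, bins), cluster), function(d) {
  split(as.character(d$variant), droplevels(d$bins))
})
print(membership)
```

With the defaults this gives ten bins; a larger bound would simply drop the lower-probability bins from the output.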

Value

Returned are: estimates of the putative number of clusters in the sample, complete with allocation probabilities and summaries of the association estimates for each variant; plots which visualise the allocation of variants to clusters; and several summaries of the fitting process, i.e. the BIC and likelihood estimates.
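
A typical call might look like the sketch below. The arguments theta and theta_se are ratio-estimates and their standard errors; this manual does not say how to obtain them, so the sketch assumes the usual first-order formulas theta = by / bx and theta_se = byse / |bx|. The helper ratio_estimates and the toy values are hypothetical. The call to mr_clust_em is only attempted if the package is installed, and it uses the bundled SBP_CAD data with all other arguments left at their defaults.

```r
# Hypothetical helper (not part of mrclust): first-order ratio estimates.
# Assumption: theta = by / bx and theta_se = byse / |bx|.
ratio_estimates <- function(bx, by, byse) {
  data.frame(theta = by / bx, theta_se = byse / abs(bx))
}

# Toy check with made-up association estimates for two variants.
toy <- ratio_estimates(bx = c(0.5, -0.25), by = c(0.1, 0.05),
                       byse = c(0.02, 0.01))
print(toy)

# Only run the model fit when mrclust is available.
if (requireNamespace("mrclust", quietly = TRUE)) {
  dat <- mrclust::SBP_CAD
  est <- ratio_estimates(dat$bx, dat$by, dat$byse)
  res <- mrclust::mr_clust_em(
    theta = est$theta, theta_se = est$theta_se,
    bx = dat$bx, by = dat$by,
    bxse = dat$bxse, byse = dat$byse,
    obs_names = dat$rsid
  )
  # The table of best cluster assignments is referred to elsewhere in this
  # manual as mr_clust_em$results$best.
  head(res$results$best)
}
```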


Genetic association data from pulse pressure (PP) and coronary artery disease (CAD) GWAS.

Description

A dataset containing chromosome position, rsID and allele information, together with estimates of the regression coefficients and their associated standard errors from the PP and CAD GWAS.

Usage

PP_CAD

Format

A data frame with 121 rows and 8 variables:

chr.pos

chromosome position

rsid

RSID

bx

estimated regression coefficient with risk-factor, PP

bxse

standard error of estimated regression coefficient with risk-factor, PP

by

estimated regression coefficient with outcome, CAD

byse

standard error of estimated regression coefficient with outcome, CAD

a1

a1 allele

a2

a2 allele

Source

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6284793/

http://www.phenoscanner.medschl.cam.ac.uk


Cluster size and assignment probabilities

Description

Keep results based on a minimum allocation probability and number of observations in a cluster.

Usage

pr_clust(dta, prob = 0.5, min_obs = 1)

Arguments

dta

table of results from mr_clust_em$results$best.

prob

numeric scalar, keep only variants assigned to clusters above this allocation probability.

min_obs

integer, keep only variants assigned to clusters with at least min_obs members.

Value

The results table, restricted to variants that meet the prob and min_obs thresholds.
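
The filtering described above can be mimicked in base R. The sketch below is an illustration under stated assumptions, not the package's code: the column names cluster and probability, the toy table and the helper name filter_clusters are hypothetical. It also assumes that cluster sizes are counted after the probability filter has been applied, which this manual does not specify.

```r
# Hypothetical re-implementation of the filtering idea (not mrclust's code).
# Assumes columns `cluster` and `probability` in the results table.
filter_clusters <- function(dta, prob = 0.5, min_obs = 1) {
  # Step 1: keep variants allocated with probability above `prob`.
  kept <- dta[dta$probability > prob, , drop = FALSE]
  # Step 2: keep clusters that still have at least `min_obs` members.
  sizes <- table(kept$cluster)
  big_enough <- names(sizes)[sizes >= min_obs]
  kept[as.character(kept$cluster) %in% big_enough, , drop = FALSE]
}

# Made-up results table for five variants in three clusters.
toy <- data.frame(
  observation = c("rs1", "rs2", "rs3", "rs4", "rs5"),
  cluster = c(1, 1, 2, 2, 3),
  probability = c(0.9, 0.8, 0.95, 0.3, 0.6)
)

filtered <- filter_clusters(toy, prob = 0.5, min_obs = 2)
print(filtered)
```

With the package installed, the corresponding call would be pr_clust(dta = res$results$best, prob = 0.5, min_obs = 2), where res is a fitted mr_clust_em object.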


Genetic association data from systolic blood pressure (SBP) and coronary artery disease (CAD) GWAS.

Description

A dataset containing chromosome position, rsID and allele information, together with estimates of the regression coefficients and their associated standard errors from the SBP and CAD GWAS.

Usage

SBP_CAD

Format

A data frame with 121 rows and 8 variables:

chr.pos

chromosome position

rsid

RSID

bx

estimated regression coefficient with risk-factor, SBP

bxse

standard error of estimated regression coefficient with risk-factor, SBP

by

estimated regression coefficient with outcome, CAD

byse

standard error of estimated regression coefficient with outcome, CAD

a1

a1 allele

a2

a2 allele

Source

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6284793/

http://www.phenoscanner.medschl.cam.ac.uk


Plotting clustered ratio-estimates

Description

Plot of the two-stage regression estimates, i.e. G-X and G-Y associations, annotated with cluster allocation labels and cluster mean estimates.

Usage

two_stage_plot(res, bx, by, bxse, byse, obs_names)

Arguments

res

table of results from mr_clust_em$results$best.

bx

numeric vector of length the number of variants, the i-th element is the estimated regression coefficient - i.e. beta-x value - relating the i-th genetic variant to the risk-factor.

by

numeric vector of length the number of variants, the i-th element is the estimated regression coefficient - i.e. beta-y value - relating the i-th genetic variant to the outcome.

bxse

numeric vector of length the number of variants, the i-th element is the standard error of the estimated regression coefficient relating the i-th genetic variant to the risk-factor.

byse

numeric vector of length the number of variants, the i-th element is the standard error of the estimated regression coefficient relating the i-th genetic variant to the outcome.

obs_names

character vector of length the number of variants, the i-th element is the name of the i-th genetic variant - e.g. the rsID.

Value

Returned is a scatter plot of the two-stage association estimates for each variant, in which clusters are colour coded and variants with larger assignment/inclusion probabilities appear larger.