Clumping GWAS results aims to retain only independent variants,
accounting for linkage disequilibrium (highly correlated signals). The
clump()
function uses PLINK2 in the background and requires
a genome reference dataset with linkage disequilibrium information. This
reference must contain the same variant IDs as your GWAS results. It is
recommended to first annotate your GWAS with rsIDs - this can be
achieved with the chrpos_to_rsid()
function if your data is
not already annotated. The data is also expected in genepi.utils
standard form, so run standardise_gwas()
if needed. The
standardisation and rsID annotation can also be performed in one step
with the standardise_gwas()
function, by passing an
appropriate populate_rsid
argument.
A common reference dataset is the 1000 genomes phase 3 which can be downloaded from the PLINK (https://www.cog-genomics.org/plink/2.0/resources#phase3_1kg). Important: if you use a different dataset ensure that the IDs match those in your dataset).
To ready the PLINK2 pfiles for use with clumping you can run the following, as there appear to be some duplicate rsIDs in this dataset that PLINK2 does not like.
The path to your processed reference files
(path/to/plink_ref/all_phase3_nodup
) can be directly passed
the the clump()
function as the plink_ref
argument, or, you can configure the path in the package by using
set_1000G_reference()
- this only needs to be done once per
install. The currently set reference path can be queried with
which_1000G_reference()
.
You will need PLINK2 installed on your computer. The
clump()
function takes a path to the plink executable as
the plink2
argument. Alternatively, you can configure the
path in the package by using set_plink2()
- this only needs
to be done once per install. The currently set PLINK2 path can be
queried with which_plink2()
. If the plink2
argument is set to NULL
the function will attempt to call
plink directly, equivalent of running plink2 --options
directly in your command line.
Running the clumping procedure will annotate your GWAS with two columns:
index
: a logical column indicating whether the variant
is an index/lead variant. If you only want clumped/index SNPs you should
filter on this e.g. gwas[index==TRUE, ]
clump
: an integer column indicating the clump that the
variant belongs to.