The gwasglue2 package sits between the data and the analytical methods within the OpenGWAS ecosystem.
Here are some example ‘structures’ of summary datasets that it should be able to handle.
Note that the whole OpenGWAS database is an example of a ‘complete’ dataset - all traits and all variants. What we typically want to represent in gwasglue2 is a specific slice of that data, which is almost always going to be rectangular - a set of variants across a set of traits. In addition, we need to be able to flexibly associate metadata or other forms of annotation (LD matrices, genomic annotations etc.) with the data. Finally, analytical tools need to be aware of these structures in order to easily deploy methods upon them.
This is typically how a workflow might look:
We should be able to ingest data from various locations, and then the class provides a number of different ways to manipulate and annotate the data.
A set of genetic associations (1 or more) for a single trait is known as a `SummarySet`. This class should contain the data, plus `MetaData` that describes the source study, plus a list of `Concept`s that annotate the summary data.
A `Concept` is a named list of extra information that can be linked to the data, for example LD matrices, annotations relating to how to use the data in a model (e.g. exposure or outcome), etc.
A `DataSet` is a set of merged and harmonised `SummarySet`s. For example, if two `SummarySet`s are generated using the same set of variants for two traits, then putting them together into a `DataSet` will find the intersection of the available variants and harmonise them to be on the same effect allele. `DataSet` objects can also have a global `Concept` list for extra data that is common across all the `SummarySet`s stored in the object.
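The relationship between these classes can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the package's own implementation: the field names, the dictionary layout and the `build_dataset` helper are all hypothetical. It also assumes that every `SummarySet` has already been standardised, so that a shared variant ID implies the same effect allele and only the intersection step remains.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SummarySet:
    """Associations for one trait (hypothetical layout, keyed by variant ID)."""
    associations: Dict[str, Dict[str, Any]]
    metadata: Dict[str, Any] = field(default_factory=dict)   # source study info
    concepts: Dict[str, Any] = field(default_factory=dict)   # per-trait annotations


@dataclass
class DataSet:
    """A collection of SummarySets restricted to their shared variants."""
    summary_sets: List[SummarySet]
    concepts: Dict[str, Any] = field(default_factory=dict)   # global annotations


def build_dataset(summary_sets: List[SummarySet]) -> DataSet:
    """Keep only the variants that are present in every SummarySet."""
    shared = set(summary_sets[0].associations)
    for s in summary_sets[1:]:
        shared &= set(s.associations)
    trimmed = [
        SummarySet(
            associations={v: s.associations[v] for v in sorted(shared)},
            metadata=dict(s.metadata),
            concepts=dict(s.concepts),
        )
        for s in summary_sets
    ]
    return DataSet(summary_sets=trimmed)


# Two traits measured on overlapping, but not identical, variant sets.
exposure = SummarySet(
    associations={"1:123_A_G": {"beta": 0.10}, "2:456_C_T": {"beta": 0.20}},
    metadata={"trait": "trait 1"},
    concepts={"mr_label": "exposure"},
)
outcome = SummarySet(
    associations={"1:123_A_G": {"beta": -0.05}},
    metadata={"trait": "trait 2"},
    concepts={"mr_label": "outcome"},
)
dataset = build_dataset([exposure, outcome])
shared_ids = list(dataset.summary_sets[0].associations)
```

Here only `1:123_A_G` survives, because it is the only variant present for both traits. Per-trait annotations stay on each `SummarySet`, while anything common to all traits would go in the `DataSet`-level `concepts`.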
Initial visualisation below:
Currently planning to implement as an S4 class, but may move into R7 in the future once it is stable and compatible with roxygen etc.
Association data within a `SummarySet` should have the following fields -
Metadata fields -
Accession info should be generated upon creation:
`gwasglue2` has several constructors to build the `SummarySet` and `DataSet` objects, as well as getter and setter methods to add or retrieve information within them.
To build the `SummarySet-class` object, we need just one constructor function, `create_summaryset()`. It calls a different `create_summaryset_from_()` function depending on the source type of the GWAS summary data, and also creates metadata using information from the summary data if none is given. It is also possible to build a metadata object using `create_metadata()` and pass it to `create_summaryset()`. Fig. 5 shows the structure design of `SummarySet`, with the getters and setters on the left and constructors on the right.
For the `DataSet-class` object we use the `create_dataset()` constructor. If the argument `harmonise = TRUE` is given, it creates a harmonised `DataSet`. We use the function `harmonise_ld()` to harmonise the `DataSet` against an LD matrix from a reference population. The `meta_analysis()` function statistically combines two or more `SummarySet`s within the `DataSet` and creates a new `SummarySet` object. With `add_summaryset()` it is possible to add an existing `SummarySet` to the `DataSet` and harmonise again. Two or more `DataSet`s can also be merged together with `merge_datasets()`. The structure design for `DataSet` is in Fig. 6, with all the getters and setters on the left and constructors on the right.
Unlike `gwasglue`, `gwasglue2` does not use `rsids` when harmonising the data or in other analyses. Instead, a `variantid` is created using the chromosome (chr), position (pos), effect allele (ea) and non-effect allele (nea), taking the form `chr:pos_ea_nea`. If an allele has more than 10 characters (e.g. indels), `gwasglue2` will hash it using the `murmur32` algorithm from the R `digest` package.
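As a rough illustration of this ID scheme, the sketch below builds a `chr:pos_ea_nea` string and replaces any allele longer than 10 characters with a short hash. It is not the package's code. The function names are hypothetical, and the `#` prefix on hashed alleles is inferred from the hashed example in the table of standardised records. The hash itself is a stand-in: truncated MD5 from the Python standard library is used here instead of `murmur32`, so the digests will not match those that `gwasglue2` produces.

```python
import hashlib

MAX_ALLELE_CHARS = 10  # alleles longer than this are hashed


def encode_allele(allele: str) -> str:
    """Return the allele unchanged, or a short hash if it is too long."""
    if len(allele) <= MAX_ALLELE_CHARS:
        return allele
    # Stand-in hash (assumption): gwasglue2 uses murmur32 via the R digest
    # package; truncated MD5 is used here only to keep the sketch stdlib-only.
    digest = hashlib.md5(allele.encode("ascii")).hexdigest()[:8]
    return "#" + digest


def make_variant_id(chrom, pos, ea: str, nea: str) -> str:
    """Build an ID of the form chr:pos_ea_nea."""
    return f"{chrom}:{pos}_{encode_allele(ea)}_{encode_allele(nea)}"


snp_id = make_variant_id(1, 123, "A", "G")
indel_id = make_variant_id(1, 123, "A", "GTAGTAGTAGTA")
print(snp_id)
print(indel_id)
```

The point of hashing is to keep IDs at a bounded length even for long indels, while the same allele string always maps to the same ID.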
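The GWAS summary data is also standardised when creating a `SummarySet`: the alleles are put in alphabetical order, so the effect allele is always the first alphabetically. When this swaps the two alleles, the effect allele frequency and the sign of the effect estimate are flipped to match, as the following examples show.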
DATA | CHR | POS | EA | NEA | EA Frequency | BETA | variantid |
---|---|---|---|---|---|---|---|
raw | 1 | 123 | A | G | 0.3 | 1 | |
gwasglue2 | 1 | 123 | A | G | 0.3 | 1 | 1:123_A_G |
raw | 1 | 123 | G | A | 0.3 | 1 | |
gwasglue2 | 1 | 123 | A | G | 0.7 | -1 | 1:123_A_G |
raw | 1 | 123 | GTAGTAGTAGTA | A | 0.3 | 1 | |
gwasglue2 | 1 | 123 | A | GTAGTAGTAGTA | 0.7 | -1 | 1:123_A_#bf37641c |
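The allele swap shown in the table can be sketched as follows. This is an illustrative Python function, not the package's implementation. The name `standardise` and its argument order are hypothetical, and plain string comparison is assumed to be what "alphabetical" means here.

```python
def standardise(ea: str, nea: str, eaf: float, beta: float):
    """Order alleles alphabetically, flipping frequency and effect if swapped."""
    if ea <= nea:
        # Already in alphabetical order: nothing to change.
        return ea, nea, eaf, beta
    # Swap alleles; the frequency now refers to the other allele,
    # and the effect estimate changes sign.
    return nea, ea, 1 - eaf, -beta


unchanged = standardise("A", "G", 0.3, 1.0)
flipped = standardise("G", "A", 0.3, 1.0)
flipped_indel = standardise("GTAGTAGTAGTA", "A", 0.3, 1.0)
print(unchanged)
print(flipped)
print(flipped_indel)
```

The first call corresponds to the first pair of rows in the table (no change), and the other two correspond to the rows where the raw effect allele sorts after the non-effect allele.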