This document details changes and new features specifically relating to the TwoSampleMR R package and the GWAS database behind it.
We have made a new system for naming datasets, and all datasets are organised into data batches. Either new datasets are uploaded one at a time, in which case they are added to the ieu-a data batch, or there is a bulk upload, in which case a new batch is created. For example, ukb-a is a bulk upload of the first round of the Neale lab UKBiobank GWAS, and ukb-b is the IEU GWAS analysis of the UKBiobank data. In most cases, a dataset is then numbered arbitrarily within the batch. For example, the Locke et al 2014 BMI analysis was previously known as 2, and it is now known as ieu-a-2.
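To illustrate the renaming, here is a minimal sketch of how a bare numeric ID could be translated into the new scheme. The helper name to_new_id is hypothetical and is not part of TwoSampleMR or ieugwasr; it only covers the numeric case described above (2 becomes ieu-a-2) and assumes every purely numeric ID belongs to the ieu-a batch.

```r
# Hypothetical helper for illustration only; not the packages' own translation code.
to_new_id <- function(ids) {
  ids <- as.character(ids)
  # Assumption: an ID made up only of digits is an 'old' ID from the ieu-a batch.
  is_legacy <- grepl("^[0-9]+$", ids)
  if (any(is_legacy)) {
    warning("Legacy numeric IDs found; translating them to the ieu-a batch")
  }
  ids[is_legacy] <- paste0("ieu-a-", ids[is_legacy])
  ids
}

# An old ID is translated (with a warning); a new-style ID is returned unchanged.
new_ids <- suppressWarnings(to_new_id(c("2", "ieu-a-2", "ukb-b-1")))
print(new_ids)
```

Updating scripts to use the new IDs directly avoids relying on this kind of translation.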
There is backward compatibility built into the R packages that access the data, so if you use an ‘old’ ID, it will automatically translate that to the new one. But it will give you a warning, and we urge you to update your scripts to reflect this change.
Previously you would automatically be asked to authenticate any query to the database through Google. Now, we are making authentication voluntary - something that you do at the start of a session only if you need access to specific private datasets on the database. For the vast majority of use cases this is not required.
Another change is that the R package that manages the authentication has been updated, and the token files it generates are slightly different. For full information on how to deal with this, see here: https://mrcieu.github.io/ieugwasr/articles/guide.html#authentication
We conducted a large GWAS analysis using a pipeline that systematically analysed every PHESANT phenotype in UK Biobank. There were previously ~20k traits with complete GWAS data, but the majority of these were binary traits based on very small numbers of cases. We have now filtered out unreliable datasets, removing any binary traits that had fewer than 1000 cases; 2514 traits remain. Another issue is the combination of small numbers of cases and allele frequency: the minor allele count (MAC) for a particular association could be very small, which would lead to a high rate of false positives when using Bolt-LMM. The remaining traits have been filtered to retain only associations where the MAC > 90.
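As a rough illustration of the MAC filter, the sketch below computes a minor allele count from sample size and effect allele frequency and keeps only rows with MAC > 90. The column names and values are made up, and the formula MAC = 2 * N * MAF over the whole sample is an assumption; the exact definition used in the pipeline (for example, whether it counts alleles among cases only) is not stated here.

```r
# Made-up association results; column names are hypothetical.
assoc <- data.frame(
  rsid = c("rs1", "rs2", "rs3"),
  eaf  = c(0.30, 0.0001, 0.999),   # effect allele frequency
  n    = c(100000, 100000, 20000), # sample size
  stringsAsFactors = FALSE
)

# Minor allele frequency is the smaller of the two allele frequencies.
maf <- pmin(assoc$eaf, 1 - assoc$eaf)

# Assumed definition: each individual carries two alleles.
assoc$mac <- 2 * assoc$n * maf

# Retain only associations where MAC > 90.
kept <- assoc[assoc$mac > 90, ]
print(kept$rsid)
```

Note that a variant can fail the filter because it is rare (rs2) or because the sample is small (rs3), which is why the threshold is on the count rather than on the frequency.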
A document detailing this investigation is available here: https://htmlpreview.github.io/?https://raw.githubusercontent.com/MRCIEU/ukbb-gwas-analysis/master/docs/ldsc_clumped_analysis.html?token=AAOV6TBQXEXEPT7SUXXLWMC6DWP3O
Previously the data were QC’d to remove malformed results and then deposited as we found them. We are now also pre-harmonising all the data. This means that all alleles are coded on the forward strand, and the non-effect allele is always aligned to the human genome reference sequence B37 (so the effect allele is the non-reference allele). This does mean that sometimes variants have been removed if they did not map to the human genome, and for most datasets the effect allele has been switched for approximately half of all sites. When an effect allele changes we do of course switch the sign of the effect size, so it should not impact any MR results.
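The allele alignment described above can be sketched as follows. This is a simplified illustration rather than the actual harmonisation pipeline: the function name and column names are hypothetical, strand flipping is not handled, and variants where neither allele matches the reference are simply dropped.

```r
# Hypothetical sketch: make the non-effect allele the reference allele.
# Expected columns: ref (reference allele), ea (effect allele),
# nea (non-effect allele), beta (effect size), eaf (effect allele frequency).
align_to_reference <- function(dat) {
  swap <- dat$ea == dat$ref          # effect allele is the reference: needs switching
  keep <- swap | dat$nea == dat$ref  # neither allele matches the reference: drop

  old_ea <- dat$ea[swap]
  dat$ea[swap]  <- dat$nea[swap]
  dat$nea[swap] <- old_ea
  dat$beta[swap] <- -dat$beta[swap]  # switching the effect allele flips the sign
  dat$eaf[swap]  <- 1 - dat$eaf[swap]

  dat[keep, , drop = FALSE]
}

dat <- data.frame(
  ref  = c("A", "C", "G"),
  ea   = c("G", "C", "A"),
  nea  = c("A", "T", "T"),
  beta = c(0.1, 0.2, 0.3),
  eaf  = c(0.2, 0.7, 0.5),
  stringsAsFactors = FALSE
)
aligned <- align_to_reference(dat)
print(aligned)
```

In this toy input the first row is already aligned, the second has its alleles switched and its effect negated, and the third is removed because neither allele matches the reference.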
We have updated the LD reference panel to be harmonised against human genome build 37, and as a consequence a few variants have been lost from the version that was previously used.
Previously we were pre-clumping the tophits and storing them in the MRInstruments R package, and there was often a delay in updating the MRInstruments R package after new datasets were uploaded to the database. We have moved away from this model. Every dataset is pre-clumped, but the result is stored in the database. If you request default clumping values when extracting the tophits of a dataset, it will still be fast, but it is retrieving the data from the server and not from the MRInstruments package. You can continue to use the MRInstruments package for GWAS hits from e.g. GTEx or the EBI GWAS catalog.
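As a minimal sketch of requesting the pre-clumped tophits with default settings, the wrapper below calls extract_instruments from TwoSampleMR without overriding any clumping arguments. The wrapper name is hypothetical and exists only so that nothing is sent to the server until you call it; the call itself needs the TwoSampleMR package and network access.

```r
# Hypothetical wrapper; the request is only made when the function is called.
get_default_instruments <- function(id) {
  # Leaving the p-value and clumping arguments at their defaults
  # asks for the default clumping values described above.
  TwoSampleMR::extract_instruments(outcomes = id)
}

# Example call (requires TwoSampleMR and a connection to the server):
# bmi_instruments <- get_default_instruments("ieu-a-2")
```

Changing the clumping arguments away from their defaults would presumably mean the clumping has to be performed on request rather than read from the stored result.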
All rs IDs have been mapped to dbSNP build 144. Therefore, some rs IDs may have changed, but there is stronger alignment across all datasets.
We are using Elasticsearch and Neo4j on Oracle Cloud Infrastructure to serve the data. It’s much faster. Interestingly, it actually gets faster when more people are using it because the cache gets ‘warmed up’ by more requests.
We have a new home for the GWAS summary data: https://gwas.mrcieu.ac.uk/.
All variants have been mapped to chromosome and position (hg19/build37). You can query based on chromosome position coordinates. This means either a list of <chr:pos> values, or a list of <chr:pos1-pos2> ranges.
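A short sketch of building such a query follows. The format check is a hypothetical helper based only on the two forms described above, and the lookup wrapper assumes the associations function from the ieugwasr package, which needs network access and is therefore left uncalled here.

```r
# Hypothetical check that a string looks like chr:pos or chr:pos1-pos2.
is_chrpos <- function(x) {
  grepl("^([0-9]+|X|Y):[0-9]+(-[0-9]+)?$", x)
}

# One single position and one range, both on chromosome 7 (hg19/build37).
variants <- c("7:105561135", "7:105561135-105563135")
stopifnot(all(is_chrpos(variants)))

# Wrapper around the server query; nothing is requested until it is called.
lookup_by_position <- function(variants, id) {
  ieugwasr::associations(variants = variants, id = id)
}

# Example call (requires ieugwasr and a connection to the server):
# res <- lookup_by_position(variants, "ieu-a-2")
```

Checking the strings before sending them makes it easier to tell a malformed query apart from a region that simply contains no variants.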
Previously we were excluding multi-allelic variants, but they are now retained. Be warned that if you extract a variant that has multiple alleles then you may get more than one row for that variant.
Automated download from the EBI repository, together with an automated upload system and a batch data processing system, means that more data can be added faster to keep the database current.
Previously, if a query to the database failed it didn’t give a reason. Hopefully there is now more clarity about what is happening. You can also check the status of the server here: https://api.opengwas.io/api/
We are trying to make access to the data as flexible as possible. The TwoSampleMR R package was previously the only programmatic way to access the data; now we have the following options:
It is now possible to perform clumping, or create LD matrices, using your own local LD reference dataset. You can download the one that we have been using here: https://github.com/mrcieu/gwasglue#reference-datasets, or create your own plink format dataset e.g. with larger samples or for different ancestries. See the LD clumping functions in the ieugwasr package for more details.
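A minimal sketch of local clumping, assuming the ld_clump function from ieugwasr with its bfile and plink_bin arguments. The rs IDs and p-values below are made up, the wrapper name is hypothetical, and both file paths in the commented call are placeholders that you would replace with your own reference dataset prefix and PLINK executable.

```r
# Made-up tophits; ld_clump expects columns named rsid and pval
# (an optional id column allows clumping per dataset).
hits <- data.frame(
  rsid = c("rs0000001", "rs0000002", "rs0000003"),
  pval = c(1e-12, 3e-9, 4e-8),
  id   = "ieu-a-2",
  stringsAsFactors = FALSE
)

# Hypothetical wrapper; PLINK is only run when the function is called.
clump_locally <- function(dat, bfile, plink_bin) {
  stopifnot(all(c("rsid", "pval") %in% names(dat)))
  ieugwasr::ld_clump(dat = dat, bfile = bfile, plink_bin = plink_bin)
}

# Example call (placeholder paths; requires ieugwasr and a local PLINK binary):
# clumped <- clump_locally(hits,
#                          bfile = "/path/to/reference/EUR",
#                          plink_bin = "/path/to/plink")
```

Because the reference panel is supplied as a file prefix, the same call should work with a panel built from a larger sample or a different ancestry.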
Previously the data were only accessible through the database. Now the data can be downloaded in “GWAS VCF” format from here: https://gwas.mrcieu.ac.uk/ (IEU members can access all the data on RDSF or bluecrystal4 directly). This means that if you want to perform very large or numerous operations, you can do so on HPC or locally, with better performance, by using the data files directly. Please see the gwasvcf R package for how to work with these data.
Either the data in the database, or the GWAS VCF files, can be queried and the results translated into the formats for a bunch of different R packages for MR, colocalisation, fine mapping, etc. Have a look at the gwasglue R package, to see what is available and how to do this. It’s still under construction, but feel free to try it, make suggestions, and contribute code.
We have set up a GitHub issues page here: https://github.com/MRCIEU/opengwas-requests/issues
Please visit it to log new data requests, or to contribute new data.
To install the new version of TwoSampleMR, perform as normal:
install.packages("remotes")
remotes::install_github("MRCIEU/TwoSampleMR")
To update the package just run the remotes::install_github("MRCIEU/TwoSampleMR") command again.
We recommend using this new version going forwards, but for a limited time we are enabling backwards compatibility in case you are in the middle of an analysis or need to reproduce an old analysis. To use the legacy version of the package and the database, install using:
install.packages("remotes")
remotes::install_github("MRCIEU/TwoSampleMR@0.4.26")