This role enables you to contribute datasets to OpenGWAS (create metadata, upload files for QC, check the QC report and submit the dataset for approval).
You may contact Admin (details TBC) and request to be added as a Contributor. You will be granted access to https://api.opengwas.io/contribution.
You also need R/GwasDataImport installed and your OpenGWAS JWT (token) set up in your R environment.
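For example, a minimal setup sketch, assuming installation from the MRCIEU/GwasDataImport GitHub repository via remotes and that the token is stored in the OPENGWAS_JWT environment variable (as used by ieugwasr); see the OpenGWAS token documentation for details:

# One-off installation from GitHub (assumes the remotes package is available)
remotes::install_github("MRCIEU/GwasDataImport")

# Store your OpenGWAS token in ~/.Renviron, e.g.
#   OPENGWAS_JWT=<your token>
# then check that it is picked up in your R session:
Sys.getenv("OPENGWAS_JWT")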
Each dataset will be a .txt or .txt.gz file. The content looks like this:
ID ALT REF BETA SE PVALUE AF N ZVALUE INFO CHROM POS
rs10399878 G A 0.0118 0.016 0.4608 0.9569 124787 NA NA 1 1239953
rs1039063 G T 0.0026 0.0036 0.4702 0.55 236102 NA NA 1 2281978
rs1039100 G A 0.0033 0.0047 0.4826 NA 221290 NA NA 1 2286947
rs10158583 A G 0.0099 0.0059 0.09446 0.075 321197 NA NA 1 3144068
rs10157420 C T -0.0038 0.0075 0.6124 0.05 234171 NA NA 1 3146497
A header row is not required, since you will specify the column mappings in later steps; if there is one, it will be ignored by the pipeline, so it's fine to leave it as-is.
These columns are required for each dataset:
Note: you need to remove any leading 0 from the chromosome values, e.g. 07 -> 7.
Note: a row will be removed entirely by the pipeline if it has an NA/Inf value in any of the required columns.
These columns are optional:
You can track your contributions via the web interface, but you will need R/GwasDataImport anyway to upload the files.
For each dataset you will need to go through steps 1 to 4. See also: Import in bulk.
For each dataset to be uploaded, you can either use the web interface or the R/GwasDataImport package to create the metadata.
https://api.opengwas.io/contribution/ provides a webform with dropdowns and tooltips for each field. This is handy when you are new to this process and/or only have a few datasets to upload.
The GWAS ID will be available on the web interface for each dataset. Assume it's ieu-b-9999.
In R, specify the path (assume it's ~/bmi_test.txt.gz) and the GWAS ID, and then mark the metadata as uploaded.
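A minimal sketch of this step, assuming Dataset$new() accepts an igd_id argument for an existing GWAS ID (check the GwasDataImport documentation for the exact call that marks the metadata as uploaded):

library(GwasDataImport)

# Point the Dataset object at the file and the GWAS ID created on the web interface
x <- Dataset$new(filename="~/bmi_test.txt.gz", igd_id="ieu-b-9999")
# Then mark the metadata as uploaded (see the package documentation for the exact call)
# before continuing with step 2 below.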
Alternatively, you can use R if you have a batch of candidate datasets, since it's easier to upload metadata for multiple datasets programmatically. We recommend trying the web interface in 1a first before going down this more advanced route.
Assume the full path to the file is ~/bmi_test.txt.gz.
library(GwasDataImport)

# Don't provide the GWAS ID - a new one will be assigned when the metadata is uploaded
x <- Dataset$new(filename="~/bmi_test.txt.gz")
x$collect_metadata(list(
  trait="TEST - DO NOT USE 2",
  group_name="public",
  build="HG19/GRCh37",
  category="Risk factor",
  subcategory="Anthropometric",
  ontology="NA",
  population="Mixed",
  sex="Males and Females",
  sample_size=339224,
  author="Mendel GJ",
  year=2022,
  unit="SD"
))
x$api_metadata_upload()
The GWAS ID will be returned by the last command. Assume it's ieu-b-9999. At the same time a new record will show up on https://api.opengwas.io/contribution/.
You can modify the metadata either via the web interface (recommended) or through R/GwasDataImport, regardless of how you created the metadata (i.e. metadata created via the R package can be modified on the web interface, or vice versa).
Note that the metadata can only be modified when there is no QC pipeline associated with the dataset, because at the last step, when the QC report is generated, the metadata is hardcoded into the report.
Always check that the GWAS ID and path information stored are accurate.
Specify the column mapping (1-indexed; check the GwasDataImport documentation for the parameter names):
x$determine_columns(list(
  chr_col=11,     # CHROM
  pos_col=12,     # POS
  ea_col=2,       # ALT (effect allele)
  oa_col=3,       # REF (other allele)
  beta_col=4,     # BETA
  se_col=5,       # SE
  pval_col=6,     # PVALUE
  snp_col=1,      # ID
  eaf_col=7,      # AF
  ncontrol_col=8  # N
))
Use the output to double-check the mapping. If necessary, run x$determine_columns(...) again with the correct mapping.
Format the dataset and then upload (both may take a while):
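A minimal sketch of this step, assuming the GwasDataImport methods format_dataset() and api_gwasdata_upload() (check the package documentation if the names differ in your version):

x$format_dataset()       # harmonise the file into the required format
x$api_gwasdata_upload()  # upload the formatted file to the QC pipeline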
You will see the “Dataset has been added to the pipeline” message if the upload was successful.
And finally don’t forget to clean up the working directory:
On https://api.opengwas.io/contribution you can click the 2. QC tab of the dataset popup and check the pipeline state.
For each dataset, you should review the QC report when it’s available and decide whether to submit the dataset for approval or not. You will have the following options:
You may also use the checkboxes on the main screen to select datasets and submit for approval in bulk.
If you have multiple datasets you may want to write an R snippet to semi-automate this process.
Set up once, then for each dataset go through steps 1b, 2 and 3. Finally, visit the portal in step 4 and use the checkboxes to submit in bulk.
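For example, a rough sketch of such a loop, assuming a small data frame describing each dataset (the file paths, traits and sample sizes below are hypothetical) and the same Dataset methods used in steps 1b to 3 above:

library(GwasDataImport)

# Hypothetical table of datasets to import
datasets <- data.frame(
  filename    = c("~/bmi_test.txt.gz", "~/bmi_test2.txt.gz"),
  trait       = c("TEST - DO NOT USE 2", "TEST - DO NOT USE 3"),
  sample_size = c(339224, 250000),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(datasets))) {
  x <- Dataset$new(filename = datasets$filename[i])

  # Step 1b: create and upload the metadata (other fields as in the example above)
  x$collect_metadata(list(
    trait = datasets$trait[i],
    group_name = "public",
    build = "HG19/GRCh37",
    category = "Risk factor",
    subcategory = "Anthropometric",
    ontology = "NA",
    population = "Mixed",
    sex = "Males and Females",
    sample_size = datasets$sample_size[i],
    author = "Mendel GJ",
    year = 2022,
    unit = "SD"
  ))
  x$api_metadata_upload()

  # Step 2: column mapping (assumed identical across the files here)
  x$determine_columns(list(
    chr_col = 11, pos_col = 12, ea_col = 2, oa_col = 3,
    beta_col = 4, se_col = 5, pval_col = 6, snp_col = 1,
    eaf_col = 7, ncontrol_col = 8
  ))

  # Step 3: format, upload and clean up
  x$format_dataset()
  x$api_gwasdata_upload()
  x$delete_wd()
}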
Admins will review and approve or reject each dataset. Approved datasets go into the release pipeline, which may take 0.5 to 4 hours. After that you can query the dataset via the packages, e.g.:
ieugwasr::tophits("ieu-b-9999")
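You can also check that the metadata is live, for example with ieugwasr::gwasinfo() (assuming the ieugwasr package, as in the example above):

ieugwasr::gwasinfo("ieu-b-9999")  # retrieve the metadata for the new dataset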
Note: currently https://gwas.mrcieu.ac.uk/ has a synchronisation issue, so new datasets may not be displayed there, but you can always query new datasets directly via the packages, as shown above.