Title: Import EBI Data to OpenGWAS
Description: Determine the new EBI data not present in OpenGWAS. Download dataset and import metadata. Upload processed data and metadata to OpenGWAS.
Authors: Gibran Hemani [aut, cre], Philip Haycock [aut], Tom Palmer [aut]
Maintainer: Gibran Hemani <[email protected]>
License: MIT + file LICENSE
Version: 0.2
Built: 2025-01-08 06:31:25 UTC
Source: https://github.com/MRCIEU/GwasDataImport
List of EBI datasets that are currently being processed
being_processed(dat)
dat
Output from
Updated dat
Create dataset of some hm3 SNPs and their build positions
create_build_reference()
saves build_ref object
These are slow to create, so just make them once and save
create_marts()
Saves data to data/marts.rdata
Object that downloads, develops and uploads GWAS summary datasets for IEU OpenGWAS database
filename
Path to raw GWAS summary dataset
igd_id
ID to use for upload. If NULL then the next available ID in batch ieu-b will be used automatically
wd
Working directory in which to save processed files. Will be deleted upon completion
gwas_out
path to processed summary file
nsnp_read
Number of SNPs read initially
nsnp
Number of SNPs retained after reading
metadata
List of meta-data entries
metadata_test
List of outputs from tests of the effect allele, effect allele frequency columns and summary data using CheckSumStats
metadata_file
Path to meta-data json file
datainfo
List of GWAS file parameters
datainfo_file
Path to datainfo json file
params
Initial column identifiers specified for raw dataset
metadata_uploaded
TRUE/FALSE of whether the metadata has been uploaded
gwasdata_uploaded
TRUE/FALSE of whether the gwas data has been uploaded
metadata_upload_status
Response from server about upload process
gwasdata_upload_status
Response from server about upload process
new()
Initialise
Dataset$new(filename = NULL, wd = tempdir(), igd_id = NULL)
filename
Path to raw GWAS summary data file
wd
Path to directory to use as the working directory. Will be deleted upon completion - best to keep as the default randomly generated temporary directory
igd_id
Option to provide a specified ID for upload. If none provided then will use the next ieu-a batch ID
A new Dataset object
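A minimal initialisation sketch (the filename and ID are hypothetical, and the `if (FALSE)` wrapper marks it as not-run since it needs the package installed and OpenGWAS API access):

```r
if (FALSE) {  # not run: needs GwasDataImport installed and OpenGWAS API access
  library(GwasDataImport)
  # Hypothetical raw summary file and candidate ID
  x <- Dataset$new(filename = "gwas.txt.gz", igd_id = "ieu-b-9999")
  x$is_new_id()  # query the database to confirm the ID is unused
}
```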
is_new_id()
Check if the specified ID is unique within the database. It checks published GWASs and those currently being processed
Dataset$is_new_id(id = self$igd_id)
id
ID to check
delete_wd()
Delete working directory
Dataset$delete_wd()
set_wd()
Set working directory (creates it if it does not exist)
Dataset$set_wd(wd)
wd
working directory
se_from_bp()
Estimate standard error from beta and p-value
Dataset$se_from_bp(beta, pval, minp = 1e-300)
beta
Effect size
pval
p-value
minp
Minimum p-value cutoff. Default = 1e-300
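The underlying relationship can be sketched independently of the class: recover the absolute z-statistic implied by a two-sided p-value with qnorm(), then take |beta|/z. This is a standalone illustration of the idea, not the package's exact implementation:

```r
# Standalone sketch of estimating SE from beta and p-value via the implied
# z-statistic; mirrors the idea behind Dataset$se_from_bp() but is not the
# package's exact code
se_from_beta_pval <- function(beta, pval, minp = 1e-300) {
  pval <- pmax(pval, minp)                  # guard against qnorm(0) = Inf
  z <- qnorm(pval / 2, lower.tail = FALSE)  # |z| implied by two-sided p
  abs(beta) / z
}

se_from_beta_pval(0.5, 2 * pnorm(-5))  # 0.5 / 5 = 0.1
```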
determine_columns()
Specify which columns in the dataset correspond to which fields.
Dataset$determine_columns(params, nrows = 100, gwas_file = self$filename, ...)
params
List of column identifiers. Identifiers can be numeric position or column header name. Required columns are: c("chr_col", "pos_col", "ea_col", "oa_col", "beta_col", "se_col", "pval_col","rsid_col"). Optional columns are: c("snp_col", "eaf_col", "oaf_col", "ncase_col", "imp_z_col", "imp_info_col", "ncontrol_col").
nrows
How many rows to read to check that parameters have been specified correctly
gwas_file
Filename to read
...
Further arguments to pass to data.table::fread in order to correctly read the dataset
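For example, a column specification for a hypothetical raw file whose layout mixes numeric positions and header names (the header names here are illustrative and must match your actual file):

```r
# Hypothetical column map for determine_columns(); identifiers may be numeric
# positions or column header names, mixed freely
params <- list(
  chr_col  = 1,                # first column holds the chromosome
  pos_col  = 2,                # second column holds the base-pair position
  ea_col   = "effect_allele",
  oa_col   = "other_allele",
  beta_col = "beta",
  se_col   = "standard_error",
  pval_col = "p_value",
  rsid_col = "rsid",
  eaf_col  = "eaf"             # optional: effect allele frequency
)
# x$determine_columns(params)  # not run: requires an initialised Dataset
```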
format_dataset()
Process dataset ready for uploading. Determines build and lifts over to hg19/b37 if necessary.
Dataset$format_dataset( gwas_file = self$filename, gwas_out = file.path(self$wd, "format.txt.gz"), params = self$params, metadata_test = self$metadata_test, ... )
gwas_file
GWAS filename
gwas_out
Filename to save processed dataset to
params
Column specifications (see determine_columns for more info)
metadata_test
List of outputs from tests of the effect allele, effect allele frequency columns and summary data using CheckSumStats
...
Further arguments to pass to data.table::fread in order to correctly read the dataset
view_metadata_options()
View the specifications for available meta data fields, as taken from https://api.opengwas.io/api/docs
Dataset$view_metadata_options()
get_gwasdata_fields()
Get a list of GWAS data fields and whether or not they are required
Dataset$get_gwasdata_fields()
data.frame
get_metadata_fields()
Get a list of metadata fields and whether or not they are required
Dataset$get_metadata_fields()
data.frame
collect_metadata()
Input metadata
Dataset$collect_metadata(metadata, igd_id = self$igd_id)
metadata
List of meta-data fields and their values; see view_metadata_options for which fields need to be provided.
igd_id
ID to be used for uploading to the database
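A sketch of a metadata list (the field names and values here are assumptions for illustration; call view_metadata_options() for the authoritative set and which fields are required):

```r
# Illustrative metadata list for collect_metadata(); field names and values
# are assumptions, check view_metadata_options() before use
metadata <- list(
  trait       = "Example trait",
  group_name  = "public",
  build       = "HG19/GRCh37",
  population  = "European",
  sex         = "Males and Females",
  sample_size = 10000
)
# x$collect_metadata(metadata)  # not run: requires an initialised Dataset
```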
check_meta_data()
Check that the reported effect allele and effect allele frequency columns are correct.
Dataset$check_meta_data( gwas_file = self$filename, params = self$params, metadata = self$metadata )
gwas_file
Filename to read
params
Column names from x$determine_columns(). Required columns are: c("snp_col", "ea_col", "oa_col", "eaf_col")
metadata
metadata from x$collect_metadata()
write_metadata()
Write meta data to json file
Dataset$write_metadata( metadata = self$metadata, datainfo = self$datainfo, outdir = self$wd )
metadata
List of meta data fields and their values
datainfo
List of data column parameters
outdir
Output directory to write json files
api_metadata_upload()
Upload meta data to API
Dataset$api_metadata_upload( metadata = self$metadata, metadata_test = self$metadata_test, access_token = ieugwasr::check_access_token() )
metadata
List of meta data fields and their values
metadata_test
List of outputs from tests of the effect allele, effect allele frequency columns and summary data using CheckSumStats
access_token
Google OAuth2.0 token. See ieugwasr documentation for more info
api_metadata_edit()
Edit meta data already uploaded to the API
Dataset$api_metadata_edit( metadata = self$metadata, access_token = ieugwasr::check_access_token() )
metadata
List of meta data fields and their values
access_token
Google OAuth2.0 token. See ieugwasr documentation for more info
api_metadata_check()
View meta-data
Dataset$api_metadata_check( id = self$igd_id, access_token = ieugwasr::check_access_token() )
id
ID to check
access_token
Google OAuth2.0 token. See ieugwasr documentation for more info
api_metadata_delete()
Delete a dataset. This deletes the metadata AND any uploaded GWAS data (and related processing files)
Dataset$api_metadata_delete( id = self$igd_id, access_token = ieugwasr::check_access_token() )
id
ID to delete
access_token
Google OAuth2.0 token. See ieugwasr documentation for more info
api_gwasdata_upload()
Upload gwas dataset
Dataset$api_gwasdata_upload( datainfo = self$datainfo, gwasfile = self$gwas_out, metadata_test = self$metadata_test, access_token = ieugwasr::check_access_token() )
datainfo
List of data column parameters
gwasfile
Path to processed gwasfile
metadata_test
List of outputs from tests of the effect allele, effect allele frequency columns and summary data using CheckSumStats
access_token
Google OAuth2.0 token. See ieugwasr documentation for more info
api_gwasdata_check()
Check status of API processing pipeline
Dataset$api_gwasdata_check( id = self$igd_id, access_token = ieugwasr::check_access_token() )
id
ID to check
access_token
Google OAuth2.0 token. See ieugwasr documentation for more info
api_gwasdata_delete()
Delete a dataset. This deletes the metadata AND any uploaded GWAS data (and related processing files)
Dataset$api_gwasdata_delete( id = self$igd_id, access_token = ieugwasr::check_access_token() )
id
ID to delete
access_token
Google OAuth2.0 token. See ieugwasr documentation for more info
api_qc_status()
Check the status of the GWAS QC processing pipeline
Dataset$api_qc_status( id = self$igd_id, access_token = ieugwasr::check_access_token() )
id
ID to check
access_token
Google OAuth2.0 token. See ieugwasr documentation for more info
api_report()
View the html report for a processed dataset
Dataset$api_report( id = self$igd_id, access_token = ieugwasr::check_access_token() )
id
ID of report to view
access_token
Google OAuth2.0 token. See ieugwasr documentation for more info
api_gwas_release()
Release a dataset
Dataset$api_gwas_release( comments = NULL, passed_qc = "True", id = self$igd_id, access_token = ieugwasr::check_access_token() )
comments
Optional comments to provide when uploading
passed_qc
True or False
id
ID to release
access_token
Google OAuth2.0 token. See ieugwasr documentation for more info
clone()
The objects of this class are cloneable with this method.
Dataset$clone(deep = FALSE)
deep
Whether to make a deep clone.
Determine build based on reference dataset
determine_build(rsid, chr, pos, build = c(37, 38, 36), fallback = "position")
rsid
Vector of rsids
chr
Vector of chromosomes
pos
Vector of positions
build
Builds to try, e.g. c(37, 38, 36)
fallback
Whether to fall back to the "position" (fast) or "biomart" (more accurate if you have rsids) approach instead
build if detected, or dataframe of matches if not
Determines which build a set of SNPs is on
determine_build_biomart(rsid, chr, pos, build = c(37, 38, 36))
rsid
Vector of rsids
chr
Vector of chromosomes
pos
Vector of positions
build
Builds to try, e.g. c(37, 38, 36)
build if detected, or dataframe of matches if not
A bit sketchy but computationally fast - just assumes that there will be at least 50x more position matches in the true build than either of the others.
determine_build_position(pos, build = c(37, 38, 36))
pos
Vector of positions
build
Builds to try, e.g. c(37, 38, 36)
build or if not determined then dataframe
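The heuristic can be illustrated with a toy standalone version: count how many input positions appear in a reference set for each candidate build, and accept the winner only if it dominates by the stated margin. This is an illustration of the idea, not the package's implementation:

```r
# Toy illustration of the position-overlap heuristic: the true build should
# have far more exact position matches against its reference than the others
pick_build <- function(pos, refs, margin = 50) {
  hits <- vapply(refs, function(r) sum(pos %in% r), integer(1))
  best <- which.max(hits)
  if (hits[best] >= margin * max(hits[-best], 1)) names(refs)[best] else NA
}

# Stand-in reference positions for two builds (real references are hm3 SNPs)
refs <- list(b37 = seq(1000, 100000, by = 1000),
             b38 = seq(1001, 100001, by = 1000))
pick_build(seq(1000, 60000, by = 1000), refs)  # "b37"
```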
Figure out which datasets are not present in the database
determine_new_datasets( ebi_ftp_url = options()$ebi_ftp_url, blacklist = NULL, exclude_multi_datasets = TRUE )
ebi_ftp_url
FTP URL. Default=options()$ebi_ftp_url
blacklist
List of EBI datasets to ignore. Default=NULL
exclude_multi_datasets
If an EBI ID has more than one dataset, should it be ignored
data frame
Convert output from listftp into something that is easier to read
ebi_datasets(ebi_ftp_url = options()$ebi_ftp_url)
ebi_ftp_url
EBI FTP URL. Default=options()$ebi_ftp_url
data frame
Object that downloads, develops and uploads EBI dataset
GwasDataImport::Dataset
-> EbiDataset
ebi_id
EBI ID to look for
traitname
Name of trait
ftp_path
Path to files in EBI FTP
or_flag
TRUE/FALSE for whether OR had to be converted to beta
gwas_out1
Path to first look at EBI dataset
GwasDataImport::Dataset$api_gwas_release()
GwasDataImport::Dataset$api_gwasdata_check()
GwasDataImport::Dataset$api_gwasdata_delete()
GwasDataImport::Dataset$api_gwasdata_upload()
GwasDataImport::Dataset$api_metadata_check()
GwasDataImport::Dataset$api_metadata_delete()
GwasDataImport::Dataset$api_metadata_edit()
GwasDataImport::Dataset$api_metadata_upload()
GwasDataImport::Dataset$api_qc_status()
GwasDataImport::Dataset$api_report()
GwasDataImport::Dataset$check_meta_data()
GwasDataImport::Dataset$collect_metadata()
GwasDataImport::Dataset$delete_wd()
GwasDataImport::Dataset$determine_columns()
GwasDataImport::Dataset$format_dataset()
GwasDataImport::Dataset$get_gwasdata_fields()
GwasDataImport::Dataset$get_metadata_fields()
GwasDataImport::Dataset$is_new_id()
GwasDataImport::Dataset$se_from_bp()
GwasDataImport::Dataset$set_wd()
GwasDataImport::Dataset$view_metadata_options()
GwasDataImport::Dataset$write_metadata()
new()
Initialise object
EbiDataset$new( ebi_id, wd = tempdir(), ftp_path = NULL, igd_id = paste0("ebi-a-", ebi_id), traitname = NULL )
ebi_id
e.g. GCST005522
wd
Directory in which to download and develop dataset. Default=tempdir(). Deleted automatically upon object removal
ftp_path
Pre-specified path to data. Default=NULL
igd_id
Defaults to "ebi-a-<ebi_id>"
traitname
Option to provide traitname of dataset
A new EbiDataset object
download_dataset()
Download
EbiDataset$download_dataset( ftp_path = self$ftp_path, ftp_url = options()$ebi_ftp_url, outdir = self$wd )
ftp_path
Pre-specified path to data. Default=self$ftp_path
ftp_url
Default=options()$ebi_ftp_url
outdir
Default=self$wd
format_ebi_dataset()
Organise data before formatting. This is slow but doesn't really matter
EbiDataset$format_ebi_dataset( filename = self$filename, output = file.path(self$wd, "step1.txt.gz") )
filename
Filename of GWAS dataset
output
Where to save formatted dataset
organise_metadata()
Download and parse metadata
EbiDataset$organise_metadata( ebi_id = self$ebi_id, or_flag = self$or_flag, igd_id = self$igd_id, units = NULL, sex = "NA", category = "NA", subcategory = "NA", build = "HG19/GRCh37", group_name = "public", traitname = self$traitname )
ebi_id
Default=self$ebi_id
or_flag
Default=self$or_flag
igd_id
Default=self$igd_id
units
Default=NULL
sex
Default="NA"
category
Default="NA"
subcategory
Default="NA"
build
Default="HG19/GRCh37"
group_name
Default="public"
traitname
Default=self$traitname
pipeline()
Once initialised, this function strings together everything, i.e. downloading, processing and uploading
EbiDataset$pipeline()
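Putting it together, a hypothetical end-to-end import of one EBI dataset (the `if (FALSE)` wrapper marks it as not-run since it needs network access to the EBI FTP and OpenGWAS API credentials):

```r
if (FALSE) {  # not run: needs EBI FTP access and OpenGWAS API credentials
  library(GwasDataImport)
  # Hypothetical accession from the EBI GWAS Catalog
  x <- EbiDataset$new(ebi_id = "GCST005522")
  x$pipeline()  # download, format, organise metadata and upload in one call
}
```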
clone()
The objects of this class are cloneable with this method.
EbiDataset$clone(deep = FALSE)
deep
Whether to make a deep clone.
Get harmonised file for specific EBI ID
get_ftp_path(ebi_id, ebi_ftp_url = options()$ebi_ftp_url)
ebi_id
EBI ID, e.g. GCST000879
ebi_ftp_url
EBI FTP URL. Default=options()$ebi_ftp_url
ftp path (excluding server)
Lookup positions for given rsids in particular build
get_positions( rsid, build = 37, method = c("opengwas", "biomart")[1], splitsize = 50000 )
rsid
Vector of rsids
build
36, 37 (default) or 38
method
"opengwas" (fastest) or "biomart"
splitsize
Default 50000
data frame
Determine GWAS build and liftover to required build
liftover_gwas( dat, build = c(37, 38, 36), to = 37, chr_col = "chr", pos_col = "pos", snp_col = "snp", ea_col = "ea", oa_col = "oa", build_fallback = "position" )
dat
Data frame with chromosome, position, SNP name, effect allele and non-effect allele columns
build
The possible builds to check the data against. Default = c(37, 38, 36)
to
Which build to lift over to. Default=37
chr_col
Name of the chromosome column. Required
pos_col
Name of the position column. Required
snp_col
Name of the SNP column. Optional; a less certain matching method is used if not available
ea_col
Name of the effect allele column. Optional; might lead to duplicated rows if not provided
oa_col
Name of the other allele column. Optional; might lead to duplicated rows if not provided
build_fallback
Whether to fall back to the "position" (fast) or "biomart" (more accurate if you have rsids) approach instead
Data frame
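A hypothetical input frame using the default column names (the liftover call itself is left not-run, as it requires the package and its chain files):

```r
# Hypothetical data frame matching liftover_gwas()'s default column names
dat <- data.frame(
  chr = c("1", "2"),
  pos = c(1000000L, 2000000L),
  snp = c("rs111", "rs222"),   # rsids make build detection more reliable
  ea  = c("A", "T"),
  oa  = c("G", "C")
)
# lifted <- liftover_gwas(dat, to = 37)  # not run: requires GwasDataImport
```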
List all files on the EBI FTP server
listftp(url = options()$ebi_ftp_url, recursive = TRUE)
url
FTP URL to look up
recursive
If FALSE, list only the top directory; otherwise list recursively
Vector of paths