| Title: | Metabolomics data preparation and processing pipeline |
|---|---|
| Description: | Reads in raw Metabolon, Nightingale Health, Olink, and SomaLogic xls sheets, or flat text files, and aids in data preparation of all metabolomics & proteomics data sets. |
| Authors: | Laura Corbin [aut], David Hughes [aut, cre], Nicholas Sunderland [aut], Matthew Lee [aut], Alec McKinlay [aut] |
| Maintainer: | David Hughes <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.1 |
| Built: | 2026-05-07 09:33:11 UTC |
| Source: | https://github.com/MRCIEU/metaboprep |
This function adds an additional layer of data along the third dimension to an existing 3D array (or 2D matrix/vector) by stacking a new layer of data. It ensures that the dimensions of the new layer match the first two dimensions of the existing array or matrix. If there is a mismatch in row or column names and the 'force' parameter is set to 'TRUE', the function will align the data by filling missing values with 'NA'. It is used internally and not intended for routine user use.
add_layer(current, layer, layer_name, force = FALSE)
current |
A vector, matrix, or 3D array representing the current stack of data. |
layer |
A matrix or array that represents the new layer of data to be added. It should match the dimensions of the first two dimensions of 'current'. |
layer_name |
A character string specifying the name of the new dimension for the 3rd axis. This can be used to annotate the new data layer. |
force |
A logical value indicating whether to force the join and create 'NA' values where row or column names do not match between 'current' and 'layer'. Default is 'FALSE'. |
A 3D array with the added layer in the third dimension.
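The stacking step described above can be sketched in base R. This is a minimal illustration only, not the package's internal code: `stack_layer` is a hypothetical helper, it assumes the new layer already matches the first two dimensions, and it does not reproduce the name alignment done when `force = TRUE`.

```r
# Hypothetical helper: stack a matrix onto a matrix or 3D array along axis 3.
stack_layer <- function(current, layer, layer_name) {
  # Promote a 2D matrix to a 3D array holding a single layer.
  if (length(dim(current)) == 2) {
    current <- array(current, dim = c(dim(current), 1),
                     dimnames = list(rownames(current), colnames(current), "layer1"))
  }
  # The new layer must match the first two dimensions of the stack.
  stopifnot(identical(dim(current)[1:2], dim(layer)))
  n_layers <- dim(current)[3]
  old_names <- dimnames(current)[[3]]
  if (is.null(old_names)) old_names <- paste0("layer", seq_len(n_layers))
  out <- array(NA_real_, dim = c(dim(layer), n_layers + 1),
               dimnames = list(rownames(current), colnames(current),
                               c(old_names, layer_name)))
  out[, , seq_len(n_layers)] <- current
  out[, , n_layers + 1] <- layer
  out
}

m1 <- matrix(1:6, 2, 3, dimnames = list(c("s1", "s2"), c("f1", "f2", "f3")))
m2 <- m1 * 10
arr <- stack_layer(m1, m2, "scaled")
dim(arr)
```

Building a fresh array and copying layers in keeps the sketch free of extra dependencies; a real implementation may use a different approach.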
Scans the package source files for functions starting with "read_" to determine supported data formats.
available_data_formats()
A named character vector of available data formats.
Scans the package source files for available report templates to write to.
available_report_templates()
A character vector of available report templates
Run batch normalisation based on the platform flag in the features data
batch_normalise(metaboprep, run_mode_col, run_mode_colmap, source_layer = "input", dest_layer = "batch_normalised")
metaboprep |
an object of class Metaboprep |
run_mode_col |
character, column name in features data containing the run mode |
run_mode_colmap |
named character vector or list, c(mode = "mode col name in samples") |
source_layer |
character, which data layer to get the data from |
dest_layer |
character, which data layer to put the data into |
A 'Metaboprep' object is a container for matrices of metabolite data, along with associated metadata. It allows for efficient storage and manipulation of data, supporting quality control, transformations, and various analyses. This object facilitates easy access to data layers, sample and feature summaries, outlier treatment, and more.
Metaboprep(data, samples, features, exclusions = list(samples = list(user_excluded = character(), extreme_sample_missingness = character(), user_defined_sample_missingness = character(), user_defined_sample_totalpeakarea = character(), user_defined_sample_pca_outlier = character()), features = list(user_excluded = character(), extreme_feature_missingness = character(), user_defined_feature_missingness = character())), feature_summary = array(data = NA_real_, dim = c(0, 0, 0)), sample_summary = array(data = NA_real_, dim = c(0, 0, 0)))
data |
numeric matrix, the data matrix containing metabolite values (not to be set directly). |
samples |
data.frame, a data frame containing sample-related information (not to be set directly). |
features |
data.frame, a data frame containing feature-related information (not to be set directly). |
exclusions |
list, holds exclusion codes for data masking (not to be set directly). |
feature_summary |
numeric matrix, summary statistics for features (not to be set directly). |
sample_summary |
numeric matrix, summary statistics for samples (not to be set directly). |
An object of class Metaboprep, an S7 class.
data |
numeric matrix, the metabolite data. |
samples |
data.frame, the samples data frame. |
features |
data.frame, the features data frame. |
exclusions |
list, exclusion codes (mask for data). |
feature_summary |
numeric matrix, feature summary statistics. |
sample_summary |
numeric matrix, sample summary statistics. |
Cleans a character vector of names by replacing spaces with underscores, removing special characters ('-', '.'), replacing '%' with 'pct', and converting to lowercase.
clean_names(names)
names |
'character vector' A vector of names to be cleaned. |
'character vector' A standardized version of the input names.
clean_names(c("Sample ID", "Feature-Name.1", "Concentration %")) # Returns: c("sample_id", "featurename1", "concentration_pct")
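The cleaning rules can be approximated with a few `gsub()` calls. This is a sketch under stated assumptions rather than the package's code: `clean_names_sketch` is a hypothetical name, and the `%` to `pct` rule is inferred from the documented example output.

```r
# Hypothetical re-implementation of the documented cleaning rules.
clean_names_sketch <- function(x) {
  x <- gsub("%", "pct", x, fixed = TRUE)  # assumed rule, inferred from the example
  x <- gsub("[-.]", "", x)                # drop hyphens and periods
  x <- gsub("\\s+", "_", trimws(x))       # spaces become underscores
  tolower(x)
}

cleaned <- clean_names_sketch(c("Sample ID", "Feature-Name.1", "Concentration %"))
cleaned
```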
This function (1) identifies an informative distribution of effect and power estimates given your data's total sample size and (2) returns a summary plot.
continuous_power_plot(mydata)
mydata |
Your metabolite data matrix, with samples in rows |
a ggplot2 object
ex_data = matrix(NA, 1000, 2)
continuous_power_plot(ex_data)
Calculates Cramer's V for a table of nominal variables; confidence intervals by bootstrap. Function taken from the rcompanion R package.
cramerV(x, y = NULL, ci = FALSE, conf = 0.95, type = "perc", R = 1000, digits = 4, bias.correct = FALSE, reportIncomplete = FALSE, verbose = FALSE, ...)
x |
Either a two-way table or a two-way matrix. Can also be a vector of observations for one dimension of a two-way table. |
y |
If |
ci |
If |
conf |
The level for the confidence interval. |
type |
The type of confidence interval to use.
Can be any of " |
R |
The number of replications to use for bootstrap. |
digits |
The number of significant digits in the output. |
bias.correct |
If |
reportIncomplete |
If |
verbose |
If |
... |
Additional arguments passed to |
Cramer's V is used as a measure of association between two nominal variables, or as an effect size for a chi-square test of association. For a 2 x 2 table, the absolute value of the phi statistic is the same as Cramer's V.
Because V is always positive, if type="perc", the confidence interval will never cross zero. In this case, the confidence interval range should not be used for statistical inference. However, if type="norm", the confidence interval may cross zero.
When V is close to 0 or very large, or with small counts, the confidence intervals determined by this method may not be reliable, or the procedure may fail.
A single statistic, Cramer's V. Or a small data frame consisting of Cramer's V, and the lower and upper confidence limits.
Salvatore Mangiafico, [email protected]
http://rcompanion.org/handbook/H_10.html
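For orientation, the point estimate of Cramer's V can be computed from the chi-square statistic as V = sqrt(chi2 / (n * min(r - 1, c - 1))). The sketch below uses only base R; `cramers_v` is a hypothetical helper that omits the bootstrap intervals and bias correction offered by `cramerV()`.

```r
# Hypothetical helper: uncorrected Cramer's V for a two-way table of counts.
cramers_v <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

perfect <- matrix(c(10, 0, 0, 10), nrow = 2)      # complete association
independent <- matrix(c(10, 10, 10, 10), nrow = 2) # no association
v_perfect <- cramers_v(perfect)
v_independent <- cramers_v(independent)
```

The two tables bracket the range of the statistic: complete association gives a value of 1 and independence gives 0.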
This function allows you to estimate power for a binary variable given a defined number of case samples, control samples, effect size, and significance threshold.
eval.power.binary.imbalanced(N_case, N_control, effect, alpha)
N_case |
a numeric vector of sample size of cases |
N_control |
a numeric vector of sample size of controls |
effect |
a numeric vector of effect size |
alpha |
a numeric vector of significance thresholds |
a matrix of parameter inputs and power estimates
eval.power.binary.imbalanced(N_case = 1000, N_control = 1000, effect = 0.01, alpha = 0.05)
eval.power.binary.imbalanced(N_case = c(1000, 2000), N_control = c(1000, 2000), effect = 0.01, alpha = 0.05)
This function estimates power for a continuous variable given the sample size, effect size, significance threshold, and the degrees of freedom.
eval.power.cont(N, n_coeff, effect, alpha)
N |
Sample size |
n_coeff |
degrees of freedom for numerator |
effect |
effect size |
alpha |
significance level (Type 1 error) |
eval.power.cont(N = 1000, n_coeff = 1, effect = 0.0025, alpha = 0.05)
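One common way to compute power for a continuous trait is through the noncentral F distribution. The sketch below is an illustration only and may differ from what `eval.power.cont()` does internally: `power_f_test` is a hypothetical helper, and it assumes `effect` is Cohen's f-squared with noncentrality `effect * N`.

```r
# Hypothetical helper: power of an F test with n_coeff numerator degrees of freedom.
# Assumption: `effect` is Cohen's f^2, so the noncentrality parameter is f^2 * N.
power_f_test <- function(N, n_coeff, effect, alpha) {
  u <- n_coeff              # numerator degrees of freedom
  v <- N - n_coeff - 1      # denominator degrees of freedom
  lambda <- effect * (u + v + 1)
  crit <- qf(1 - alpha, u, v)
  pf(crit, u, v, ncp = lambda, lower.tail = FALSE)
}

p_small <- power_f_test(N = 1000, n_coeff = 1, effect = 0.0025, alpha = 0.05)
p_large <- power_f_test(N = 10000, n_coeff = 1, effect = 0.0025, alpha = 0.05)
p_null  <- power_f_test(N = 1000, n_coeff = 1, effect = 0, alpha = 0.05)
```

Under this formulation power rises with sample size for a fixed effect, and with a zero effect it reduces to the significance level.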
Exports all data from a 'Metaboprep' object to a structured directory format. For each data layer, the function creates a subdirectory containing:
- the primary data matrix ('data.tsv'),
- associated feature and sample metadata ('features.tsv', 'samples.tsv'),
- feature and sample summaries (if present, 'feature_summary.tsv', 'sample_summary.tsv'),
- a serialized feature tree (if present),
- and a 'config.yml' file with additional metadata and processing parameters.
export(metaboprep, directory, format = "metaboprep", ...)
metaboprep |
A 'Metaboprep' object containing the data to be exported. |
directory |
character, string specifying the path to the directory where the data should be written. |
format |
character, string specifying the format of the exported data - one of "metaboprep", "comets", or "metaboanalyst". |
... |
Arguments passed on to
|
the 'Metaboprep' object, invisibly, for use in pipes
Export Data to 'COMETS' format
export_comets(metaboprep, directory, layer = NULL)
metaboprep |
A 'Metaboprep' object containing the data to be exported. |
directory |
character, string specifying the path to the directory where the data should be written. |
layer |
character, the name of the 'metaboprep@data' layer (3rd array dimension) to write out |
the 'Metaboprep' object, invisibly, for use in pipes
Export Data to 'MetaboAnalyst' format
export_metaboanalyst(metaboprep, directory, layer = NULL, group_col = NULL)
metaboprep |
A 'Metaboprep' object containing the data to be exported. |
directory |
character, string specifying the path to the directory where the data should be written. |
layer |
character, the name of the 'metaboprep@data' layer (3rd array dimension) to write out |
group_col |
character, the column name in the 'metaboprep@samples' data identifying the group for one-factor analysis |
the 'Metaboprep' object, invisibly, for use in pipes
Export Data to 'Metaboprep' format
export_metaboprep(metaboprep, directory, ...)
metaboprep |
A 'Metaboprep' object containing the data to be exported. |
directory |
character, string specifying the path to the directory where the data should be written. |
... |
other parameters passed to |
the 'Metaboprep' object, invisibly, for use in pipes
This function allows you to 'describe' metabolite features using the describe() function from the psych package, as well as estimate variance, a dispersion index, the coefficient of variation, and Shapiro's W-statistic.
feature_describe(data)
data |
matrix, the metabolite data matrix. Samples in row, metabolites in columns |
a data frame of summary statistics for features (columns) of a matrix
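The extra statistics named above can be sketched per feature in base R. This is not the package's code: `describe_feature` is a hypothetical helper, and it assumes the usual definitions of the dispersion index (variance / mean) and the coefficient of variation (SD / mean).

```r
# Hypothetical helper: per-feature variance, dispersion index, CV and Shapiro W.
describe_feature <- function(x) {
  x <- x[!is.na(x)]
  c(n = length(x),
    mean = mean(x),
    var = var(x),
    disp_index = var(x) / mean(x),  # assumed definition: variance / mean
    cv = sd(x) / mean(x),           # assumed definition: SD / mean
    W = unname(shapiro.test(x)$statistic))
}

set.seed(7)
mat <- sapply(1:4, function(i) rnorm(100, mean = 40, sd = 5))
colnames(mat) <- paste0("f", 1:4)

# One row per feature (column of the input matrix).
stats <- as.data.frame(t(apply(mat, 2, describe_feature)))
stats
```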
This function estimates feature statistics for samples in a matrix of metabolite features.
feature_summary(metaboprep, source_layer = "input", outlier_udist = 5, tree_cut_height = 0.5, feature_selection = "max_var_exp", sample_ids = NULL, feature_ids = NULL, features_exclude = NULL, output = "data.frame")
metaboprep |
an object of class Metaboprep |
source_layer |
character, the data layer to summarise |
outlier_udist |
numeric, the distance, in SD units from the mean or IQR units from the median, beyond which values are identified as outliers. Default value is 5. |
tree_cut_height |
numeric, the threshold for feature independence in hierarchical clustering. Default is 0.5. |
feature_selection |
character, either 'max_var_exp' or 'least_missingness', how to select the independent feature within clusters |
sample_ids |
character, vector of sample ids to work with |
feature_ids |
character, vector of feature ids to work with |
features_exclude |
character, vector of feature ids indicating features to exclude from the sample and PCA summary analysis but keep in the data |
output |
character, type of output, either 'object' to return the updated metaboprep object, or 'data.frame' to return the data. |
This function estimates an appropriate distribution of effect sizes to simulate in a continuous trait power analysis.
find.cont.effect.sizes.2.sim(mydata)
mydata |
Your metabolite data matrix, with samples in rows |
a vector of effect sizes
ex_data = sapply(1:10, function(x){ rnorm(250, 40, 5) })
find.cont.effect.sizes.2.sim(ex_data)
This function estimates an appropriate distribution of effect sizes to simulate in a power analysis.
find.PA.effect.sizes.2.sim(mydata)
mydata |
Your metabolite data matrix, with samples in rows |
a vector of effect sizes
ex_data = sapply(1:10, function(x){ rnorm(250, 40, 5) })
find.PA.effect.sizes.2.sim(ex_data)
This function writes an output report
generate_report(metaboprep, output_dir, output_filename = NULL, project = "Project", format = "pdf", template = "qc_report")
metaboprep |
an object of class Metaboprep |
output_dir |
character, the directory to save to |
output_filename |
character, default NULL i.e. create from input object |
project |
character, name for the current project |
format |
character, write either 'html' or 'pdf' report |
template |
character, type of report to output; the only current option is "qc_report" |
This function (1) estimates an informative distribution of effect and power estimates given your data's total sample size, over a distribution of imbalanced sample sizes, and (2) returns a summary plot.
imbalanced_power_plot(mydata)
mydata |
a numeric data matrix with samples in rows and features in columns |
a ggplot2 object
ex_data = matrix(NA, 1000, 2)
imbalanced_power_plot(ex_data)
This function estimates missingness in a matrix of data and provides an option to exclude certain columns or features from the analysis, such as xenobiotics (with high missingness rates) in metabolomics data sets.
missingness(data, by = "row")
data |
matrix, a numeric matrix with samples in rows and features in columns |
by |
character, whether to calculate missingness by rows (samples) or column (features) |
data.frame, a data frame of missingness estimates for each sample/feature.
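The underlying calculation is the fraction of `NA` values per row or per column. A minimal base R sketch follows; `missing_by` and its output column names are hypothetical and need not match what `missingness()` returns.

```r
# Hypothetical helper: fraction of missing values per sample (row) or feature (column).
missing_by <- function(data, by = "row") {
  frac <- if (by == "row") rowMeans(is.na(data)) else colMeans(is.na(data))
  data.frame(id = names(frac), missingness = unname(frac))
}

m <- matrix(c(1, NA, 3, 4, NA, NA), nrow = 2,
            dimnames = list(c("s1", "s2"), c("f1", "f2", "f3")))
by_sample  <- missing_by(m, by = "row")
by_feature <- missing_by(m, by = "column")
```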
This function performs a multivariate analysis over a dependent|response variable and numerous independent|explanatory variables
multivariate_anova(dep, indep_df)
dep |
a vector of the dependent variable |
indep_df |
a data frame of the independent variable |
ggplot2 table figure of
## simulate some correlated data
set.seed(1110)
n <- 250
mu <- c(5, 45, 25)
cmat <- matrix(c(1, 0.5, 0.3, 0.5, 1, 0.25, 0.3, 0.25, 1), nrow = 3, byrow = TRUE)
L <- chol(cmat)
Z <- matrix(rnorm(n * 3), nrow = n)
ex_data <- Z %*% L
ex_data <- sweep(ex_data, 2, mu, "+")
colnames(ex_data) = c("outcome","age","bmi")
multivariate_anova(dep = ex_data[,1], indep_df = ex_data[, 2:3])
Given a vector or matrix, this function returns a vector or matrix of 0|1, of the same structure with 1 values indicating outliers.
outlier_detection(data, nsd = 5, meansd = FALSE, by = "column")
data |
a matrix of numerical values, samples in row, features in columns |
nsd |
numeric, the distance, in SD units from the mean or IQR units from the median, beyond which values are identified as outliers. Default value is 5. |
meansd |
set to TRUE if you would like to estimate outliers using a mean and SD method; set to FALSE if you would like to estimate medians and inter quartile ranges. The default is FALSE. |
by |
character, either 'column' to compute along columns or 'row' to compute across rows. Irrelevant for vectors. |
a matrix of 0 (not a sample outlier) and 1 (outlier)
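The two flagging rules described by `meansd` can be sketched as follows. This is an illustration, not the package's implementation: `flag_outliers` is a hypothetical helper, and the exact cutoffs used by `outlier_detection()` may differ in detail.

```r
# Hypothetical helper: flag values further than nsd spread units from the centre.
flag_outliers <- function(x, nsd = 5, meansd = FALSE) {
  if (meansd) {
    centre <- mean(x, na.rm = TRUE)
    spread <- sd(x, na.rm = TRUE)
  } else {
    centre <- median(x, na.rm = TRUE)
    spread <- IQR(x, na.rm = TRUE)
  }
  as.integer(!is.na(x) & abs(x - centre) > nsd * spread)
}

m <- cbind(a = c(1:9, 1000), b = 1:10)
# Apply the rule to each column (feature); the result has the shape of m.
flags <- apply(m, 2, flag_outliers)
flags
```

The median and IQR rule is less affected by the extreme value itself than the mean and SD rule.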
This function identifies outliers from a vector of data at SD units from the mean.
outliers(x, nsd = 3)
x |
a numerical vector of data |
nsd |
the number of SD units from the mean to be used as an outlier cutoff. |
a list object of length three: (1) a vector of sample indexes indicating the outliers, (2) the lower outlier cutoff value, (3) the upper outlier cutoff value.
ex_data = rnbinom(500, mu = 40, size = 5)
outliers(ex_data)
This function performs principal component analysis. First, missing data are imputed to the median. After the PCs are derived, the median-imputed PC data are used to identify the number of informative or "significant" PCs by (1) an acceleration analysis and (2) a parallel analysis. Finally, the number of sample outliers is determined at 3, 4, and 5 standard deviations from the mean on the top PCs, as determined by the acceleration factor analysis.
pc_and_outliers(metaboprep, source_layer = "input", sample_ids = NULL, feature_ids = NULL)
metaboprep |
an object of class Metaboprep |
source_layer |
character, type/source of data to use |
sample_ids |
character, vector of sample ids to include, default NULL includes all |
feature_ids |
character, vector of feature ids to include, default NULL includes all |
a data.frame
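The impute, decompose, and flag sequence can be sketched with base R on simulated data. This is an illustration only: the number of PCs is fixed at two here instead of being chosen by acceleration or parallel analysis, and all object names are hypothetical.

```r
set.seed(1)
x <- matrix(rnorm(200 * 10), nrow = 200)
x[sample(length(x), 50)] <- NA  # sprinkle in some missing values
x[1, ] <- 25                    # make the first sample extreme

# Step 1: impute missing values to the feature median.
impute_median <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}
x_imp <- apply(x, 2, impute_median)

# Step 2: principal component analysis on centred, scaled data.
pca <- prcomp(x_imp, center = TRUE, scale. = TRUE)

# Step 3: flag samples beyond 5 SD on the top PCs (two PCs assumed here).
scores <- pca$x[, 1:2]
z <- scale(scores)
outlier <- apply(abs(z) > 5, 1, any)
which(outlier)
```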
This function is a wrapper that performs the key quality control steps on a metabolomics data set. Key principles:
1. keep the source underlying data as it is
2. copy the source data to a new data layer called qcing for processing
3. build an exclusion list, accumulating codes for exclusion reasons
4. make any adjustments needed in the destination copy of the data, flag these in the exclusion list
5. copy the final result to a data layer called post_qc
6. return the Metaboprep object with the newly populated data layers
quality_control(metaboprep, source_layer = "input", sample_missingness = 0.2, feature_missingness = 0.2, total_peak_area_sd = 5, outlier_udist = 5, outlier_treatment = "leave_be", winsorize_quantile = 1, tree_cut_height = 0.5, feature_selection = "max_var_exp", pc_outlier_sd = 5, max_num_pcs = 10, sample_ids = NULL, feature_ids = NULL, features_exclude_but_keep = NULL)
metaboprep |
an object of class Metaboprep |
source_layer |
character, the data layer to summarise |
sample_missingness |
numeric 0-1, proportion of missing data which should prompt exclusion of a sample |
feature_missingness |
numeric 0-1, proportion of missing data which should prompt exclusion of a feature |
total_peak_area_sd |
numeric, number of TPA SD after which a sample would be excluded |
outlier_udist |
numeric, the distance, in SD units from the mean or IQR units from the median, beyond which values are identified as outliers. Default value is 5. |
outlier_treatment |
character, how to handle outlier data values - options 'leave_be', 'turn_NA', or 'winsorize' |
winsorize_quantile |
numeric, quantile to winsorize to, only relevant if 'outlier_treatment'='winsorize' |
tree_cut_height |
numeric, the threshold for feature independence in hierarchical clustering. Default is 0.5. |
feature_selection |
character, either 'max_var_exp' or 'least_missingness', how to select the independent feature within clusters |
pc_outlier_sd |
numeric, number of PCA SD after which a sample would be excluded |
max_num_pcs |
numeric, the maximum number of PCs to use (look in) when filtering samples on PC outlier SD, default=10, set to NULL to use all informative PCs from the Scree analysis |
sample_ids |
character, vector of sample ids to retain and work with, all other samples will be excluded |
feature_ids |
character, vector of feature ids to retain and work with, all other features will be excluded |
features_exclude_but_keep |
character, vector of feature ids indicating features to exclude from the sample and PCA quality control analysis but keep in the data, OR a name of a logical column in the features data indicating the same |
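The idea of leaving the source data untouched while accumulating exclusion codes can be illustrated with plain matrices and lists. This sketch covers only the missingness filters; the order of the filters and the strict `>` comparison are assumptions, and the objects are simple stand-ins for the 'Metaboprep' slots.

```r
set.seed(42)
input <- matrix(rnorm(20 * 5), nrow = 20,
                dimnames = list(paste0("s", 1:20), paste0("f", 1:5)))
input["s3", 1:3] <- NA    # a sample with many missing values
input[1:10, "f5"] <- NA   # a feature with 50% missing values

# Exclusion codes accumulate here; `input` itself is never modified.
exclusions <- list(samples = list(), features = list())

# Feature filter first (assumed order), threshold 0.2 as in the defaults above.
feat_miss <- colMeans(is.na(input))
exclusions$features$user_defined_feature_missingness <- names(feat_miss)[feat_miss > 0.2]
keep_f <- setdiff(colnames(input), unlist(exclusions$features))

# Sample filter on the remaining features.
samp_miss <- rowMeans(is.na(input[, keep_f, drop = FALSE]))
exclusions$samples$user_defined_sample_missingness <- names(samp_miss)[samp_miss > 0.2]
keep_s <- setdiff(rownames(input), unlist(exclusions$samples))

# The filtered copy plays the role of the post_qc layer.
post_qc <- input[keep_s, keep_f]
dim(post_qc)
```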
Read Metabolon Data
read_metabolon(filepath, sheet = NULL, feature_sheet = NULL, feature_id_col = NULL, sample_sheet = NULL, sample_id_col = NULL, return_Metaboprep = TRUE)
filepath |
character, commercial Metabolon excel sheet with extension .xls or .xlsx |
sheet |
character or integer, the excel sheet name (or index) from which to read. |
feature_sheet |
character or integer, the excel sheet name (or index) from which to read the feature data. |
feature_id_col |
character, the excel column containing the feature_id mapping to the data. |
sample_sheet |
character or integer, the excel sheet name (or index) from which to read the sample data. |
sample_id_col |
character, the excel column containing the sample_id mapping to the data. |
return_Metaboprep |
logical, if TRUE (default) return a Metaboprep object, if FALSE return a list. |
list or Metaboprep object, list(data = matrix, samples = samples data.frame, features = features data.frame)
# version 1.1 data format
filepath1 <- system.file("extdata", "metabolon_v1.1_example.xlsx", package = "metaboprep")
m <- read_metabolon(filepath1, sheet = 2)
# version 1.2 data format (different column names)
filepath2 <- system.file("extdata", "metabolon_v1.2_example.xlsx", package = "metaboprep")
m <- read_metabolon(filepath2, sheet = 'OrigScale')
# version 2 data format
filepath3 <- system.file("extdata", "metabolon_v2_example.xlsx", package = "metaboprep")
m <- read_metabolon(filepath3, sheet = 'Batch-normalized Data', feature_sheet = 'Chemical Annotation', feature_id_col = 'CHEM_ID', sample_sheet = 'Sample Meta Data', sample_id_col = 'PARENT_SAMPLE_NAME')
Read Nightingale Data (format 1)
read_nightingale(filepath, return_Metaboprep = TRUE)
filepath |
character, commercial Nightingale excel sheet with extension .xls or .xlsx |
return_Metaboprep |
logical, if TRUE (default) return a Metaboprep object, if FALSE return a list. |
list or Metaboprep object, list(data = matrix, samples = samples data.frame, features = features data.frame)
# version 1 data format
filepath1 <- system.file("extdata", "nightingale_v1_example.xlsx", package = "metaboprep")
m <- read_nightingale(filepath1)
# version 2 data format
filepath2 <- system.file("extdata", "nightingale_v2_example.xlsx", package = "metaboprep")
m <- read_nightingale(filepath2)
This function reads and processes an Olink NPX file in long format. It supports '.csv', '.xls', '.xlsx', '.txt', '.zip', and '.parquet' formats, using Olink's own OlinkAnalyze::read_NPX() function, and returns a metaboprep object or a list of matrices and metadata frames for further analysis.
read_olink(filepath, return_Metaboprep = FALSE)
filepath |
A string specifying the path to the Olink NPX file. |
return_Metaboprep |
logical, if TRUE return a Metaboprep object, if FALSE (default) return a list. |
The function checks whether the input data is in long format by verifying the presence of duplicate 'SampleID' values. It also accommodates two variants of Olink files:
Files that include a 'Sample_Type' column with values '"SAMPLE"' and '"CONTROL"'.
Files that use the 'SampleID' column to label control samples (e.g., entries containing '"CONTROL"').
If neither format is detected, the function stops with an error indicating that the data is likely not from Olink.
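The long-format check and the split into sample and control matrices can be sketched on a toy table. This is an illustration, not the function's code: the data are invented, only the 'SampleID'-based control variant is shown, and the `to_wide` helper is hypothetical.

```r
# Toy long-format table: one row per sample and assay (values are made up).
long <- data.frame(
  SampleID = rep(c("A1", "A2", "CONTROL_1"), each = 2),
  OlinkID  = rep(c("OID0001", "OID0002"), times = 3),
  NPX      = c(1.2, 3.4, 0.8, 2.9, 1.0, 3.0),
  stringsAsFactors = FALSE
)

# Long format implies each SampleID appears more than once.
is_long <- anyDuplicated(long$SampleID) > 0

# Variant shown: controls are labelled within the SampleID column.
is_control <- grepl("CONTROL", long$SampleID, fixed = TRUE)

# Hypothetical helper: pivot to a SampleID x OlinkID matrix of NPX values.
to_wide <- function(d) with(d, tapply(NPX, list(SampleID, OlinkID), mean))
data_mat    <- to_wide(long[!is_control, ])
control_mat <- to_wide(long[is_control, ])
```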
Metaboprep object or a named list with the following elements:
A matrix of NPX values with 'SampleID' as rows and 'OlinkID' as columns, containing only sample data.
A 'data.frame' containing metadata for samples.
A 'data.frame' containing feature-level metadata for samples.
A matrix of NPX values for control samples.
A 'data.frame' containing metadata for control samples.
## Not run:
filepath <- system.file("extdata", "example_olink_data.txt", package = "metaboprep")
olink_data <- read_olink(filepath)
## End(Not run)
This function reads and processes a commercial SomaLogic '.adat' file. It extracts RFU (Relative Fluorescence Units) data for samples and controls, along with their respective metadata and feature (protein) metadata. The function returns a structured list suitable for further analysis.
read_somalogic(filepath, return_Metaboprep = FALSE)
filepath |
A string specifying the path to the SomaLogic '.adat' file. |
return_Metaboprep |
logical, if TRUE return a Metaboprep object, if FALSE (default) return a list. |
The function performs several validation steps and data transformations:
It first checks if the provided 'filepath' points to an '.adat' file.
It uses 'SomaDataIO::read_adat()' to import the raw data and 'SomaDataIO::is.soma_adat()' to verify its integrity as a SomaLogic object.
It confirms the presence of the crucial 'SampleId' column.
Data is separated into experimental "Sample" and "Calibrator" control groups based on the 'SampleType' column.
For both sample and control data, RFU values corresponding to "seq" columns are extracted and reshaped into wide matrices with 'SampleId' as row names.
Feature metadata is extracted from 'attr(df, "Col.Meta")', and a new 'feature_id' column is created (prefixed with "seq." and hyphens replaced by periods).
Sample and control metadata are extracted from 'attr(df, "row_meta")', converted to 'tibble's, renamed, and 'sample_id' is relocated to the front. Explicit 'tibble::as_tibble()' and 'tibble::remove_rownames()' are used to handle potential 'SomaDataIO' object intricacies.
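The 'feature_id' construction described in the steps above amounts to a small string transformation. A minimal sketch, using illustrative identifiers that are not read from a real '.adat' file:

```r
# Illustrative SeqId values; real ones come from the feature metadata.
seq_ids <- c("10000-28", "10001-7")

# Prefix with "seq." and replace hyphens with periods, as described above.
feature_id <- paste0("seq.", gsub("-", ".", seq_ids, fixed = TRUE))
feature_id
```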
A metaboprep object or a named list with the following elements:
A matrix of RFU values for experimental samples, with 'SampleId' as row names and 'SeqId' (from columns containing "seq") as column names.
A 'tibble' containing metadata for experimental samples, with 'sample_id' (renamed from 'SampleId') as the first column.
A 'tibble' containing feature-level metadata (e.g., protein details), including a newly created 'feature_id' column derived from 'SeqId'.
A matrix of RFU values for control samples (specifically "Calibrator" samples), with 'SampleId' as row names and 'SeqId' as column names.
A 'tibble' containing metadata for control samples (specifically "Calibrator" samples), with 'sample_id' as the first column.
## Not run:
filepath <- system.file("extdata", "example_data10.adat", package = "SomaDataIO")
somalogic_data <- read_somalogic(filepath)
## End(Not run)
This function runs the original metaboprep1 pipeline using the old parameter file input format. The function requires access to the internet as the old package (default github commit 'bbe1f85') will be dynamically downloaded and used to process the data.
run_metaboprep1(parameter_file, gitcommit = "bbe1f85", attempt_report = FALSE)
parameter_file |
character, full file path to the metaboprep 1 parameter file |
gitcommit |
character, Github commit - default pinned to the last stable metaboprep 1 version 'bbe1f85' |
attempt_report |
logical, whether to attempt metaboprep1 report generation. Default=FALSE as this can lead to errors on some operating systems. |
Summarise the sample data
sample_summary(metaboprep, source_layer = "input", outlier_udist = 5, sample_ids = NULL, feature_ids = NULL, output = "data.frame")
metaboprep |
an object of class Metaboprep |
source_layer |
character, the data layer to summarise |
outlier_udist |
numeric, the distance, in SD units from the mean or IQR units from the median, beyond which values are identified as outliers. Default value is 5. |
sample_ids |
character, vector of sample ids to work with |
feature_ids |
character, vector of feature ids to work with |
output |
character, type of output, either 'object' to return the updated metaboprep object, or 'data.frame' to return the data. |
Launch a Shiny app to explore the Metaboprep object
shiny_app(metaboprep)
metaboprep |
an object of class Metaboprep |
Runs a Shiny app
Summarise the sample and feature data
summarise(
  metaboprep,
  source_layer = "input",
  outlier_udist = 5,
  tree_cut_height = 0.5,
  feature_selection = "max_var_exp",
  sample_ids = NULL,
  feature_ids = NULL,
  features_exclude = NULL,
  output = "data.frame"
)
metaboprep |
an object of class Metaboprep |
source_layer |
character, the data layer to summarise |
outlier_udist |
numeric, the unit distance, in SD from the mean or in IQR from the median, beyond which values are identified as outliers. Default is 5. |
tree_cut_height |
numeric, the threshold for feature independence in hierarchical clustering. Default is 0.5. |
feature_selection |
character, how to select the independent feature within each cluster: either 'max_var_exp' or 'least_missingness'. |
sample_ids |
character, vector of sample ids to work with |
feature_ids |
character, vector of feature ids to work with |
features_exclude |
character, vector of feature ids to exclude from the sample and PCA summary analysis but keep in the data |
output |
character, type of output, either 'object' to return the updated metaboprep object, or 'data.frame' to return the data. |
Provides a concise, human-readable summary of a 'Metaboprep' object. It reports key dimensions of the data, the presence of metadata columns, the number of data layers, and the status of quality control summaries and exclusions.
object |
A 'Metaboprep' object. |
... |
Additional arguments (not used). |
Invisibly returns NULL. Prints a formatted summary to the console.
This function estimates the total peak abundance (area) for numeric data in a matrix, for (1) all features and (2) all features with complete data.
total_peak_area(data, ztransform = TRUE)
data |
matrix, the metabolite data matrix. Samples in rows, metabolites in columns |
ztransform |
logical, should the feature data be z-transformed and absolute value minimum, mean shifted prior to summing the feature values. TRUE or FALSE. |
A data frame of estimates of (1) total peak abundance and (2) total peak abundance at complete features, for each sample.
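The two totals described above can be sketched as follows. This is a rough illustration only: the helper name `sketch_total_peak_area` and the output column names are hypothetical, and the handling of `ztransform` (z-transform each feature, then shift it by the absolute value of its minimum so that all values are non-negative) is one reading of the argument description, not confirmed package behaviour.

```r
# Hypothetical sketch (not the package's implementation).
# Samples in rows, features in columns.
sketch_total_peak_area <- function(data, ztransform = TRUE) {
  if (ztransform) {
    # z-transform each feature, then shift each feature so its minimum is 0
    # (assumed interpretation of the "absolute value minimum" shift)
    data <- scale(data)
    mins <- apply(data, 2, min, na.rm = TRUE)
    data <- sweep(data, 2, abs(mins), "+")
  }
  # features with no missing values
  complete <- colSums(is.na(data)) == 0
  data.frame(
    tpa_total = rowSums(data, na.rm = TRUE),
    tpa_complete_features = rowSums(data[, complete, drop = FALSE])
  )
}

set.seed(1)
m <- matrix(rlnorm(50), nrow = 10,
            dimnames = list(paste0("s", 1:10), paste0("f", 1:5)))
m[2, 3] <- NA  # feature f3 is now incomplete
res <- sketch_total_peak_area(m)
head(res)
```

Restricting the second total to complete features keeps samples comparable, since a sample with many missing values would otherwise have an artificially low sum.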
This function identifies independent features using Spearman's rho correlation distances, and a dendrogram tree cut step.
tree_and_independent_features(
  data,
  tree_cut_height = 0.5,
  features_exclude = NULL,
  feature_selection = "max_var_exp"
)
data |
matrix, the metabolite data matrix. Samples in rows, metabolites in columns |
tree_cut_height |
numeric, the tree cut height. Default is 0.5. A value of 0.2 (1 - Spearman's rho) is equivalent to saying that features with a rho >= 0.8 are NOT independent. |
features_exclude |
character, vector of feature ids to exclude from the sample and PCA summary analysis but keep in the data |
feature_selection |
character. Method for selecting a representative feature from each correlated feature cluster. One of 'max_var_exp' or 'least_missingness'. |
A list with the following components:
A 'data.frame' with:
'feature_id': Feature (column) names from the input matrix.
'k': The cluster index assigned to each feature after tree cutting.
'independent_features': Logical indicator of whether the feature was selected as an independent (representative) feature.
A 'hclust' object representing the hierarchical clustering of the features based on 1 - |Spearman's rho| distance.
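The clustering and tree cut step can be sketched in base R as below. This is an illustration under assumptions, not the package's code: the linkage method (`"complete"`, the `hclust` default) is assumed, and the representative of each cluster is picked by least missingness with ties broken by column order, which only approximates the 'least_missingness' option.

```r
set.seed(42)
n <- 50
base <- rnorm(n)
# Toy data: f2 is nearly a copy of f1; f3 and f4 are unrelated noise
data <- cbind(
  f1 = base,
  f2 = base + rnorm(n, sd = 0.05),
  f3 = rnorm(n),
  f4 = rnorm(n)
)

# Distance between features: 1 - |Spearman's rho|
rho  <- cor(data, method = "spearman", use = "pairwise.complete.obs")
d    <- as.dist(1 - abs(rho))
tree <- hclust(d, method = "complete")  # linkage method is an assumption

# Cut the dendrogram at the chosen height to define clusters
k <- cutree(tree, h = 0.5)

# Pick one representative per cluster (here: least missingness)
missingness <- colMeans(is.na(data))
representative <- tapply(names(k), k, function(ids) {
  ids[which.min(missingness[ids])]
})

result <- data.frame(
  feature_id = names(k),
  k = unname(k),
  independent_features = names(k) %in% representative
)
result
```

Lowering the cut height makes the criterion stricter, so only very strongly correlated features are grouped and more features are retained as independent.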
This function performs a univariate linear analysis of a dependent and an independent variable and generates a violin or box plot to illustrate the associated structure.
variable_by_factor(
  dep,
  indep,
  dep_name = "dependent",
  indep_name = "independent",
  orderfactor = TRUE,
  violin = TRUE
)
dep |
a vector of the dependent variable |
indep |
a vector of the independent variable |
dep_name |
name of the dependent variable |
indep_name |
name of the independent variable |
orderfactor |
logical, whether to order the factor levels alphabetically |
violin |
logical, whether to draw a violin plot (TRUE, the default) or a box plot (FALSE) |
a ggplot2 object
x = c( rnorm(20, 10, 2), rnorm(20, 20, 2) )
y = as.factor( c( rep("A", 20), rep("B", 20) ) )
variable_by_factor(dep = x, indep = y, dep_name = "expression", indep_name = "species" )
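The univariate linear analysis behind such a plot can be approximated in base R. The sketch below is an assumption about the general approach (a linear model of the dependent variable on the factor, summarised with an ANOVA table); it does not reproduce the function's internals or its ggplot2 output.

```r
set.seed(123)
dep   <- c(rnorm(20, 10, 2), rnorm(20, 20, 2))
indep <- factor(c(rep("A", 20), rep("B", 20)))

# Univariate linear model: dependent variable explained by the factor
fit <- lm(dep ~ indep)
aov_table <- anova(fit)

# Proportion of variance explained by the factor, and its p-value
var_explained <- aov_table[["Sum Sq"]][1] / sum(aov_table[["Sum Sq"]])
p_value <- aov_table[["Pr(>F)"]][1]

c(var_explained = var_explained, p_value = p_value)
```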