Read in the Metabolon data using the read_metabolon() function.
Here we read in the example data provided with the package; the
result is a list object with three elements: data, samples and features.
str(dat)
#> List of 3
#> $ data : num [1:100, 1:104] 98551 43695 44899 37811 36825 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:100] "ind1" "ind2" "ind3" "ind4" ...
#> .. ..$ : chr [1:104] "123" "124" "125" "126" ...
#> $ samples :'data.frame': 100 obs. of 5 variables:
#> ..$ sample_id : chr [1:100] "ind1" "ind2" "ind3" "ind4" ...
#> ..$ parent_sample_id : chr [1:100] "ps_id" "ps_id" "ps_id" "ps_id" ...
#> ..$ client_identifier: chr [1:100] "FR01234" "FR01235" "FR01236" "FR01237" ...
#> ..$ pair : chr [1:100] "99999" "99999" "99999" "99999" ...
#> ..$ volume_extracted : chr [1:100] "100" "100" "100" "100" ...
#> $ features:'data.frame': 104 obs. of 14 variables:
#> ..$ feature_id : chr [1:104] "123" "124" "125" "126" ...
#> ..$ pathway_sortorder: chr [1:104] "1" "2" "3" "4" ...
#> ..$ biochemical : chr [1:104] "(N(1) + N(8))-acetylspermidine" "1,2,3-benzenetriol sulfate (2)" "1,2-dilinoleoyl-GPC (18:2/18:2)" "1,2-dilinoleoyl-GPE (18:2/18:2)*" ...
#> ..$ super_pathway : chr [1:104] "Amino Acid" "Xenobiotics" "Lipid" "Lipid" ...
#> ..$ sub_pathway : chr [1:104] "Polyamine Metabolism" "Chemical" "Phosphatidylcholine (PC)" "Phosphatidylethanolamine (PE)" ...
#> ..$ comp_id : chr [1:104] "123" "124" "125" "126" ...
#> ..$ platform : chr [1:104] "LC/MS Pos Early" "LC/MS Neg" "LC/MS Pos Late" "LC/MS Pos Late" ...
#> ..$ chemical_id : chr [1:104] "1111" "1112" "1113" "1114" ...
#> ..$ ri : chr [1:104] "2221" "2222" "2223" "2224" ...
#> ..$ mass : chr [1:104] "111.111" "111.11199999999999" "111.113" "111.114" ...
#> ..$ cas : chr [1:104] NA NA "111-11-1" NA ...
#> ..$ pubchem : chr [1:104] NA NA "11111" "11112" ...
#> ..$ kegg : chr [1:104] NA NA NA NA ...
#> ..$ group_hmdb : chr [1:104] NA NA "HMDB123" "HMDB124" ...
Once imported, we pass the data to the Omiprep() function to build
the Omiprep class object.
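Before building the object, it can help to confirm that the three list elements line up. The sketch below uses a small made-up stand-in for `dat` (the values are invented; only the element and column names follow the `str(dat)` output above) and checks that the matrix row names match `sample_id` and the column names match `feature_id`. Whether Omiprep() performs a similar check itself is not stated here, so treat this as an optional sanity check rather than a required step.

```r
# Toy stand-in for the imported list; all values are made up for illustration.
dat <- list(
  data = matrix(
    c(1.0, 2.0, NA, 4.0, 5.0, 6.0),
    nrow = 2,
    dimnames = list(c("ind1", "ind2"), c("123", "124", "125"))
  ),
  samples = data.frame(
    sample_id = c("ind1", "ind2"),
    stringsAsFactors = FALSE
  ),
  features = data.frame(
    feature_id = c("123", "124", "125"),
    stringsAsFactors = FALSE
  )
)

# Rows of the matrix should correspond to samples, columns to features.
rows_ok <- identical(rownames(dat$data), dat$samples$sample_id)
cols_ok <- identical(colnames(dat$data), dat$features$feature_id)

# Stop early with an error if either annotation table is out of step.
stopifnot(rows_ok, cols_ok)
```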
summary(mydata)
#> Omiprep Object Summary
#> --------------------------
#> Samples : 100
#> Features : 104
#> Data Layers : 1
#> Layer Names : input
#>
#> Sample Summary Layers : none
#> Feature Summary Layers: none
#>
#> Sample Annotation (metadata):
#> Columns: 5
#> Names : sample_id, parent_sample_id, client_identifier, pair, volume_extracted
#>
#> Feature Annotation (metadata):
#> Columns: 14
#> Names : feature_id, pathway_sortorder, biochemical, super_pathway, sub_pathway, comp_id, platform, chemical_id, ri, mass, cas, pubchem, kegg, group_hmdb
#>
#> Exclusion Codes Summary:
#>
#> Sample Exclusions:
#> Exclusion | Count
#> -----------------
#> user_excluded | 0
#> extreme_sample_missingness | 0
#> user_defined_sample_missingness | 0
#> user_defined_sample_totalpeakarea | 0
#> user_defined_sample_pca_outlier | 0
#>
#> Feature Exclusions:
#> Exclusion | Count
#> -----------------
#> user_excluded | 0
#> extreme_feature_missingness | 0
#> user_defined_feature_missingness | 0
#> user_defined_feature_skewness | 0
Use the feature data just imported to identify xenobiotic metabolites. It may be best to exclude these features from the quality-control (QC) process. Xenobiotics typically exhibit much higher levels of missingness than endogenous metabolites, and including them in QC can result in excessive exclusion of both features and samples. Flagging them at this step lets you retain them in the final dataset while keeping them out of the QC filtering steps.
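One way to pick out the xenobiotics is sketched below in base R. It uses a made-up feature table with the same `feature_id` and `super_pathway` columns shown in the `str(dat)` output; with the real data you would use `dat$features` instead. It is an assumption here that `features_exclude_but_keep` expects a character vector of feature IDs, and the name `xenos` is chosen only to match the later `quality_control()` call.

```r
# Toy feature annotation; rows are invented, column names follow dat$features.
features <- data.frame(
  feature_id    = c("123", "124", "125", "126"),
  super_pathway = c("Amino Acid", "Xenobiotics", "Lipid", NA),
  stringsAsFactors = FALSE
)

# %in% returns FALSE for NA, so features with no pathway label are not selected.
is_xeno <- features$super_pathway %in% "Xenobiotics"

# Character vector of feature IDs annotated as xenobiotics.
xenos <- features$feature_id[is_xeno]
xenos
```

Using `%in%` rather than `==` avoids `NA` entries creeping into the index when `super_pathway` is missing for some features.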
Perform the QC steps using the quality_control function,
specifying the xenobiotics to exclude from the QC steps.
## Given the high missingness in Metabolon data,
## we suggest using the `least_missingness` feature selection method
## to identify the principal variables that will then be
## used in the construction of PCs.
mydata <- mydata |>
  quality_control(
    source_layer              = "input",
    sample_missingness        = 0.2,
    feature_missingness       = 0.2,
    total_sum_abundance_sd    = 5,
    outlier_udist             = 5,
    outlier_treatment         = "leave_be",
    winsorize_quantile        = 1.0,
    tree_cut_height           = 0.5,
    pc_outlier_sd             = 5,
    feature_selection         = "least_missingness", ## suggested for high-missingness data such as Metabolon; default is "max_var_exp"
    features_exclude_but_keep = xenos,               ## exclude xenobiotics from QC, but retain them in the final dataset
    cores                     = 1
  )
#>
#> ── Starting Omics QC Process ───────────────────────────────────────────────────
#> ℹ Validating input parameters
#>
#> ✔ Validating input parameters [13ms]
#>
#> ℹ Excluding 0 features from sample summary analysis but keeping in output data
#> ✔ Excluding 7 features from sample summary analysis but keeping in output data …
#>
#> ℹ Sample & Feature Summary Statistics for raw data
#> AF = 2
#> ✔ Sample & Feature Summary Statistics for raw data [567ms]
#>
#> ℹ Copying input data to new 'qc' data layer
#> ✔ Copying input data to new 'qc' data layer [23ms]
#>
#> ℹ Assessing for extreme sample missingness >=80% - excluding 0 sample(s)
#> ✔ Assessing for extreme sample missingness >=80% - excluding 1 sample(s) [20ms]
#>
#> ℹ Assessing for extreme feature missingness >=80% - excluding 0 feature(s)
#> ✔ Assessing for extreme feature missingness >=80% - excluding 0 feature(s) [16m…
#>
#> ℹ Assessing for sample missingness at specified level of >=20% - excluding 0 sa…
#> ✔ Assessing for sample missingness at specified level of >=20% - excluding 0 sa…
#>
#> ℹ Assessing for feature missingness at specified level of >=20% - excluding 0 f…
#> ✔ Assessing for feature missingness at specified level of >=20% - excluding 1 f…
#>
#> ℹ Calculating total sum abundance outliers at +/- 5 Sdev - excluding 0 sample(s)
#> ✔ Calculating total sum abundance outliers at +/- 5 Sdev - excluding 0 sample(s…
#>
#> ℹ Running sample data PCA outlier analysis at +/- 5 Sdev
#> ✔ Running sample data PCA outlier analysis at +/- 5 Sdev [29ms]
#>
#> ℹ Sample PCA outlier analysis - re-identify feature independence and PC outlier…
#> AF = 2
#> ! The stated max PCs [max_num_pcs=10] to use in PCA outlier assessment is greater than the number of available informative PCs [2]
#> ✔ Sample PCA outlier analysis - re-identify feature independence and PC outlier…
#>
#> ℹ Creating final QC dataset...
#> AF = 2
#>
#> ── Step timings ──
#>   step                         seconds   pct
#>   validation                      0.02   1.0
#>   summarise_raw                   0.55  27.2
#>   copy_layer                      0.00   0.0
#>   extreme_sample_missingness      0.00   0.0
#>   extreme_feature_missingness     0.00   0.0
#>   sample_missingness              0.00   0.0
#>   total_sum_abundance             0.00   0.0
#>   summarise_pca                   0.58  28.7
#>   summarise_final                 0.63  31.2
#>   total                           2.02  99.9
#> ✔ Creating final QC dataset... [678ms]
#>
#> ℹ 'Omics QC Process Completed
#> ✔ 'Omics QC Process Completed [21ms]
summary(mydata)
#> Omiprep Object Summary
#> --------------------------
#> Samples : 100
#> Features : 104
#> Data Layers : 2
#> Layer Names : input, qc
#>
#> Sample Summary Layers : input, qc
#> Feature Summary Layers: input, qc
#>
#> Sample Annotation (metadata):
#> Columns: 7
#> Names : sample_id, parent_sample_id, client_identifier, pair, volume_extracted, reason_excluded, excluded
#>
#> Feature Annotation (metadata):
#> Columns: 16
#> Names : feature_id, pathway_sortorder, biochemical, super_pathway, sub_pathway, comp_id, platform, chemical_id, ri, mass, cas, pubchem, kegg, group_hmdb, reason_excluded, excluded
#>
#> Exclusion Codes Summary:
#>
#> Sample Exclusions:
#> Exclusion | Count
#> -----------------
#> user_excluded | 0
#> extreme_sample_missingness | 1
#> user_defined_sample_missingness | 0
#> user_defined_sample_totalpeakarea | 0
#> user_defined_sample_pca_outlier | 0
#>
#> Feature Exclusions:
#> Exclusion | Count
#> -----------------
#> user_excluded | 0
#> extreme_feature_missingness | 0
#> user_defined_feature_missingness | 1
#> user_defined_feature_skewness | 0
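To make the missingness counts above easier to interpret, here is a minimal base R sketch of how per-sample and per-feature missingness can be computed and compared against the two levels reported in the QC log (extreme at >= 80%, user-defined at >= 20%). The matrix is made up, and the sketch does not reproduce the internals of quality_control(): it ignores the order in which the steps run, any recomputation after earlier exclusions, and the handling of features passed via `features_exclude_but_keep`.

```r
# Toy abundance matrix: 3 samples x 4 features; NA marks a missing measurement.
x <- matrix(
  c(10, NA, 12, 11,
    NA, NA, NA, NA,
    14, 15, 13, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("ind", 1:3), paste0("f", 1:4))
)

# Fraction of missing values per sample (rows) and per feature (columns).
sample_missingness  <- rowMeans(is.na(x))
feature_missingness <- colMeans(is.na(x))

# Samples at or above the extreme (0.8) and user-defined (0.2) levels.
extreme_samples <- names(sample_missingness)[sample_missingness >= 0.8]
flagged_samples <- names(sample_missingness)[sample_missingness >= 0.2]
```

In this toy example `ind2` is missing every measurement and so meets the extreme level, while `ind1` (one of four values missing) meets only the 20% level.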