Feature summary

Create Omiprep object

library(omiprep)

# import data
data     <- read.csv(system.file("extdata", "dummy_data.csv", package = "omiprep"), header=T, row.names = 1, check.names = FALSE) |> as.matrix()
samples  <- read.csv(system.file("extdata", "dummy_samples.csv", package = "omiprep"), header=T, row.names = 1)
features <- read.csv(system.file("extdata", "dummy_features.csv", package = "omiprep"), header=T, row.names = 1)
features$feature_id = as.character(features$feature_id)

# create object
mydata <- Omiprep(data = data, samples = samples, features = features)

Summary of Omiprep object

summary(mydata)
#> Omiprep Object Summary
#> --------------------------
#> Samples      : 125
#> Features     : 253
#> Data Layers  : 1
#> Layer Names  : input
#> 
#> Sample Summary Layers : none
#> Feature Summary Layers: none
#> 
#> Sample Annotation (metadata):
#>   Columns: 10
#>   Names  : sample_id, parent_sample_id, client_identifier, sex, age, bmi, LC.MS.Polar, LC.MS.Neg, LC.MS.Pos.Early, LC.MS.Pos.Late
#> 
#> Feature Annotation (metadata):
#>   Columns: 14
#>   Names  : feature_id, pathway_sortorder, biochemical, super_pathway, sub_pathway, comp_id, platform, chemical_id, ri, mass, cas, pubchem, kegg, group_hmdb
#> 
#> Exclusion Codes Summary:
#> 
#>   Sample Exclusions:
#> Exclusion | Count
#> -----------------
#> user_excluded                     | 0
#> extreme_sample_missingness        | 0
#> user_defined_sample_missingness   | 0
#> user_defined_sample_totalpeakarea | 0
#> user_defined_sample_pca_outlier   | 0
#> 
#>   Feature Exclusions:
#> Exclusion | Count
#> -----------------
#> user_excluded                    | 0
#> extreme_feature_missingness      | 0
#> user_defined_feature_missingness | 0
#> user_defined_feature_skewness    | 0

Run standard quality control

mydata = quality_control(mydata)
#> ── Starting Omics QC Process ───────────────────────────────────────────────────
#> ℹ Validating input parameters
#> ✔ Validating input parameters [6ms]
#> 
#> ℹ Sample & Feature Summary Statistics for raw data
#> ℹ Number of informative PCs (Scree acceleration factor): 2
#> ℹ Sample & Feature Summary Statistics for raw data✔ Sample & Feature Summary Statistics for raw data [2.3s]
#> 
#> ℹ Copying input data to new 'qc' data layer
#> ✔ Copying input data to new 'qc' data layer [22ms]
#> 
#> ℹ Assessing for extreme sample missingness >=80% - excluding 0 sample(s)
#> ✔ Assessing for extreme sample missingness >=80% - excluding 0 sample(s) [20ms]
#> 
#> ℹ Assessing for extreme feature missingness >=80% - excluding 0 feature(s)
#> ✔ Assessing for extreme feature missingness >=80% - excluding 5 feature(s) [22m…
#> 
#> ℹ Assessing for sample missingness at specified level of >=20% - excluding 0 sa…
#> ✔ Assessing for sample missingness at specified level of >=20% - excluding 1 sa…
#> 
#> ℹ Assessing for feature missingness at specified level of >=20% - excluding 0 f…
#> ✔ Assessing for feature missingness at specified level of >=20% - excluding 46 …
#> 
#> ℹ Calculating total sum abundance outliers at +/- 5 Sdev - excluding 0 sample(s)
#> ✔ Calculating total sum abundance outliers at +/- 5 Sdev - excluding 0 sample(s…
#> 
#> ℹ Running sample data PCA outlier analysis at +/- 5 Sdev
#> ✔ Running sample data PCA outlier analysis at +/- 5 Sdev [22ms]
#> 
#> ℹ Sample PCA outlier analysis - re-identify feature independence and PC outlier…
#> ℹ Number of informative PCs (Scree acceleration factor): 2
#> ℹ Sample PCA outlier analysis - re-identify feature independence and PC outlier…! The stated max PCs [max_num_pcs=10] to use in PCA outlier assessment is greater than the number of available informative PCs [2]
#> ℹ Sample PCA outlier analysis - re-identify feature independence and PC outlier…✔ Sample PCA outlier analysis - re-identify feature independence and PC outlier…
#> 
#> ℹ Creating final QC dataset...
#> ℹ Number of informative PCs (Scree acceleration factor): 2
#> ℹ Creating final QC dataset...
#> ℹ Creating final QC dataset...── Step timings ──
#> ℹ Creating final QC dataset...
#> ℹ Creating final QC dataset...
#>                         step seconds   pct
#>                   validation    0.00   0.0
#>                summarise_raw    2.26  38.9
#>                   copy_layer    0.00   0.0
#>   extreme_sample_missingness    0.00   0.0
#>  extreme_feature_missingness    0.00   0.0
#>           sample_missingness    0.00   0.0
#>          total_sum_abundance    0.01   0.2
#>                summarise_pca    1.64  28.2
#>              summarise_final    1.66  28.6
#>                        total    5.81 100.0
#> ✔ Creating final QC dataset... [1.7s]
#> 
#> ℹ 'Omics QC Process Completed
#> ✔ 'Omics QC Process Completed [16ms]

Feature Summary

View feature summary from the QC pipeline

# Note: the quality_control() ultimately returns the feature_summary attribute as a matrix.
df <- t( as.data.frame(mydata@feature_summary[, 1:5, "input"]) )
df <- as.data.frame( round( df , 3) )
df <- cbind(feature_id = rownames(df), df)

df |> knitr::kable( digits = 3, row.names = FALSE, align = "c") |>
  kableExtra::kable_styling(full_width = TRUE) 
feature_id missingness outlier_count n mean sd median min max range skew kurtosis se missing var disp_index coef_variance W log10_W k independent_features
48719 0.248 2 94 4.877 28.121 1 0.393 256.608 256.215 8.027 66.644 2.900 31 790.803 162.154 5.766 0.128 0.564 NA NA
43532 0.568 1 54 1.296 1.096 1 0.302 7.172 6.870 3.189 13.306 0.149 71 1.201 0.927 0.846 0.675 0.977 NA NA
46639 0.128 4 109 2.266 3.517 1 0.078 23.696 23.619 3.379 14.104 0.337 16 12.371 5.460 1.552 0.596 0.992 1 1
606 0.000 0 125 0.990 0.211 1 0.006 1.461 1.455 -1.367 5.791 0.019 0 0.045 0.045 0.214 0.903 0.307 2 1
62279 0.168 1 104 1.315 1.146 1 0.180 7.637 7.457 2.539 8.964 0.112 21 1.313 0.998 0.871 0.759 0.997 3 1

Manually run feature summary

While feature summary is run as a part of the quality_control() function pipeline you can run the function yourself, on any layer you wish.

# NOTE:
# outlier_udist = number of IQRs from the median at which a value is flagged.
# 1.0 here is illustrative; in practice we favour 5.0, which is the default value
# for the quality_control() function.
feature_sum1 <- feature_summary(omiprep         = mydata, 
                                source_layer    = "input", 
                                outlier_udist   = 1.0,
                                tree_cut_height = 0.5,
                                output          = "data.frame", 
                                cores           = 1)

Table of feature summary

feature_sum1 |> 
  head(n = 10) |> 
  knitr::kable( digits = 3, row.names = FALSE, align = "c") |>
  kableExtra::kable_styling(full_width = FALSE) 
feature_id missingness outlier_count n mean sd median min max range skew kurtosis se missing var disp_index coef_variance W log10_W k independent_features
48719 0.248 17 94 4.877 28.121 1 0.393 256.608 256.215 8.027 66.644 2.900 31 790.803 162.154 5.766 0.128 0.564 NA NA
43532 0.568 10 54 1.296 1.096 1 0.302 7.172 6.870 3.189 13.306 0.149 71 1.201 0.927 0.846 0.675 0.977 NA NA
46639 0.128 22 109 2.266 3.517 1 0.078 23.696 23.619 3.379 14.104 0.337 16 12.371 5.460 1.552 0.596 0.992 1 TRUE
606 0.000 27 125 0.990 0.211 1 0.006 1.461 1.455 -1.367 5.791 0.019 0 0.045 0.045 0.214 0.903 0.307 2 TRUE
62279 0.168 18 104 1.315 1.146 1 0.180 7.637 7.457 2.539 8.964 0.112 21 1.313 0.998 0.871 0.759 0.997 3 TRUE
2342 0.008 27 124 1.110 0.620 1 0.048 3.500 3.452 1.053 1.586 0.056 1 0.385 0.347 0.559 0.940 0.907 4 TRUE
53010 0.000 25 125 1.021 0.240 1 0.516 1.753 1.237 0.453 0.065 0.021 0 0.058 0.056 0.235 0.985 0.996 5 TRUE
52435 0.000 24 125 0.998 0.284 1 0.005 1.715 1.709 -0.357 1.234 0.025 0 0.081 0.081 0.285 0.979 0.440 6 TRUE
33384 0.608 11 49 7.325 19.419 1 0.248 112.333 112.085 3.911 16.323 2.774 76 377.115 51.483 2.651 0.401 0.861 NA NA
52468 0.008 24 124 1.042 0.402 1 0.005 2.519 2.515 0.628 1.323 0.036 1 0.161 0.155 0.386 0.969 0.534 7 TRUE

Run feature summary on subset

Using the sample_ids and feature_ids arguments you can run the summary for a subset of the data. Note: all rows will be return, however summary data will only be returned for the specified ids.

## define a vector of sample IDs
sids <- mydata@samples[mydata@samples$sex == "female", "sample_id"] 

## define a vector of feature IDs
fids <- mydata@features[, "feature_id"] |> sample(25)

# NOTE:
# outlier_udist = number of IQRs from the median at which a value is flagged.
# 1.0 here is illustrative; in practice we favour 5.0, which is the default value
# for the quality_control() function.
feature_sum_subset <- feature_summary(omiprep         = mydata, 
                                      source_layer    = "input", 
                                      outlier_udist   = 1.0,
                                      tree_cut_height = 0.5,
                                      sample_ids      = sids,
                                      feature_ids     = fids,
                                      output          = "data.frame",
                                      cores           = 1)

Table of feature summary for subset

feature_sum_subset |> 
  na.omit() |>
  knitr::kable( digits = 3, row.names = FALSE, align = "c") |>
  kableExtra::kable_styling(full_width = FALSE) |>
  kableExtra::scroll_box(width = "100%", height = "500px")
feature_id missingness outlier_count n mean sd median min max range skew kurtosis se missing var disp_index coef_variance W log10_W k independent_features
53010 0.000 10 65 1.017 0.273 0.981 0.516 1.753 1.237 0.533 -0.246 0.034 0 0.075 0.073 0.268 0.972 0.993 14 TRUE
62101 0.000 11 65 1.098 0.267 1.049 0.590 1.928 1.337 0.915 0.872 0.033 0 0.071 0.065 0.243 0.945 0.989 7 TRUE
52234 0.000 14 65 1.105 0.343 1.115 0.009 1.979 1.970 -0.075 0.772 0.043 0 0.117 0.106 0.310 0.986 0.490 11 TRUE
48433 0.031 13 63 1.241 0.944 0.967 0.238 5.642 5.405 2.692 8.242 0.119 2 0.891 0.718 0.761 0.694 0.965 3 FALSE
32198 0.000 8 65 1.023 0.315 1.008 0.010 1.961 1.951 0.070 1.055 0.039 0 0.099 0.097 0.308 0.979 0.494 9 TRUE
49529 0.000 9 65 1.333 0.962 1.000 0.194 4.782 4.588 1.195 1.263 0.119 0 0.925 0.694 0.722 0.888 0.982 12 TRUE
21158 0.000 8 65 1.110 0.449 1.021 0.014 2.605 2.592 0.974 1.635 0.056 0 0.201 0.182 0.404 0.931 0.617 1 FALSE
33935 0.015 14 64 1.571 1.696 0.963 0.026 10.498 10.472 2.634 9.978 0.212 1 2.876 1.830 1.079 0.743 0.961 6 TRUE
54840 0.000 13 65 1.147 0.577 1.047 0.009 3.449 3.441 1.449 3.129 0.072 0 0.333 0.291 0.503 0.892 0.667 18 TRUE
32417 0.138 13 56 1.182 0.694 0.983 0.009 3.722 3.713 1.503 2.422 0.093 9 0.482 0.408 0.587 0.863 0.708 10 TRUE
35107 0.077 13 60 1.192 0.669 0.985 0.323 4.217 3.894 2.079 5.704 0.086 5 0.448 0.376 0.561 0.808 0.982 3 TRUE
46645 0.000 11 65 1.285 0.980 0.918 0.088 4.731 4.643 1.345 1.749 0.122 0 0.961 0.748 0.763 0.879 0.980 13 TRUE
18394 0.015 8 64 1.233 1.019 0.936 0.016 4.665 4.649 1.123 1.075 0.127 1 1.038 0.842 0.826 0.900 0.923 5 TRUE
52690 0.000 10 65 1.542 1.214 1.200 0.404 6.646 6.242 2.376 5.808 0.151 0 1.475 0.956 0.787 0.710 0.962 19 TRUE
52454 0.000 16 65 1.012 0.198 1.004 0.007 1.459 1.452 -1.613 8.393 0.025 0 0.039 0.039 0.196 0.858 0.281 15 TRUE
590 0.015 7 64 1.051 0.336 1.001 0.398 2.038 1.640 0.392 -0.229 0.042 1 0.113 0.107 0.319 0.981 0.983 16 TRUE
53263 0.000 11 65 1.270 0.706 1.106 0.010 3.500 3.491 1.057 0.863 0.088 0 0.498 0.393 0.556 0.922 0.746 4 TRUE
49681 0.000 10 65 1.248 0.574 1.106 0.009 2.772 2.763 0.672 -0.155 0.071 0 0.329 0.264 0.460 0.946 0.645 1 TRUE
18357 0.000 10 65 1.084 0.953 0.832 0.065 5.368 5.303 1.986 5.212 0.118 0 0.908 0.837 0.879 0.814 0.986 2 TRUE
46518 0.154 13 55 1.068 0.605 0.987 0.010 3.902 3.892 2.373 8.435 0.082 10 0.366 0.343 0.566 0.782 0.674 20 TRUE
52012 0.169 13 54 2.239 2.853 1.025 0.457 14.021 13.563 2.443 5.639 0.388 11 8.142 3.637 1.274 0.585 0.802 8 TRUE
31675 0.015 18 64 0.975 0.436 0.953 0.284 2.324 2.040 0.590 0.231 0.055 1 0.190 0.195 0.447 0.959 0.952 17 TRUE

Additional feature_summary() attributes

## The attributes include column names, row names, and class for the feature summary table
## as well as a hierarchical cluster dendrogram or `input_tree` and the parameter values for 
## outlier_udist and input_tree_cut_height passed to the function. 
names( attributes(feature_sum1) )
#> [1] "names"                 "row.names"             "class"                
#> [4] "input_tree"            "input_outlier_udist"   "input_tree_cut_height"

hierarchical cluster dendrogram

In addition to the summary data, the hierarchical cluster dendrogram is appended to the returned data.frame as and attribute. This can be accessed with the attribute name: [source_layer]_tree, in this case we summarised the input data, therefore the attribute name is input_tree.

suppressPackageStartupMessages(library(dendextend))

## number of independent features
indfeatcount = sum( feature_sum1$independent_features, na.rm = TRUE )

# extract tree from attributes
tree <- attr(feature_sum1, 'input_tree')
dend <- stats::as.dendrogram(tree)

# color the independent features blue
metab_color       <- feature_sum1[, c("feature_id", "independent_features")]
metab_color       <- metab_color[match(labels(dend), metab_color$feature_id), ]
metab_color$color <- ifelse(metab_color$independent_features==TRUE, "#084594", "grey80")

# format dendrogram for ploting
dend <- dend |>
  dendextend::set("labels_cex", 0.75) |>
  dendextend::set("labels_col", metab_color$color) |>
  dendextend::set("branches_lwd", 1) |>
  dendextend::set("branches_k_color",  value = metab_color$color)

## plot the dendrogram
dend |> plot(main  = paste0("Feature clustering dendrogram\n# of ind. features = ",indfeatcount ))
abline(h = 0.5, col = "#E41A1C", lwd = 1.5)

Decision tree showing feature importance in dataset

Run sample & feature summaries together

# NOTE:
# outlier_udist = number of IQRs from the median at which a value is flagged.
# 1.0 here is illustrative; in practice we favour 5.0, which is the default value
# for the quality_control() function.
sf_sum <- summarise(omiprep         = mydata, 
                    source_layer    = "input", 
                    outlier_udist   = 1.0,
                    tree_cut_height = 0.5,
                    sample_ids      = sids, ## It is also possible to run on a subset of samples and/or features
                    feature_ids     = fids,
                    output          = "data.frame", 
                    cores           = 1)
#> ℹ Number of informative PCs (Scree acceleration factor): 2

Table of feature summary from summarise() function

sf_sum$feature_summary |> 
  na.omit() |>
  knitr::kable( digits = 3, row.names = FALSE, align = "c") |>
  kableExtra::kable_styling(full_width = TRUE) |>
  kableExtra::scroll_box(width = "100%", height = "500px")
feature_id missingness outlier_count n mean sd median min max range skew kurtosis se missing var disp_index coef_variance W log10_W k independent_features
53010 0.000 10 65 1.017 0.273 0.981 0.516 1.753 1.237 0.533 -0.246 0.034 0 0.075 0.073 0.268 0.972 0.993 14 TRUE
62101 0.000 11 65 1.098 0.267 1.049 0.590 1.928 1.337 0.915 0.872 0.033 0 0.071 0.065 0.243 0.945 0.989 7 TRUE
52234 0.000 14 65 1.105 0.343 1.115 0.009 1.979 1.970 -0.075 0.772 0.043 0 0.117 0.106 0.310 0.986 0.490 11 TRUE
48433 0.031 13 63 1.241 0.944 0.967 0.238 5.642 5.405 2.692 8.242 0.119 2 0.891 0.718 0.761 0.694 0.965 3 FALSE
32198 0.000 8 65 1.023 0.315 1.008 0.010 1.961 1.951 0.070 1.055 0.039 0 0.099 0.097 0.308 0.979 0.494 9 TRUE
49529 0.000 9 65 1.333 0.962 1.000 0.194 4.782 4.588 1.195 1.263 0.119 0 0.925 0.694 0.722 0.888 0.982 12 TRUE
21158 0.000 8 65 1.110 0.449 1.021 0.014 2.605 2.592 0.974 1.635 0.056 0 0.201 0.182 0.404 0.931 0.617 1 FALSE
33935 0.015 14 64 1.571 1.696 0.963 0.026 10.498 10.472 2.634 9.978 0.212 1 2.876 1.830 1.079 0.743 0.961 6 TRUE
54840 0.000 13 65 1.147 0.577 1.047 0.009 3.449 3.441 1.449 3.129 0.072 0 0.333 0.291 0.503 0.892 0.667 18 TRUE
32417 0.138 13 56 1.182 0.694 0.983 0.009 3.722 3.713 1.503 2.422 0.093 9 0.482 0.408 0.587 0.863 0.708 10 TRUE
35107 0.077 13 60 1.192 0.669 0.985 0.323 4.217 3.894 2.079 5.704 0.086 5 0.448 0.376 0.561 0.808 0.982 3 TRUE
46645 0.000 11 65 1.285 0.980 0.918 0.088 4.731 4.643 1.345 1.749 0.122 0 0.961 0.748 0.763 0.879 0.980 13 TRUE
18394 0.015 8 64 1.233 1.019 0.936 0.016 4.665 4.649 1.123 1.075 0.127 1 1.038 0.842 0.826 0.900 0.923 5 TRUE
52690 0.000 10 65 1.542 1.214 1.200 0.404 6.646 6.242 2.376 5.808 0.151 0 1.475 0.956 0.787 0.710 0.962 19 TRUE
52454 0.000 16 65 1.012 0.198 1.004 0.007 1.459 1.452 -1.613 8.393 0.025 0 0.039 0.039 0.196 0.858 0.281 15 TRUE
590 0.015 7 64 1.051 0.336 1.001 0.398 2.038 1.640 0.392 -0.229 0.042 1 0.113 0.107 0.319 0.981 0.983 16 TRUE
53263 0.000 11 65 1.270 0.706 1.106 0.010 3.500 3.491 1.057 0.863 0.088 0 0.498 0.393 0.556 0.922 0.746 4 TRUE
49681 0.000 10 65 1.248 0.574 1.106 0.009 2.772 2.763 0.672 -0.155 0.071 0 0.329 0.264 0.460 0.946 0.645 1 TRUE
18357 0.000 10 65 1.084 0.953 0.832 0.065 5.368 5.303 1.986 5.212 0.118 0 0.908 0.837 0.879 0.814 0.986 2 TRUE
46518 0.154 13 55 1.068 0.605 0.987 0.010 3.902 3.892 2.373 8.435 0.082 10 0.366 0.343 0.566 0.782 0.674 20 TRUE
52012 0.169 13 54 2.239 2.853 1.025 0.457 14.021 13.563 2.443 5.639 0.388 11 8.142 3.637 1.274 0.585 0.802 8 TRUE
31675 0.015 18 64 0.975 0.436 0.953 0.284 2.324 2.040 0.590 0.231 0.055 1 0.190 0.195 0.447 0.959 0.952 17 TRUE