--- title: "Skewness-Based Feature QC" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Skewness-Based Feature QC} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Upper limit of quantification (skewness) Some measurement platforms have upper limits of quantification which can result in skewed data with ceiling effects. In this vignette we: 1. simulate a dataset (`1000` samples x `500` features), 2. show 3 feature types (normal, left-skew, right-skew), 3. explain what skewness means, 4. run skewness filtering in `quality_control()`, 5. and visualize pre/post filtering impact. ```{r setup} library(omiprep) library(ggplot2) set.seed(20260408) ``` ## 1) Simulate Data (1000 Samples x 500 Features) We create 3 feature groups: - `normal`: approximately symmetric distributions, - `left_skew`: features with upper-end truncation (mimicking ceiling effects), - `right_skew`: positively skewed features. ```{r simulate_data} n_samples <- 1000 n_normal <- 380 n_left_skew <- 80 n_right_skew <- 40 stopifnot(n_normal + n_left_skew + n_right_skew == 500) # 1) approximately symmetric features normal_block <- matrix( rnorm(n_samples * n_normal, mean = 0, sd = 1), nrow = n_samples, ncol = n_normal ) # 2) left-skew features generated via upper-end truncation left_raw <- matrix( rnorm(n_samples * n_left_skew, mean = 1.2, sd = 0.9), nrow = n_samples, ncol = n_left_skew ) upper_limit <- 0.7 left_skew_block <- pmin(left_raw, upper_limit) # 3) right-skew features right_block <- matrix( rlnorm(n_samples * n_right_skew, meanlog = 0, sdlog = 0.55), nrow = n_samples, ncol = n_right_skew ) data <- cbind(normal_block, left_skew_block, right_block) colnames(data) <- c( paste0("normal_", sprintf("%03d", seq_len(n_normal))), paste0("left_skew_", sprintf("%03d", seq_len(n_left_skew))), paste0("right_skew_", sprintf("%03d", seq_len(n_right_skew))) ) rownames(data) <- paste0("sample_", seq_len(n_samples)) samples <- data.frame(sample_id = rownames(data)) features <- data.frame( feature_id = colnames(data), simulated_group = c( rep("normal", n_normal), rep("left_skew", n_left_skew), rep("right_skew", n_right_skew) ) ) ``` Representative distributions for one feature from each group: ```{r example_histograms, fig.width=10, fig.height=4.8} inspect_features <- c("normal_001", "left_skew_001", "right_skew_001") plot_df <- data.frame( feature = rep(inspect_features, each = n_samples), value = as.vector(data[, inspect_features, drop = FALSE]) ) ggplot(plot_df, aes(x = value)) + geom_histogram(bins = 35, fill = "#377EB8", color = "white") + facet_wrap(~feature, scales = "free", nrow = 1) + theme_classic() + labs( x = "Feature abundance value", y = "Count (samples)", title = "Example feature distributions by simulated type" ) ``` This is the key motivation: some individual **features** have visibly skewed distributions. In the above we can see how the distribution of feature `001` changes based on the simulation; we will focus on the `left_skew` data. ## 2) What skewness means (per feature) `omiprep` computes skewness **for each feature across samples** using `psych::describe()` and takes the `skew` column (default `type = 3` in `psych::skew`): \[ \mathrm{skew} = \frac{\sum_{i=1}^{n} (x_i-\bar{x})^3}{n \cdot s^3} \] where \(x_i\) are sample values for one feature, \(\bar{x}\) is the feature mean, \(s\) is the feature SD, and \(n\) is the number of non-missing sample values. Plain-language interpretation: - `skew < 0` (`left_skew`): most values are concentrated on the right side, with a tail extending to the left. - `skew > 0` (`right_skew`): most values are concentrated on the left side, with a tail extending to the right. - larger `abs(skew)` means stronger asymmetry. ### What the threshold means The threshold is a **cutoff on skewness magnitude**, not a p-value. - With `direction = "left"`, a feature is excluded if `skew <= -threshold`. - With `direction = "right"`, a feature is excluded if `skew >= threshold`. - With `direction = "both"`, a feature is excluded if `abs(skew) >= threshold`. So for `threshold = 1.25` and `direction = "left"`, features with skewness `<= -1.25` are flagged/excluded. The helper `feature_skewness()` returns one row per feature: ```{r feature_skewness_estimation} skew_df <- feature_skewness( data = data, threshold = 1.25, direction = "left" ) skew_df$simulated_group <- features$simulated_group[match(skew_df$feature_id, features$feature_id)] head(skew_df[order(skew_df$skew), ], 10) ``` Skewness distributions by simulated group; we can see that the normally distributed features are mostly around `skew = 0`, the left-skewed features have negative skewness values (many below `-1.25`), and the right-skewed features have positive skewness values. ```{r skew_by_group_plot, fig.width=9.5, fig.height=4.8} ggplot(skew_df, aes(x = simulated_group, y = skew, fill = simulated_group)) + geom_boxplot(outlier.alpha = 0.25) + theme_classic() + scale_fill_manual(values = c( normal = "#9ECAE1", left_skew = "#E41A1C", right_skew = "#4DAF4A" )) + guides(fill = "none") + labs( x = "Simulated feature type", y = "Skewness value (per feature)", title = "Feature-level skewness by simulated group" ) ``` ## 3) Apply a skewness filtering rule To keep this vignette fast while using `1000 x 500` data, we apply the same feature-level rule directly: ```{r apply_skew_rule} skew_rule <- feature_skewness( data = data, threshold = 1.25, direction = "left" ) excluded_skew <- skew_rule$feature_id[skew_rule$exclude_by_skewness %in% TRUE] ``` How many features are excluded by skewness? ```{r excluded_features} length(excluded_skew) head(excluded_skew, 20) ``` Exclusions by simulated feature type: ```{r excluded_summary} excluded_tbl <- data.frame(feature_id = excluded_skew) excluded_tbl$simulated_group <- features$simulated_group[match(excluded_tbl$feature_id, features$feature_id)] table(excluded_tbl$simulated_group, useNA = "ifany") ``` ```{r qc_call_reference, eval=TRUE} m <- Omiprep(data = data, samples = samples, features = features) m_qc <- quality_control( omiprep = m, source_layer = "input", sample_missingness = 0.2, feature_missingness = 0.2, feature_skewness_threshold = 1.25, feature_skewness_direction = "left", total_sum_abundance_sd = NA, outlier_udist = 5, pc_outlier_sd = NA, cores = 2, fast = FALSE ) ``` ## 4) Post-filtering impact on distributions All plots below are feature-level summaries (counting features), except where explicitly stated. ```{r post_filter_prepare} included_features <- setdiff(colnames(data), excluded_skew) pre_skew <- feature_skewness(data, threshold = NULL) post_matrix <- data[, included_features, drop = FALSE] post_skew <- feature_skewness(post_matrix, threshold = NULL) pre_skew$stage <- factor("Pre-filter", levels = c("Pre-filter", "Post-filter")) post_skew$stage <- factor("Post-filter", levels = c("Pre-filter", "Post-filter")) skew_compare <- rbind( pre_skew[, c("feature_id", "skew", "stage")], post_skew[, c("feature_id", "skew", "stage")] ) ``` ### 4a) Feature-skewness distribution before vs after filtering ```{r post_filter_skew_hist, fig.width=10, fig.height=4.8} params_used <- attributes(m_qc@data) skewness_threshold <- ifelse(params_used$qc_feature_skewness_direction=="left", -1 * params_used$qc_feature_skewness_threshold, 1 * params_used$qc_feature_skewness_threshold) ggplot(skew_compare, aes(x = skew, fill = stage)) + geom_histogram(bins = 30, color = "white", alpha = 0.85) + facet_wrap(~stage, nrow = 1) + geom_vline(xintercept = 0, color = "#4D4D4D", linetype = "dashed") + geom_vline(xintercept = skewness_threshold, color = "#E41A1C", linetype = "dashed", linewidth = 1) + theme_classic() + scale_fill_manual(values = c("Pre-filter" = "#9ECAE1", "Post-filter" = "#3182BD")) + guides(fill = "none") + labs( x = "Skewness value (per feature)", y = "Count (features)", title = "Feature-level skewness distribution before and after filtering", subtitle = paste0("Red dashed line: left-skew exclusion threshold = ", skewness_threshold) ) ``` ### 4b) Feature retention by simulated type ```{r post_filter_group_counts, fig.width=9, fig.height=4.8} pre_counts <- as.data.frame(table(features$simulated_group), stringsAsFactors = FALSE) names(pre_counts) <- c("simulated_group", "n") pre_counts$stage <- factor("Pre-filter", levels = c("Pre-filter", "Post-filter")) post_counts <- as.data.frame(table(features$simulated_group[features$feature_id %in% included_features]), stringsAsFactors = FALSE) names(post_counts) <- c("simulated_group", "n") post_counts$stage <- factor("Post-filter", levels = c("Pre-filter", "Post-filter")) counts_df <- rbind(pre_counts, post_counts) counts_df$simulated_group <- factor(counts_df$simulated_group, levels = c("normal", "left_skew", "right_skew")) ggplot(counts_df, aes(x = simulated_group, y = n, fill = stage)) + geom_col(position = position_dodge(width = 0.75), width = 0.65) + theme_classic() + scale_fill_manual(values = c("Pre-filter" = "#9ECAE1", "Post-filter" = "#3182BD"), drop = FALSE) + labs( x = "Simulated feature type", y = "Count (features)", fill = "Stage", title = "Feature retention by simulated type after skewness filtering" ) ``` ### 4c) Skewness profiles by feature type (pre vs post) ```{r post_filter_group_skew_boxplot, fig.width=10, fig.height=5} pre_plot <- merge(pre_skew, features, by = "feature_id", all.x = TRUE) post_plot <- merge(post_skew, features, by = "feature_id", all.x = TRUE) box_df <- rbind( data.frame(stage = "Pre-filter", simulated_group = pre_plot$simulated_group, skew = pre_plot$skew), data.frame(stage = "Post-filter", simulated_group = post_plot$simulated_group, skew = post_plot$skew) ) box_df$stage <- factor(box_df$stage, levels = c("Pre-filter", "Post-filter")) box_df$simulated_group <- factor(box_df$simulated_group, levels = c("normal", "left_skew", "right_skew")) ggplot(box_df, aes(x = simulated_group, y = skew, fill = stage)) + geom_boxplot(outlier.alpha = 0.25, position = position_dodge(width = 0.75)) + geom_hline(yintercept = -skewness_threshold, color = "#E41A1C", linetype = "dashed") + theme_classic() + scale_fill_manual(values = c("Pre-filter" = "#9ECAE1", "Post-filter" = "#3182BD")) + labs( x = "Simulated feature type", y = "Skewness value (per feature)", title = "Shift in feature-level skewness profiles after filtering" ) ```