name: omics-analysis-guide
description: Three-tiered approach to omics data analysis (transcriptomics, proteomics) covering validated pipelines, standard workflows, and custom methods
license: open
Omics Data Analysis Guide: Three-Tiered Approach
Metadata
Short Description: Comprehensive guide for analyzing omics data (transcriptomics, proteomics) using validated pipelines, standard workflows, or custom analysis methods.
Authors: HITS
Version: 1.0
Last Updated: December 2025
License: CC BY 4.0
Commercial Use: Allowed
Citations and Acknowledgments
If you use validated pipelines or tools (Option 1):
- Citation: Always cite the original publication associated with each tool or pipeline
- Acknowledgment: Cite the specific tools and methods used in your analysis
If you use standard workflows (Option 2):
- Acknowledgment Statement: "Analysis performed using standard omics data analysis workflows and best practices"
- Citation for RNA-seq analysis: Dobin A, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15-21. PMID: 23104886
- Citation for proteomics: Cox J, Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol. 2008;26(12):1367-72. PMID: 19029910
Overview
This guide provides a three-tiered approach to omics data analysis, prioritizing validated pipelines and standard workflows before moving to custom analysis. Always start with Option 1 and proceed to subsequent options only if needed.
The guide covers:
- Transcriptomics: Bulk RNA-seq
- Proteomics: Pre-quantified protein abundance data (similar to bulk RNA-seq analysis)
Note: This guide focuses on analysis of already-quantified data. For raw data processing (alignment, quantification), refer to specialized tools and pipelines.
Key Concepts
Validated Pipeline vs. Standard Workflow vs. Custom Analysis
A validated pipeline is a specific tool with peer-reviewed benchmarking data demonstrating performance on data like yours (e.g., DESeq2 for RNA-seq counts, MaxQuant for label-free proteomics). A standard workflow is the canonical sequence of QC → normalization → statistical test → multiple-testing correction assembled from accepted community practice but tuned to your specific dataset. Custom analysis is bespoke statistical or computational modeling required when neither prior tier covers the data type or research question. The progression Option 1 → Option 2 → Option 3 trades reproducibility for flexibility — always exhaust earlier tiers first.
Missing Value Mechanisms (MCAR / MAR / MNAR)
Missing data in omics arises from three distinct mechanisms with different correct treatments. MCAR (Missing Completely At Random) means missingness is independent of any value — safe to impute with mean, median, or KNN. MAR (Missing At Random) means missingness depends on observed variables but not the unobserved value — KNN or model-based imputation is appropriate. MNAR (Missing Not At Random) means missingness depends on the missing value itself, typical in proteomics where low-abundance proteins drop below detection — requires left-censored imputation (minprob/QRILC) below the detection limit. Choosing the wrong mechanism systematically biases downstream statistics.
Test Assumptions and Test Selection
Parametric tests (Student's t-test, Welch's t-test) assume approximate normality and (for Student's) equal variances; they have higher power than non-parametric tests when assumptions hold. Non-parametric tests (Mann-Whitney U, permutation) make weaker assumptions and remain valid under skewed distributions or small n, at the cost of statistical power. The choice depends on sample size per group (n < 10 favors non-parametric), normality (Shapiro-Wilk / Anderson-Darling at the feature level), variance homogeneity (Levene's test), and outlier prevalence.
Multiple Testing Correction
Omics analyses test thousands of features simultaneously. Without correction, the expected number of false positives at α=0.05 across 20,000 genes is 20,000 × 0.05 = 1,000 when no gene is truly differential. Family-wise error rate (FWER) corrections like Bonferroni control the probability of any false positive but are conservative. False discovery rate (FDR) corrections like Benjamini-Hochberg control the expected proportion of false positives among reported significant features and are the standard for omics. Always report adjusted p-values, never raw p-values, when calling significance.
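To make the Benjamini-Hochberg adjustment concrete, here is a minimal NumPy sketch of the procedure. The function name `benjamini_hochberg` and the five example p-values are illustrative only; in practice a library routine would normally be used.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of p-values, smallest first
    scaled = p[order] * m / np.arange(1, m + 1)    # p_(i) * m / i for rank i
    # Enforce monotonicity: each adjusted value is the minimum of itself
    # and every scaled value at a larger rank.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

raw = [0.001, 0.008, 0.039, 0.041, 0.60]           # hypothetical raw p-values
print(benjamini_hochberg(raw))
```

In this toy input the raw p-values 0.039 and 0.041 both fall below 0.05, but their adjusted values (0.05125) do not, which is the kind of borderline call that FDR correction is meant to remove.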
Decision Framework
Use this tree to choose the right analysis tier for your data:
- Have you searched for a validated pipeline matching your data type?
  - NO: Run Method 1 (literature) AND Method 2 (consortia workflows) FIRST
  - YES: Did you find a validated pipeline with benchmarks matching your data type and biological question?
    - YES: OPTION 1: Use validated pipeline (e.g., DESeq2, edgeR, MaxQuant)
    - NO: Is your data a common type (RNA-seq counts, pre-quantified proteomics)?
      - YES: OPTION 2: Standard workflow (QC → norm → test → FDR)
      - NO: OPTION 3: Custom analysis (consult statistician; document thoroughly)
Decision Table
| Data type | Sample size | Has validated pipeline? | Recommended tier | Specific tool / approach |
|---|---|---|---|---|
| Bulk RNA-seq counts | n ≥ 3/group | Yes (DESeq2, edgeR) | Option 1 | DESeq2 (negative binomial; call significance at padj < 0.05, noting that results() defaults to alpha = 0.1) |
| Pre-quantified proteomics, normal-distributed | n ≥ 5/group | Sometimes | Option 1 if pipeline matches; else Option 2 | limma or t-test + BH-FDR |
| Pre-quantified proteomics, MNAR-heavy | n ≥ 5/group | No (mechanism-specific) | Option 2 | minprob imputation → t-test or Mann-Whitney → BH-FDR |
| Small-cohort omics (n < 5) | n < 5 | Rarely | Option 2 with caution | Permutation test, report effect sizes; flag results as preliminary |
| Multi-omics integration | Variable | Limited | Option 3 | MOFA, DIABLO, or custom Bayesian model |
| Novel data type (e.g., spatial multi-omics) | Variable | No | Option 3 | Build from first principles; cross-validate |
| Time-series omics | n per timepoint | Sometimes (maSigPro, ImpulseDE2) | Option 1 if available; else Option 3 | maSigPro for transcriptomics; custom for proteomics |
Option 1: Search for Validated Analysis Methods (Recommended First)
1.1 Search for Validated Analysis Pipelines
IMPORTANT: You MUST complete BOTH Method 1 AND Method 2 before proceeding to Option 2. Do not skip Method 2 even if Method 1 finds no results.
Method 1: Literature Search for Best Practices
Search for validated analysis methods using web search tools or literature databases (PubMed, Google Scholar).
Search queries to try (use multiple):
"[DATA_TYPE]" "[ANALYSIS_TYPE]" validated pipeline best practices
"[DATA_TYPE]" analysis workflow "[ORGANISM]" published
"[DATA_TYPE]" "[TOOL_NAME]" validation benchmark comparison
Example for bulk RNA-seq:
"RNA-seq" "differential expression" validated pipeline human
"DESeq2" "edgeR" comparison validation RNA-seq
Example for proteomics:
"proteomics" "differential abundance" analysis validated methods
"proteomics" normalization imputation best practices
What to search for in results:
- Published papers with validated analysis pipelines
- Benchmark studies comparing different tools
- Best practices guides from major consortia (e.g., ENCODE, TCGA)
- Tool documentation with validation data
IMPORTANT: Spend adequate time searching literature. Look through at least the first 10-15 search results and check supplementary materials of relevant papers.
Method 2: Review Standard Analysis Workflows
Review established workflows from major consortia and publications:
- ENCODE RNA-seq analysis pipeline
- TCGA analysis protocols
- Published benchmark studies
What to Do with Results:
If you find validated pipelines or methods:
- Record the pipeline/method name and version
- Note the reference: Record the publication DOI/PubMed ID
- Record validation details: Benchmark results, recommended parameters, any limitations
- Document the workflow: Step-by-step analysis procedure
Example result format:
Data Type: Bulk RNA-seq
Analysis Goal: Differential expression
Pipeline: DESeq2 (v1.40.0)
Reference: Love MI, et al. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. PMID: 25516281
Validation: Validated in multiple benchmark studies, recommended for count data
Parameters: Default parameters, FDR < 0.05, |log2FC| > 1
If neither Method 1 nor Method 2 turns up a validated pipeline, and only then, proceed to Option 2: Use Standard Workflows.
Option 2: Use Standard Analysis Workflows
When to Use This Option?
- No validated pipelines found for your specific data type
- Need to build a custom workflow from standard components
- Working with common data types (RNA-seq, proteomics)
- Want to follow community best practices
2.1 Overview of Standard Workflows
RNA-seq (Bulk):
- Quality control
- Normalization and filtering (if count data, use DESeq2/edgeR normalization)
- Statistical analysis (differential expression)
- Multiple testing correction
- Functional enrichment (optional)
Proteomics (Pre-quantified):
- Quality control
- Missing value assessment and imputation
- Normalization
- Batch correction (if needed)
- Statistical analysis (differential abundance)
- Multiple testing correction
2.2 Essential Quality Control Steps
CRITICAL: Quality control must be performed before any statistical analysis. Poor data quality will lead to unreliable results regardless of statistical methods used.
Sample-Level Quality Control
Check for outlier samples:
- Use PCA + Isolation Forest to detect outlier samples
- Standardize data, perform PCA, then apply Isolation Forest
- Remove or investigate samples identified as outliers
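The outlier check above can be sketched with scikit-learn. This is one possible arrangement, not a prescribed implementation: the function name `flag_outlier_samples`, the choice of two principal components, and the simulated matrix are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

def flag_outlier_samples(data, n_components=2, seed=0):
    """data: samples x features array with no missing values.
    Returns a boolean array that is True where a sample is flagged as an outlier."""
    scaled = StandardScaler().fit_transform(data)             # z-score each feature
    scores = PCA(n_components=n_components).fit_transform(scaled)
    forest = IsolationForest(n_estimators=200, random_state=seed)
    labels = forest.fit_predict(scores)                       # -1 = outlier, 1 = inlier
    return labels == -1

# Simulated example: 20 samples x 200 features of noise, one shifted sample.
rng = np.random.default_rng(0)
demo = rng.normal(size=(20, 200))
demo[7] += 8.0                                                # planted outlier
print(np.flatnonzero(flag_outlier_samples(demo)))
```

With the default contamination setting, Isolation Forest may also flag a few borderline samples, so flagged samples are candidates for inspection rather than automatic removal.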
Check sample correlation:
- Calculate correlation matrix between samples
- Low correlations (< 0.5) may indicate poor quality samples
- Remove samples with consistently low correlation
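A small NumPy sketch of the correlation check follows. Reading "consistently low" as "median correlation with all other samples below the threshold" is an assumption, as are the function name `low_correlation_samples` and the simulated data.

```python
import numpy as np

def low_correlation_samples(data, threshold=0.5):
    """data: samples x features array. Returns indices of samples whose median
    Pearson correlation with the other samples is below `threshold`."""
    corr = np.corrcoef(data)               # samples x samples correlation matrix
    np.fill_diagonal(corr, np.nan)         # ignore each sample's self-correlation
    median_corr = np.nanmedian(corr, axis=1)
    return np.flatnonzero(median_corr < threshold)

# Simulated example: 12 samples share an abundance profile, sample 4 does not.
rng = np.random.default_rng(1)
feature_means = rng.normal(20, 2, size=300)
demo = feature_means + rng.normal(0, 0.5, size=(12, 300))
demo[4] = rng.normal(20, 2, size=300)
print(low_correlation_samples(demo))
```

Using the median rather than the minimum keeps one bad sample from dragging down the score of every good sample it is compared with.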
Check for batch effects:
- Use PCA + silhouette score to assess batch separation
- If silhouette score > 0.3, strong batch effect detected
- Apply batch correction (ComBat or similar) if needed
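The batch check can be sketched as a silhouette score of the batch labels in PCA space. The helper name `batch_silhouette`, the use of two components, and the simulated batches are assumptions for illustration; the 0.3 cutoff is the one given above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def batch_silhouette(data, batch_labels, n_components=2):
    """Silhouette score of batch labels in PCA space (data: samples x features)."""
    scaled = StandardScaler().fit_transform(data)
    scores = PCA(n_components=n_components).fit_transform(scaled)
    return silhouette_score(scores, batch_labels)

# Simulated example: the same noise matrix with and without a batch offset.
rng = np.random.default_rng(2)
batches = np.array(["A"] * 10 + ["B"] * 10)
clean = rng.normal(size=(20, 200))
shifted = clean.copy()
shifted[batches == "B"] += 2.0             # constant offset on every feature in batch B

for name, matrix in [("no batch effect", clean), ("batch effect", shifted)]:
    score = batch_silhouette(matrix, batches)
    print(f"{name}: silhouette = {score:.2f}, correct for batch = {score > 0.3}")
```

If batch is confounded with the biological condition, a high score cannot distinguish the two, so the design should be checked before any correction is applied.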
Feature-Level Quality Control
Assess missing value patterns:
- Calculate missing percentage per feature and per sample
- Test correlation between mean intensity and missingness
- Determine mechanism: MCAR (Missing Completely At Random), MAR (Missing At Random), or MNAR (Missing Not At Random)
  - MNAR: Low intensity -> more missing (common in proteomics)
  - MAR: Missingness tracks an observed variable (e.g., batch) rather than the missing value itself
  - MCAR: No relationship between intensity and missingness
Check feature detection consistency:
- Count how many features are detected in at least a minimum number of samples
- Filter out features detected in < 50% of samples (adjustable threshold)
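The feature-level checks above might look like the following sketch. It assumes a features x samples matrix of log intensities with NaN marking missing values; the function name `missingness_summary`, the use of Spearman correlation, and the simulated left-censored data are illustrative choices.

```python
import numpy as np
from scipy.stats import spearmanr

def missingness_summary(data, min_detected_fraction=0.5):
    """data: features x samples array of log intensities, NaN = missing.
    Returns per-feature and per-sample missing fractions, the Spearman correlation
    between mean observed intensity and missingness, and a keep mask."""
    missing = np.isnan(data)
    miss_per_feature = missing.mean(axis=1)
    miss_per_sample = missing.mean(axis=0)
    has_values = miss_per_feature < 1.0    # skip features with no observed value
    mean_intensity = np.nanmean(data[has_values], axis=1)
    rho, _ = spearmanr(mean_intensity, miss_per_feature[has_values])
    keep = (1.0 - miss_per_feature) >= min_detected_fraction
    return miss_per_feature, miss_per_sample, rho, keep

# Simulated MNAR-like data: values below a detection limit of 23 are lost.
rng = np.random.default_rng(3)
true_means = rng.uniform(20, 30, size=300)
demo = true_means[:, None] + rng.normal(0, 1.0, size=(300, 12))
demo[demo < 23] = np.nan

_, _, rho, keep = missingness_summary(demo)
print(f"Spearman rho = {rho:.2f}; features kept: {keep.sum()} of {keep.size}")
```

A clearly negative correlation points toward MNAR, while a value near zero is consistent with MCAR. No numeric cutoff is implied here, so the call remains a judgment made alongside the missingness plots.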
2.3 Preprocessing Steps
Missing Value Imputation
CRITICAL: Choose imputation method based on missing value mechanism:
- MNAR (Missing Not At Random): Use minimum probability imputation (minprob)
  - Impute values below the detection limit by drawing from a down-shifted normal distribution
  - Parameters: downshift=1.8, width=0.3 (both in units of the observed standard deviation; these are the Perseus defaults)
- MCAR/MAR (Missing Completely At Random / Missing At Random): Use KNN imputation
  - Use k-nearest neighbors (default: k=5) to impute missing values
  - More robust than mean/median imputation
- Simple methods (if few missing values):
  - Mean imputation: Replace with feature mean
  - Median imputation: Replace with feature median
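As a rough illustration of the MNAR branch, the sketch below draws each sample's missing values from a narrow normal distribution shifted below that sample's observed values, using the downshift and width parameters listed above. Treating this down-shifted draw as what is meant by minprob here is an assumption, and the function name `impute_left_censored` is hypothetical.

```python
import numpy as np

def impute_left_censored(data, downshift=1.8, width=0.3, seed=0):
    """data: features x samples log-intensity array with NaN for missing values.
    Missing values in each sample are drawn from
    N(mean - downshift * sd, (width * sd)^2), with mean and sd taken from
    that sample's observed values."""
    rng = np.random.default_rng(seed)      # fixed seed keeps the imputation reproducible
    imputed = data.copy()
    for j in range(data.shape[1]):
        column = data[:, j]
        missing = np.isnan(column)
        mu = np.nanmean(column)
        sd = np.nanstd(column, ddof=1)
        imputed[missing, j] = rng.normal(mu - downshift * sd, width * sd, missing.sum())
    return imputed

# Simulated example: intensities below 23 are censored, then imputed.
rng = np.random.default_rng(4)
demo = rng.normal(25, 2, size=(500, 6))
demo[demo < 23] = np.nan
filled = impute_left_censored(demo)
print(f"missing before: {np.isnan(demo).sum()}, after: {np.isnan(filled).sum()}")
```

For the MCAR/MAR branch, scikit-learn's `KNNImputer(n_neighbors=5)` is one commonly used option.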
Normalization
For RNA-seq count data: Normalization is typically handled by DESeq2/edgeR (size factors).
For proteomics/continuous data:
- Median normalization: Scale each sample to global median
- Quantile normalization: Make distributions identical across samples
- Z-score normalization: Standardize to mean=0, std=1
- Total intensity normalization: Scale to total intensity
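Two of the options above, median and quantile normalization, can be sketched in NumPy as follows. The sketch assumes log-scale intensities in a features x samples matrix (so "scaling" to the global median becomes an additive shift), and the quantile version assumes a complete matrix with no ties. Function names are illustrative.

```python
import numpy as np

def median_normalize(data):
    """Shift each sample (column) so that its median equals the global median."""
    sample_medians = np.nanmedian(data, axis=0)
    return data - sample_medians + np.nanmedian(data)

def quantile_normalize(data):
    """Map every sample (column) onto the same reference distribution.
    Assumes no missing values; ties are not handled specially."""
    ranks = np.argsort(np.argsort(data, axis=0), axis=0)   # rank of each value in its column
    reference = np.sort(data, axis=0).mean(axis=1)         # mean value at each rank
    return reference[ranks]

# Simulated example: four samples with different offsets and spreads.
rng = np.random.default_rng(6)
demo = rng.normal(loc=[20, 21, 19, 22], scale=[1, 1.5, 0.8, 1.2], size=(200, 4))
print(np.median(median_normalize(demo), axis=0))
print(np.sort(quantile_normalize(demo), axis=0)[:3])
```

Median normalization only aligns location, whereas quantile normalization also forces identical spread and shape, which can erase real global differences between groups.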
2.4 Statistical Analysis: Choosing the Right Test
CRITICAL: Always check statistical test assumptions before performing analysis. Using the wrong test can lead to incorrect conclusions.
Step 1: Check Test Assumptions
Key checks to perform:
- Normality test:
  - Use Shapiro-Wilk test for n < 50, Anderson-Darling for n >= 50
  - Test a random subset of features (e.g., 100 features) for speed
  - If >=70% of the tested features pass, treat the data as approximately normal
- Variance homogeneity test:
  - Use Levene's test to check equal variances
  - If >=70% of features have equal variances, assume equal variance
- Sample size check (n per group):
  - n < 5: Very small, results unreliable
  - n < 10: Small, prefer non-parametric tests
  - n >= 10: Can use parametric tests if assumptions met
- Outlier check:
  - Calculate z-scores per feature, flag values with |z| > 3
  - With 10 or fewer values per feature, |z| can never exceed 3, so use a robust score (median and MAD) for small groups
  - If >5% outliers, prefer non-parametric tests
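The normality and variance checks might be run on a feature subset along the lines of the sketch below. It covers only the Shapiro-Wilk branch (groups smaller than 50) and counts a feature as normal when both groups pass; those choices, the helper name `assumption_fractions`, and the simulated data are assumptions.

```python
import numpy as np
from scipy import stats

def assumption_fractions(group_a, group_b, n_features=100, alpha=0.05, seed=0):
    """group_a, group_b: features x samples arrays (no NaN) for the two groups.
    Returns (fraction of sampled features passing Shapiro-Wilk in both groups,
             fraction passing Levene's test for equal variances)."""
    rng = np.random.default_rng(seed)
    total = group_a.shape[0]
    chosen = rng.choice(total, size=min(n_features, total), replace=False)
    normal = equal_var = 0
    for i in chosen:
        a, b = group_a[i], group_b[i]
        _, p_a = stats.shapiro(a)
        _, p_b = stats.shapiro(b)
        _, p_lev = stats.levene(a, b)
        normal += (p_a > alpha) and (p_b > alpha)
        equal_var += p_lev > alpha
    return normal / chosen.size, equal_var / chosen.size

# Simulated example: well-behaved data versus strongly skewed data.
rng = np.random.default_rng(7)
gauss_a, gauss_b = rng.normal(size=(300, 15)), rng.normal(size=(300, 15))
skew_a, skew_b = rng.lognormal(size=(300, 15)), rng.lognormal(size=(300, 15))
print("normal data:", assumption_fractions(gauss_a, gauss_b))
print("skewed data:", assumption_fractions(skew_a, skew_b))
```

The returned fractions are then compared with the 70% cutoffs listed above to decide which test family to use.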
Test selection logic:
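- n < 5 per group: Permutation test or Mann-Whitney U test. Exact p-values are coarse at this size: with n = 3 per group the smallest possible two-sided p-value is 2/20 = 0.1, so no feature can reach p_adj < 0.05. Report effect sizes and treat results as preliminary.
- 5 <= n < 10 per group: Mann-Whitney U test (non-parametric)
- n >= 10, normal + equal variance: Student's t-test
- n >= 10, normal + unequal variance: Welch's t-test
- n >= 10, non-normal or >5% outliers: Mann-Whitney U test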
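The selection rules can be written as a small decision function. The function names and returned labels below are hypothetical. `smallest_exact_p` is an added helper that shows the smallest two-sided p-value an exact rank or permutation test can produce with equal group sizes, which is worth checking before running very small cohorts.

```python
from math import comb

def select_test(n_per_group, is_normal, equal_variance, outlier_fraction=0.0):
    """Apply the selection rules in order; the first matching rule wins."""
    if n_per_group < 5:
        return "permutation"
    if n_per_group < 10:
        return "mann-whitney"
    if not is_normal or outlier_fraction > 0.05:
        return "mann-whitney"
    return "student-t" if equal_variance else "welch-t"

def smallest_exact_p(n_per_group):
    """Smallest two-sided p-value of an exact two-group rank or permutation test
    with equal group sizes: 2 / C(2n, n)."""
    return 2 / comb(2 * n_per_group, n_per_group)

for n in (3, 4, 5, 8, 12):
    print(n, select_test(n, is_normal=True, equal_variance=False), round(smallest_exact_p(n), 4))
```

With three samples per group the floor is 0.1, so no feature can pass a 0.05 threshold even before multiple-testing correction.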
Step 2: Perform Statistical Test
Implementation steps:
- For each feature:
  - Extract values for group1 and group2
  - Remove NaN values
  - Calculate group means and the log2 fold change (for log2-transformed data this is the difference of the group means, not their ratio)
  - Perform selected test (t-test, Welch's t-test, or Mann-Whitney U)
  - Record statistic and p-value
- Apply FDR correction:
  - Use Benjamini-Hochberg procedure (FDR_BH)
  - Adjust p-values for multiple testing
  - Mark features with p_adj < 0.05 as significant
Key libraries:
- scipy.stats: Statistical tests (ttest_ind, mannwhitneyu, shapiro, levene)
- statsmodels.stats.multitest: FDR correction (multipletests with method='fdr_bh')
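Putting the steps together, a per-feature test followed by Benjamini-Hochberg correction could be sketched as below. The function name `differential_test`, the test labels, the direction of the fold change (group2 minus group1), and the simulated data are assumptions. The input is taken to be log2-scale, so the fold change is a difference of means.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def differential_test(group1, group2, test="welch-t", alpha=0.05):
    """group1, group2: features x samples arrays of log2 values (NaN allowed).
    Returns log2 fold change, raw p-values, BH-adjusted p-values, significance flags."""
    log2fc = np.nanmean(group2, axis=1) - np.nanmean(group1, axis=1)
    pvals = np.empty(group1.shape[0])
    for i, (a, b) in enumerate(zip(group1, group2)):
        a, b = a[~np.isnan(a)], b[~np.isnan(b)]            # drop missing values
        if test == "mann-whitney":
            pvals[i] = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
        else:                                              # "student-t" or "welch-t"
            pvals[i] = stats.ttest_ind(a, b, equal_var=(test == "student-t")).pvalue
    significant, p_adj, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return log2fc, pvals, p_adj, significant

# Simulated example: 500 features, 10 vs 10 samples, first 25 truly changed.
rng = np.random.default_rng(5)
control = rng.normal(20, 1, size=(500, 10))
treated = rng.normal(20, 1, size=(500, 10))
treated[:25] += 4.0
log2fc, pvals, p_adj, significant = differential_test(control, treated)
print(f"{significant.sum()} of {significant.size} features pass p_adj < 0.05")
```

The test is passed in explicitly so that it can come from the assumption checks rather than being hard-coded.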
2.5 Visualization
Volcano Plot:
- X-axis: Log2 fold change
- Y-axis: -Log10 adjusted p-value
- Color by significance: Upregulated (red), Downregulated (blue), Not significant (gray)
- Add threshold lines for fold change and p-value
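A possible sketch of the volcano plot follows. The helper names, the default thresholds (|log2FC| > 1, p_adj < 0.05), and the output path are illustrative assumptions. matplotlib is imported inside the plotting function so that the classification step can be used without it.

```python
import numpy as np

def classify_features(log2fc, p_adj, fc_threshold=1.0, alpha=0.05):
    """Label each feature 'up', 'down', or 'ns' (not significant)."""
    log2fc, p_adj = np.asarray(log2fc, dtype=float), np.asarray(p_adj, dtype=float)
    labels = np.full(log2fc.shape, "ns", dtype="<U4")
    significant = p_adj < alpha
    labels[significant & (log2fc > fc_threshold)] = "up"
    labels[significant & (log2fc < -fc_threshold)] = "down"
    return labels

def volcano_plot(log2fc, p_adj, path="volcano.png", fc_threshold=1.0, alpha=0.05):
    """Save a volcano plot coloured by the labels from classify_features."""
    import matplotlib
    matplotlib.use("Agg")                      # file output only, no display needed
    import matplotlib.pyplot as plt
    log2fc, p_adj = np.asarray(log2fc, dtype=float), np.asarray(p_adj, dtype=float)
    labels = classify_features(log2fc, p_adj, fc_threshold, alpha)
    neg_log_p = -np.log10(np.clip(p_adj, 1e-300, 1.0))   # avoid log10(0)
    fig, ax = plt.subplots()
    for name, color in {"up": "red", "down": "blue", "ns": "gray"}.items():
        mask = labels == name
        ax.scatter(log2fc[mask], neg_log_p[mask], c=color, s=8, label=name)
    ax.axhline(-np.log10(alpha), linestyle="--", color="black")
    ax.axvline(fc_threshold, linestyle="--", color="black")
    ax.axvline(-fc_threshold, linestyle="--", color="black")
    ax.set_xlabel("log2 fold change")
    ax.set_ylabel("-log10 adjusted p-value")
    ax.legend()
    fig.savefig(path, dpi=150)
    plt.close(fig)

demo_fc = np.array([2.3, -1.7, 1.5, 0.2])
demo_padj = np.array([0.001, 0.02, 0.40, 0.0005])
print(classify_features(demo_fc, demo_padj).tolist())   # ['up', 'down', 'ns', 'ns']
```

Calling `volcano_plot(demo_fc, demo_padj)` would write the figure to `volcano.png` in the working directory.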
PCA Plot (for quality control):
- Standardize data, perform PCA
- Plot PC1 vs PC2
- Label samples, check for outliers and batch effects
2.6 What to Do with Results
Once you have completed the standard workflow:
- Document all steps and parameters used
- Save intermediate results for reproducibility
- Validate results using independent methods when possible
- Report key findings with appropriate statistics
If standard workflows don't meet your needs: Proceed to Option 3: Custom Analysis
Option 3: Custom Analysis Methods (Last Resort)
When to Use This Option?
- Novel data type not covered by standard workflows
- Specialized research questions requiring custom approaches
- Integration of multiple omics data types
- Advanced statistical modeling requirements
3.1 General Principles
Essential Requirements:
- Data Quality: Ensure high-quality data before custom analysis
  - Perform all QC steps from Option 2
  - Remove outliers and batch effects
  - Validate technical replicates
- Statistical Rigor:
  - Always check test assumptions before analysis
  - Use appropriate statistical tests for your data distribution
  - Apply multiple testing correction (FDR)
  - Validate assumptions
- Reproducibility:
  - Document all steps and parameters
  - Use version control for code
  - Save intermediate results
  - Provide seed values for random processes
- Validation:
  - Cross-validation when applicable
  - Independent validation set if available
  - Compare with known results when possible
Best Practices:
- Start simple: Begin with basic analyses before complex methods
- Validate assumptions: Test normality, independence, etc.
- Use appropriate transformations: Log transform if needed
- Consider biological context: Interpret results in light of known biology
- Consult literature: Review similar studies for guidance
Quick Start Examples
Example 1: Bulk RNA-seq Differential Expression Analysis
Step 1: Quality Control
- Check for outlier samples using PCA + Isolation Forest
- Check sample correlation matrix
- Remove low-quality samples
Step 2: For RNA-seq count data, use DESeq2 (typically in R)
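library(DESeq2)
# count_matrix: genes x samples raw integer counts; rows of sample_metadata must match its columns
dds <- DESeqDataSetFromMatrix(countData = count_matrix, colData = sample_metadata, design = ~ condition)
dds <- dds[rowSums(counts(dds)) >= 10, ]  # drop very low-count genes before fitting
dds <- DESeq(dds)
# alpha sets the FDR target used for independent filtering (results() defaults to 0.1)
res <- results(dds, contrast = c("condition", "treatment", "control"), alpha = 0.05)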
Step 3: Functional Enrichment (optional)
- Use GSEA or GO enrichment tools (gseapy, etc.)
- Prepare ranked gene list from log2FC
- Run enrichment analysis
Example 2: Proteomics Differential Abundance Analysis
Step 1: Quality Control
- Check for outlier samples
- Assess missing values (determine mechanism: MCAR, MAR, or MNAR)
Step 2: Impute Missing Values
- If MNAR: Use minprob imputation
- If MCAR/MAR: Use KNN imputation
Step 3: Normalization
- Apply median or quantile normalization
Step 4: Check for Batch Effects
- Assess using PCA + silhouette score
- Apply batch correction if needed (ComBat or similar)
Step 5: Differential Abundance Analysis
- Check test assumptions (normality, variance, sample size)
- Select appropriate test (auto-select based on assumptions)
- Perform test, apply FDR correction
- Filter significant results (p_adj < 0.05)
Step 6: Visualization
- Create volcano plot
- Create PCA plot for QC
Data Type-Specific Considerations
RNA-seq (Bulk)
- Count data: Use DESeq2 or edgeR (negative binomial models)
- Normalization: Built into DESeq2/edgeR (size factors)
- Filtering: Remove low-count genes before analysis
- Multiple testing: Always apply FDR correction
- Statistical test: DESeq2/edgeR handle count data appropriately
Proteomics (Pre-quantified)
- Continuous data: Similar to normalized RNA-seq data
- Missing values: Common, especially for low-abundance proteins
- Assess missing mechanism (MCAR, MAR, MNAR)
- Use appropriate imputation method
- Normalization: Median, quantile, or total intensity normalization
- Statistical tests:
- Check normality and variance assumptions
- Use t-test if normal, Mann-Whitney if non-normal
- Always apply FDR correction
- Batch effects: Common in proteomics, check and correct if needed
Best Practices
- Exhaust validated pipelines before building anything custom. Run both the literature search and the consortium-workflow review before falling back to bespoke analysis. Rationale: validated pipelines have peer-reviewed benchmarking; novel methods require their own validation effort and reduce reproducibility.
- Perform sample-level QC before any statistical analysis. Use PCA + Isolation Forest for outlier detection, sample correlation matrices, and PCA + silhouette score for batch effects. Rationale: a single outlier sample or unrecognized batch effect can dominate test statistics and produce uninterpretable results regardless of the test chosen.
- Diagnose the missing-value mechanism (MCAR / MAR / MNAR) before imputing. Check the correlation between mean intensity and missingness rate per feature. Rationale: imputing MNAR data with KNN biases low-abundance features upward; imputing MCAR data with minprob biases the imputed values downward. Mechanism-aware imputation prevents systematic distortion.
- Always check test assumptions, then choose the test, never the reverse. Run Shapiro-Wilk / Anderson-Darling for normality and Levene's for variance homogeneity on a representative feature subset. Rationale: applying a t-test to non-normal small-n data inflates type I error; defaulting to Mann-Whitney on well-behaved data wastes power.
- Always apply FDR correction (Benjamini-Hochberg) for genome-wide tests. Report p_adj (or q-value), not raw p. Rationale: with 20,000 genes tested at α=0.05, ~1,000 false positives are expected without correction, so the uncorrected result set is meaningless.
- Document every parameter and version, save intermediate outputs, and pin random seeds. Record tool version, parameter values, normalization method, imputation method, test choice, FDR threshold, and the seed for any stochastic step. Rationale: omics pipelines have many tunable knobs; without exact provenance the analysis cannot be reproduced or audited.
- Validate findings on an independent dataset or with an orthogonal method whenever possible. Examples: confirm DE genes via qPCR, replicate in a public dataset (GEO, ArrayExpress), or compare across batches. Rationale: even FDR-controlled hits can be false positives driven by batch artifacts, contamination, or normalization choices.
Common Pitfalls
- Skipping QC and going directly to statistics. Problem: Outlier samples and batch effects produce false signals that pass statistical tests, polluting the result list with artifacts. How to avoid: Always run sample-level PCA, correlation matrices, and outlier detection before any differential test. Treat QC as mandatory, not optional.
- Imputing missing values with a one-size-fits-all method. Problem: Using mean imputation on MNAR proteomics data biases low-abundance proteins upward; using minprob on MCAR data pushes the imputed values below the detection limit regardless of their true abundance. How to avoid: Diagnose the mechanism (correlation between intensity and missingness), then pick an appropriate imputer: minprob for MNAR, KNN for MCAR/MAR.
- Using t-tests on non-normal or small-n data. Problem: Student's t-test assumes normality and (with pooled variance) equal variances; with n < 10 and skewed data, type I error inflates well above the nominal α. How to avoid: Run normality and variance tests first; use Welch's t-test for unequal variance, Mann-Whitney for non-normal, and permutation tests for n < 5.
- Reporting raw p-values without multiple testing correction. Problem: Across thousands of features, raw p-values produce massive false discovery rates; the resulting "significant" gene lists are dominated by noise. How to avoid: Always apply Benjamini-Hochberg FDR (or BY for dependent tests) and report adjusted p-values. Set p_adj < 0.05 (or q < 0.05) as the significance threshold.
- Confusing fold change with statistical significance. Problem: A high log2 fold change at high p_adj is unreliable noise; a low log2 fold change at very low p_adj may be real but biologically negligible. How to avoid: Filter on both; typical thresholds are |log2FC| > 1 AND p_adj < 0.05. Report effect sizes alongside p-values.
- Failing to correct for batch effects when present. Problem: Batch effects masquerade as biological signal, especially in proteomics and multi-cohort studies; PC1 ends up reflecting batch rather than condition. How to avoid: Check batch separation with PCA + silhouette score; if silhouette > ~0.3, apply ComBat or limma's removeBatchEffect, or include batch as a covariate in the model. For differential testing, prefer the covariate approach; limma's documentation reserves removeBatchEffect for visualization and clustering rather than as a step before linear modelling.
- Treating Option 3 (custom analysis) as a shortcut. Problem: Jumping straight to custom methods without first running standard workflows skips peer-reviewed validation and makes results harder to publish and reproduce. How to avoid: Document a clear justification for why Options 1 and 2 are inadequate before moving to Option 3, and validate any custom method on simulated or held-out data.
References
Pipelines and Tools
- DESeq2: https://bioconductor.org/packages/release/bioc/html/DESeq2.html — Love MI, Huber W, Anders S. Genome Biology 2014; 15:550. PMID: 25516281
- edgeR: https://bioconductor.org/packages/release/bioc/html/edgeR.html — Robinson MD, McCarthy DJ, Smyth GK. Bioinformatics 2010; 26(1):139-40.
- STAR aligner: https://github.com/alexdobin/STAR — Dobin A, et al. Bioinformatics 2013; 29(1):15-21. PMID: 23104886
- MaxQuant: https://www.maxquant.org/ — Cox J, Mann M. Nat Biotechnol 2008; 26(12):1367-72. PMID: 19029910
- limma: https://bioconductor.org/packages/release/bioc/html/limma.html — Ritchie ME, et al. Nucleic Acids Res 2015; 43(7):e47.
- ComBat (sva package): https://bioconductor.org/packages/release/bioc/html/sva.html — Johnson WE, Li C, Rabinovic A. Biostatistics 2007; 8(1):118-27.
Consortium Best Practices
- ENCODE RNA-seq pipeline: https://www.encodeproject.org/data-standards/rna-seq/
- GTEx Analysis Protocol: https://gtexportal.org/home/methods
- TCGA Analysis Protocols: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/
Statistical Methods
- Benjamini-Hochberg FDR: Benjamini Y, Hochberg Y. J. R. Stat. Soc. B 1995; 57(1):289-300.
- Missing-value imputation strategies in label-free proteomics: Lazar C, et al. J Proteome Res 2016; 15(4):1116-25.
- scipy.stats: https://docs.scipy.org/doc/scipy/reference/stats.html
- statsmodels multiple testing: https://www.statsmodels.org/stable/stats.html
Data Repositories for Validation
- GEO: https://www.ncbi.nlm.nih.gov/geo/
- ArrayExpress: https://www.ebi.ac.uk/biostudies/arrayexpress
- PRIDE (proteomics): https://www.ebi.ac.uk/pride/
Remember: Always start with validated pipelines (Option 1), then move to standard workflows (Option 2), and only use custom analysis (Option 3) when necessary. Document all steps and parameters for reproducibility. Quality control is essential at every stage of analysis. Always check statistical test assumptions before performing analysis.