---
name: genotex-benchmark-guide
description: "Benchmark for LLM agents on gene expression data analysis"
metadata:
  openclaw:
    emoji: "🧫"
    category: "domains"
    subcategory: "biomedical"
    keywords: ["GenoTEX", "gene expression", "benchmark", "LLM agent", "bioinformatics", "GEO"]
    source: "https://github.com/Liu-Hy/GenoTEX"
---
# GenoTEX Benchmark Guide
## Overview
GenoTEX is a benchmark for evaluating LLM-based agents on gene expression data analysis. It provides curated datasets from the Gene Expression Omnibus (GEO) with ground-truth analysis pipelines, and tests agents on data preprocessing, differential expression analysis, gene set enrichment, and biological interpretation. It was published at MLCB 2025 as an oral presentation.
## Benchmark Structure
```
GenoTEX Benchmark
├── Data Collection
│   └── Curated GEO datasets with ground truth
├── Task Categories
│   ├── Data preprocessing (QC, normalization)
│   ├── Differential expression analysis
│   ├── Gene set enrichment analysis
│   ├── Clustering and classification
│   └── Biological interpretation
├── Evaluation
│   ├── Code correctness (executes without error)
│   ├── Statistical validity (appropriate tests)
│   ├── Result accuracy (vs. ground truth)
│   └── Interpretation quality (biological insight)
└── Baselines
    ├── GPT-4 agent
    ├── Claude agent
    └── Domain-specific fine-tuned models
```
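The code-correctness criterion is execution-based: the agent's script must run to completion. As an illustration of how such a check can be implemented (a minimal sketch, not GenoTEX's actual harness), a script can be run in a subprocess with a timeout:

```python
import subprocess
import sys

def runs_cleanly(script_path: str, timeout: int = 300) -> bool:
    """Return True if the script exits with status 0 within the timeout."""
    try:
        result = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```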
## Usage
```python
from genotex import GenoTEXBenchmark

bench = GenoTEXBenchmark()

# List available tasks
tasks = bench.list_tasks()
for task in tasks[:5]:
    print(f"Task: {task.id}")
    print(f"  Dataset: {task.geo_accession}")
    print(f"  Category: {task.category}")
    print(f"  Difficulty: {task.difficulty}")

# Get a specific task
task = bench.get_task("GSE12345_DEG")
print(f"Description: {task.description}")
print(f"Input files: {task.input_files}")
print(f"Expected output: {task.expected_output_type}")
```
## Running Evaluations
```python
# Evaluate an agent on GenoTEX
from genotex import evaluate_agent

results = evaluate_agent(
    agent_fn=my_agent_function,
    tasks="all",           # or a list of specific task IDs
    timeout_per_task=300,  # seconds
)

print(f"Tasks completed: {results.completed}/{results.total}")
print(f"Code correctness: {results.code_correct_rate:.1%}")
print(f"Statistical validity: {results.stats_valid_rate:.1%}")
print(f"Result accuracy: {results.accuracy:.3f}")
```
## Task Examples
```python
# Example: differential expression analysis
task = {
    "id": "GSE12345_DEG",
    "description": "Identify differentially expressed genes "
                   "between treatment and control groups in "
                   "this RNA-seq dataset.",
    "input": "GSE12345_counts.csv",       # raw count matrix
    "metadata": "GSE12345_metadata.csv",  # sample info
    "expected": {
        "method": "DESeq2 or limma-voom",
        "output": "DEG table with log2FC, p-value, adj.p",
        "ground_truth": "GSE12345_deg_truth.csv",
    },
}
```
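The expected methods, DESeq2 and limma-voom, are R packages. As a rough Python illustration of the shape of this task (a simplified Welch t-test on log-CPM values, not a replacement for those methods; the file layout and `group` column name are assumptions about the example files):

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Assumed layout: genes x samples count matrix; metadata with a "group" column.
counts = pd.read_csv("GSE12345_counts.csv", index_col=0)
meta = pd.read_csv("GSE12345_metadata.csv", index_col=0)

# Library-size normalization to log2 counts-per-million.
cpm = counts / counts.sum(axis=0) * 1e6
logcpm = np.log2(cpm + 1)

treat = logcpm.loc[:, (meta["group"] == "treatment").values]
ctrl = logcpm.loc[:, (meta["group"] == "control").values]

# Welch t-test per gene; a real pipeline would use DESeq2 or limma-voom.
t, p = stats.ttest_ind(treat, ctrl, axis=1, equal_var=False)
log2fc = treat.mean(axis=1) - ctrl.mean(axis=1)
adj_p = multipletests(p, method="fdr_bh")[1]

deg = pd.DataFrame(
    {"log2FC": log2fc, "p_value": p, "adj_p": adj_p}, index=counts.index
)
deg.sort_values("adj_p").to_csv("GSE12345_deg_results.csv")
```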
```python
# Example: gene set enrichment
task = {
    "id": "GSE12345_GSEA",
    "description": "Perform gene set enrichment analysis on "
                   "the DEGs and identify enriched pathways.",
    "input": "GSE12345_deg_results.csv",
    "expected": {
        "method": "fgsea, clusterProfiler, or enrichR",
        "output": "Enriched pathways with NES and FDR",
    },
}
```
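fgsea and clusterProfiler are R packages, and enrichR is a web service. As a minimal Python stand-in for the over-representation flavor of this task, here is a hypergeometric test against a single hand-written gene set (the pathway genes and column names are placeholders; real runs would load MSigDB or KEGG collections):

```python
import pandas as pd
from scipy.stats import hypergeom

deg = pd.read_csv("GSE12345_deg_results.csv", index_col=0)
universe = set(deg.index)
hits = set(deg[deg["adj_p"] < 0.05].index)

# Placeholder pathway, restricted to genes measured in this dataset.
pathway = {"TP53", "MDM2", "CDKN1A", "BAX", "GADD45A"} & universe

overlap = len(hits & pathway)
# P(X >= overlap) when drawing len(hits) genes from the universe.
p = hypergeom.sf(overlap - 1, len(universe), len(pathway), len(hits))
print(f"overlap={overlap}, p={p:.3g}")
```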
## Use Cases
- Agent evaluation: Test bioinformatics agents on real tasks
- Method comparison: Compare LLM agents on genomics
- Benchmark development: Extend with new GEO datasets (see the sketch after this list)
- Teaching: Standard tasks for bioinformatics education
- Tool development: Test new analysis pipelines
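For the benchmark-development case, new GEO series can be fetched programmatically. A minimal sketch using the GEOparse library, assuming a series with sample-level expression values (the accession and paths are placeholders, and registering the result as a GenoTEX task would depend on the benchmark's internals):

```python
import GEOparse

# Download a GEO series and extract its expression matrix and annotations.
gse = GEOparse.get_GEO(geo="GSE12345", destdir="./data")
expr = gse.pivot_samples("VALUE")  # probes/genes x samples
meta = gse.phenotype_data          # per-sample annotations

expr.to_csv("./data/GSE12345_counts.csv")
meta.to_csv("./data/GSE12345_metadata.csv")
```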