name: bio-stats-ml-reporting
description: Aggregate results, train ML models, and produce reports with validated references.
Bio Stats ML Reporting
When to use
- You need to aggregate results tables, train baseline ML models, and produce a report whose references have been validated.
Prerequisites
- Tools installed via pixi (see pixi.toml).
- Results tables and metadata are available.
Inputs
- results/.parquet or results/.tsv
- metadata.tsv
Outputs
- results/bio-stats-ml-reporting/models/
- results/bio-stats-ml-reporting/metrics.tsv
- results/bio-stats-ml-reporting/report.md
- results/bio-stats-ml-reporting/logs/
Steps
- Join outputs in DuckDB and build feature tables.
- Train baseline models and evaluate with cross-validation.
- Generate reports and validate references.
QC gates
- Model performance sanity checks pass.
- Reference validation passes.
- On failure: retry with alternative parameters; if the gate still fails, record the failure in the report and exit with a non-zero status.
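One way to wire up the retry-then-fail behaviour of the performance gate is sketched below. The thresholds (`floor`, `ceiling`), the function names, and the idea of treating a near-perfect score as suspicious are assumptions made for illustration; the source does not define what the sanity checks are.

```python
import sys


def passes_sanity(score, floor=0.55, ceiling=0.999):
    """Hypothetical sanity check: better than chance, not suspiciously perfect."""
    return floor <= score <= ceiling


def run_qc_gate(evaluate, param_sets, floor=0.55, ceiling=0.999):
    """Try each parameter set in order until one passes the sanity check.

    `evaluate` is any callable mapping a parameter dict to a score.
    Returns (exit_code, attempts); exit_code is 0 on success and 1 otherwise.
    """
    attempts = []
    for params in param_sets:
        score = evaluate(params)
        passed = passes_sanity(score, floor, ceiling)
        attempts.append({"params": dict(params), "score": score, "passed": passed})
        if passed:
            return 0, attempts
    return 1, attempts


def format_report_section(attempts):
    """Render the attempts as plain lines suitable for appending to report.md."""
    lines = ["QC gate: model performance"]
    for attempt in attempts:
        status = "PASS" if attempt["passed"] else "FAIL"
        lines.append(f"- {status} score={attempt['score']:.3f} params={attempt['params']}")
    return "\n".join(lines)


def main(evaluate, param_sets):
    """Run the gate, print the report section, and exit non-zero on failure."""
    exit_code, attempts = run_qc_gate(evaluate, param_sets)
    print(format_report_section(attempts))
    sys.exit(exit_code)
```

Here the first parameter set is the default configuration and any later ones are the alternatives to retry with. Every attempt is recorded, so the report shows what was tried even when the gate ultimately fails.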
Validation
- Verify input tables are readable and schema-consistent.
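A minimal sketch of this check for TSV inputs, using only the standard library, is shown below. The required key column `sample_id` and the rule that every table must share the first table's header are assumptions; Parquet inputs would need a reader such as DuckDB and are not covered here.

```python
import csv


def read_header(path):
    """Return the header row of a tab-separated file, or raise ValueError."""
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle, delimiter="\t"), None)
    if not header:
        raise ValueError("empty file or missing header")
    return header


def check_schema_consistency(paths, required=("sample_id",)):
    """Return a list of human-readable problems; an empty list means all checks passed.

    The first readable table defines the reference header that the others must match.
    """
    problems = []
    reference = None
    for path in paths:
        try:
            header = read_header(path)
        except (OSError, ValueError, csv.Error) as exc:
            problems.append(f"{path}: unreadable ({exc})")
            continue
        missing = [column for column in required if column not in header]
        if missing:
            problems.append(f"{path}: missing required columns {missing}")
        if reference is None:
            reference = header
        elif header != reference:
            problems.append(f"{path}: header {header} differs from reference {reference}")
    return problems
```

Returning a list of problems rather than raising on the first one lets the caller log every inconsistency before deciding whether to stop.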
Tools
- duckdb v1.4.3
- scikit-learn v1.8.0
- xgboost v3.1.3
- crossrefapi v1.7.0
Paper summaries (2023-2025)
- summaries/ (include example use cases and tool settings used)
Tool documentation
- DuckDB - In-process analytical database for data aggregation
- scikit-learn - Machine learning library
- XGBoost - Gradient boosting framework
- Crossref API - Reference validation and metadata retrieval
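Reference validation against Crossref could be structured as in the sketch below. The DOI pattern, the status labels, and the function names are assumptions for illustration. The metadata lookup is passed in as a callable so the logic can be exercised without network access.

```python
import re

# Commonly used pattern for modern DOIs; an assumption, not a complete DOI grammar.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def validate_references(dois, lookup):
    """Classify each DOI as 'malformed', 'not_found', or 'ok'.

    `lookup` is any callable that takes a DOI string and returns a metadata
    record, or a falsy value when the DOI cannot be resolved.
    """
    report = []
    for raw in dois:
        doi = raw.strip()
        if not DOI_PATTERN.match(doi):
            report.append((doi, "malformed"))
            continue
        record = lookup(doi)
        report.append((doi, "ok" if record else "not_found"))
    return report


def all_valid(report):
    """True only if every reference resolved successfully."""
    return all(status == "ok" for _, status in report)
```

With crossrefapi, `lookup` could be the `doi` method of a `Works` instance from `crossref.restful`, which queries the Crossref REST API. The sketch assumes that call returns a falsy value for an unknown DOI, which should be confirmed against the installed version.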
References
- See ../bio-skills-references.md