Agents.md ― _spec_version: 0.3.3 (2025-07-21)
TL;DR
anml-exp is a pip-installable anomaly-detection playground with locked dependencies (uv.lock), an artefact registry, strict dataset hashing, and a structured hardware descriptor in benchmark outputs.
0 · Purpose & Non-Goals
This repository remains a rapid-prototyping and benchmarking framework for anomaly-scoring / detection algorithms.
Out of scope:
- Streaming / online detection
- Production serving or long-horizon monitoring
- Adversarial-attack tooling, drift dashboards, model-card generation
1 · Big Picture
LLM-powered agents collaborate with human maintainers to
- Implement a diverse zoo of anomaly-detection models.
- Expose a uniform API and artefact registry for seamless benchmarking.
- Automate dataset ingestion (SHA-256 verified), metric computation, and result logging.
- Guard code quality with linting, typing, tests, docs, and a locked dependency graph (uv.lock).
2 · Canonical Folder Layout
```
.
├── src/
│   └── anml_exp/
│       ├── __init__.py
│       ├── models/
│       ├── data/
│       ├── benchmarks/
│       ├── registry/      # NEW: model artefact versioning (#43)
│       ├── resources/
│       └── cli.py
├── tests/
├── docs/
├── pyproject.toml
├── uv.lock                # NEW: reproducible dependency lockfile (#42)
└── README.md
```
Historic note – the hidden `.agents/` folder mentioned in spec v0.2 has been retired; helpers live in `anml_exp/benchmarks/` and `anml_exp/registry/`.
3 · Common Schema & Interfaces
3.1 Base Model API
Every model must subclass `anml_exp.models.base.BaseAnomalyModel` and implement:
| Method / Property | Signature | Notes |
|---|---|---|
| `fit` | `def fit(self, X, y=None) -> Self` | |
| `score_samples` | `def score_samples(self, X) -> NDArray[float]` | Higher ⇒ more anomalous. |
| `predict` | `def predict(self, X, *, threshold=None) -> NDArray[int]` | |
| `decision_threshold` | `@property def decision_threshold(self) -> float` | |
| `save` / `load` | optional | Use `anml_exp.registry` for artefact versioning. |
anml_exp.registry stores model binaries and metadata under a semantic version
(MAJOR.MINOR.PATCH) with SHA-256 digests.
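A minimal sketch of a conforming model, for illustration only (the hypothetical `MeanDistanceModel` is not part of the repo; the actual contract is defined by `anml_exp.models.base.BaseAnomalyModel`):

```python
import numpy as np
from numpy.typing import NDArray

from anml_exp.models.base import BaseAnomalyModel


class MeanDistanceModel(BaseAnomalyModel):
    """Toy model: anomaly score = Euclidean distance to the training mean."""

    def fit(self, X: NDArray[np.float64], y=None) -> "MeanDistanceModel":
        self._center = X.mean(axis=0)
        # Calibrate the default threshold at the 95th percentile of training scores.
        self._threshold = float(np.quantile(np.linalg.norm(X - self._center, axis=1), 0.95))
        return self

    def score_samples(self, X: NDArray[np.float64]) -> NDArray[np.float64]:
        # Higher score => more anomalous, as required by the spec.
        return np.linalg.norm(X - self._center, axis=1)

    def predict(self, X: NDArray[np.float64], *, threshold: float | None = None) -> NDArray[np.int_]:
        t = self.decision_threshold if threshold is None else threshold
        return (self.score_samples(X) >= t).astype(int)

    @property
    def decision_threshold(self) -> float:
        return self._threshold
```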
3.2 Dataset Registry
from anml_exp.data import load_dataset
X_train, y_train = load_dataset("kddcup99", split="train")
• Each dataset module must declare SHA-256 hashes for every file (see the sketch after this list).
• load_dataset verifies each hash before extraction; mismatch ⇒ HashError
(#44).
• Deterministic splits (seed = 42).
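For illustration, a dataset module's hash declaration and verification could look roughly like this (the names `_FILES`, `_sha256`, and `verify` are assumptions; only `HashError` is named in the spec):

```python
import hashlib
from pathlib import Path

# Hypothetical per-dataset declaration: file name -> expected SHA-256 hex digest.
_FILES = {
    "kddcup.data_10_percent.gz": "<expected-sha256-hex>",
}


class HashError(RuntimeError):
    """Raised when a downloaded file does not match its declared digest."""


def _sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(root: Path) -> None:
    # A mismatch aborts the load before any extraction happens.
    for name, expected in _FILES.items():
        actual = _sha256(root / name)
        if actual != expected:
            raise HashError(f"{name}: expected {expected}, got {actual}")
```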
⸻
3.3 Metrics & Result Schema
Benchmarks report (a computation sketch follows this list):
• ROC-AUC
• PR-AUC (Average Precision)
• F1 @ best Youden threshold
• Mean wall-time per 1 000 samples
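A hedged sketch of how these metrics could be computed with scikit-learn (`compute_metrics` is illustrative, not an existing helper; the best Youden threshold maximizes TPR − FPR over the ROC curve):

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score, roc_curve


def compute_metrics(y_true: np.ndarray, scores: np.ndarray) -> dict:
    """y_true in {0, 1}; higher score => more anomalous."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    best = int(np.argmax(tpr - fpr))          # Youden's J statistic
    threshold = float(thresholds[best])
    return {
        "roc_auc": float(roc_auc_score(y_true, scores)),
        "pr_auc": float(average_precision_score(y_true, scores)),
        "f1": float(f1_score(y_true, scores >= threshold)),
        "threshold": threshold,
    }
```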
Each run is saved to `results/{exp_name}/{model_name}.json` and must validate against `anml_exp/resources/results-schema.json`.
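Validation can be checked locally with the standard `jsonschema` package (an assumed dev dependency; paths shown are illustrative):

```python
import json
from pathlib import Path

from jsonschema import validate  # assumed dev dependency

schema = json.loads(Path("src/anml_exp/resources/results-schema.json").read_text())
result = json.loads(Path("results/demo/isolation_forest.json").read_text())
validate(instance=result, schema=schema)  # raises jsonschema.ValidationError on mismatch
```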
Structured hardware descriptor (#45)
"hardware": {
"device_type": "GPU",
"vendor": "NVIDIA",
"model": "RTX A6000",
"driver": "535.104",
"num_devices": 1,
"notes": "desktop workstation"
}
Minimal example
{
"$schema": "./results-schema.json",
"dataset": "kddcup99",
"model": "isolation_forest",
"model_version": "0.1.0",
"n_samples": 145586,
"seed": 42,
"hardware": {
"device_type": "CPU",
"vendor": "Intel",
"model": "i7-1185G7",
"driver": "N/A",
"num_devices": 1,
"notes": "laptop"
},
"roc_auc": 0.921,
"pr_auc": 0.604,
"f1": 0.432,
"threshold": 0.79,
"fit_time": 1.23,
"score_time": 0.02,
"params": {"n_estimators": 100, "max_samples": "auto"},
"artefact_digest": "sha256:13f0…"
}
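For CPU-only runs the hardware block can be populated in a best-effort way from the standard library; a rough sketch (the helper name is hypothetical, and GPU runs would need vendor/driver queries instead):

```python
import platform


def cpu_hardware_block(notes: str = "") -> dict:
    """Best-effort hardware block for a CPU-only run (hypothetical helper)."""
    return {
        "device_type": "CPU",
        "vendor": "unknown",  # a real helper might parse /proc/cpuinfo or use py-cpuinfo
        "model": platform.processor() or platform.machine() or "unknown",
        "driver": "N/A",
        "num_devices": 1,
        "notes": notes,
    }
```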
⸻
4 · Agent Roles
| Agent | Intent | Success Criteria |
|---|---|---|
| Builder | Generate / extend code (models, loaders, registry). | API compliance, passes tests, artefact registered. |
| Evaluator | Run benchmarks & aggregate metrics. | JSON validates, hardware descriptor correct. |
| Reviewer | Static analysis, typing, docs, tests, perf. | CI green (ruff, mypy, pytest, hash check, lock diff). |
⸻
5 · Contribution Workflow
flowchart TD
draft["Builder → Draft PR"]
review["Reviewer → CI checks"]
maintainer["Human → Merge / Request changes"]
draft --> review --> maintainer
CI additionally ensures:
• uv sync --frozen produces identical env (#42).
• Dataset SHA-256s match declared values (#44).
⸻
6 · Coding Standards
• Dependency lock: uv.lock is the single source of truth.
• PEP 8 via ruff; PEP 561 typing (mypy --strict).
• Speed up mypy in CI by caching `.mypy_cache` and installing `mypy[faster-cache]` via `uv pip`, so the environment runs the optimized wheels.
• NumPy-style docstrings (see the example after this list).
• pyproject.toml + uv.lock define mandatory and optional extras.
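Example of the expected NumPy-style docstring (illustrative only):

```python
def score_samples(self, X):
    """Compute anomaly scores for each row of ``X``.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Input data.

    Returns
    -------
    ndarray of shape (n_samples,)
        Anomaly scores; higher values indicate more anomalous samples.
    """
    ...
```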
⸻
7 · Testing Strategy
• Unit + property tests (see the sketch after this list).
• Hash-verification tests for every dataset file.
• CI fails if uv lock --check detects drift.
• Perf suite (tests/perf/) skipped in CI.
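As an illustration, a property test of the scoring contract might look like this (assuming `hypothesis` is available; the `IsolationForestModel` import path is a guess):

```python
import numpy as np
from hypothesis import given, strategies as st
from hypothesis.extra.numpy import arrays

from anml_exp.models import IsolationForestModel  # import path is an assumption


@given(arrays(np.float64, (50, 4), elements=st.floats(-1e3, 1e3)))
def test_scores_have_one_value_per_sample(X):
    model = IsolationForestModel().fit(X)
    scores = model.score_samples(X)
    assert scores.shape == (X.shape[0],)
    assert np.all(np.isfinite(scores))
```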
⸻
8 · Installation & Quick-Start
# Reproducible dev install
uv sync --frozen
pip install -e ".[torch,plot]"
# After release:
pip install anml-exp[torch,plot]
CLI:
anml-exp benchmark --dataset toy-blobs \
--model isolation_forest \
--output results/demo.json
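The same run can also be scripted from Python using the documented pieces; a rough equivalent (the `IsolationForestModel` import path is an assumption):

```python
from anml_exp.data import load_dataset
from anml_exp.models import IsolationForestModel  # import path is an assumption

X_train, _ = load_dataset("toy-blobs", split="train")
X_test, y_test = load_dataset("toy-blobs", split="test")

model = IsolationForestModel().fit(X_train)
scores = model.score_samples(X_test)  # higher => more anomalous
labels = model.predict(X_test)        # thresholded at model.decision_threshold
```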
⸻
9 · Road-Map
| Milestone | Owner | Exit Criteria |
|---|---|---|
| M0 – Skeleton | Builder | Base class, dataset registry (SHA-256), artefact registry, CI, uv.lock. |
| M1 – Classical Benchmark | Evaluator | 3 tabular datasets; JSON outputs pass new schema. |
| M2 – Deep Models | Builder | AutoEncoder, DeepSVDD, USAD registered & versioned. |
| M3 – Time-Series Support | Builder + Evaluator | Loader + STOMP baseline + benchmarks. |
⸻
10 · Open Questions
1. Unified config system (omegaconf) – still pending.
2. Preferred experiment tracker (mlflow, wandb, plain JSON).
3. CPU vs GPU determinism in CI.
4. Sandboxing policy for code-gen agents.
⸻
11 · Meta
• _spec_version bumped → 0.3.3 (adds #42–#45).
• See CONTRIBUTING.md for human-targeted guidelines.
• results-schema.json is the machine-readable contract.
Last updated – 2025-07-20 @ 20:55 AEST