name: coordexp-infer-eval-workflow
description: Use when launching, repairing, auditing, or summarizing CoordExp inference, confidence scoring, duplicate-control, COCO/LVIS-proxy evaluation, Oracle-K analysis, or benchmark artifact provenance.
CoordExp Infer Eval Workflow
Use the repo's YAML-first production path.
Do not invent one-off CLI flags when an existing config already captures the run.
When running interactively, wrap noisy commands with rtk if useful, but keep the PYTHONPATH=. prefix and the conda run -n ms python invocation.
Primary References
- canonical docs:
  - docs/eval/WORKFLOW.md
  - docs/eval/CONTRACT.md
  - docs/ARTIFACTS.md
- reusable runtime seams:
  - src/infer/pipeline.py::run_pipeline
  - src/infer/engine.py::InferenceEngine.infer
  - src/infer/artifacts.py::build_infer_summary_payload
  - src/eval/detection.py::evaluate_and_save
  - src/eval/artifacts.py
- standard infer entrypoint: scripts/run_infer.py
- confidence scoring: scripts/postop_confidence.py
- single-view evaluation: scripts/evaluate_detection.py
- one-run proxy bundle evaluation: scripts/evaluate_proxy_detection_bundle.py
Default Flow
config + checkpoint + input JSONL
-> inference
-> gt_vs_pred.jsonl
-> score materialization
-> gt_vs_pred_scored.jsonl
-> evaluation and optional duplicate-control guard
-> metrics.json / metrics_guarded.json / per_image.json / matches.jsonl / summaries
Score materialization depends on the coordinate surface:
- bbox_format: xyxy with coord tokens: run the confidence post-op.
- raw-text xyxy norm1000: run the confidence post-op with numeric-text span alignment; set infer.mode: text and infer.pred_coord_mode: norm1000, and do not rely on auto (see the sketch after this list).
- bbox_format: cxcy_logw_logh or cxcywh: do not run the confidence post-op; use the unified pipeline's deterministic constant-score compatibility artifact, and only for checkpoints trained on that serialization.
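For the raw-text norm1000 surface, here is a minimal sketch of the relevant keys, assuming they nest under an infer: block as the dotted names in this document suggest; all other fields are omitted, and a real file under configs/infer/ should be the starting point rather than this sketch:

```yaml
# Raw-text norm1000 surface: set the coordinate knobs explicitly; do not rely on auto.
infer:
  mode: text                 # raw-text decoding path
  pred_coord_mode: norm1000  # numeric-text coordinates on the norm1000 scale
  bbox_format: xyxy          # serialization the checkpoint was trained on
```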
Core Commands
Always run from repo root. Use YAML configs, not ad hoc shell overrides, for stable workflows.
Inference:
PYTHONPATH=. conda run -n ms python scripts/run_infer.py \
--config <infer_config.yaml>
Confidence post-op:
PYTHONPATH=. conda run -n ms python scripts/postop_confidence.py \
--config <postop_config.yaml>
Single evaluation:
PYTHONPATH=. conda run -n ms python scripts/evaluate_detection.py \
--config <eval_config.yaml>
Oracle-K analysis:
PYTHONPATH=. conda run -n ms python scripts/evaluate_oracle_k.py \
--config <oracle_k_config.yaml>
One-run proxy bundle evaluation:
PYTHONPATH=. conda run -n ms python scripts/evaluate_proxy_detection_bundle.py \
--config <bundle_eval_config.yaml>
COCO + LVIS Proxy Workflow
For COCO runs trained with LVIS proxy supervision:
- Infer once.
- Score once.
- Evaluate three GT views from the same scored artifact.
The standard view labels are:
- coco_real: original COCO GT only; this is the benchmark-aligned headline.
- coco_real_strict: COCO GT plus strict same-extent LVIS proxies.
- coco_real_strict_plausible: broad analysis view; useful for recall, least comparable to standard COCO.
Use existing configs under configs/infer/, configs/postop/, configs/eval/, and configs/bench/. Only reach for old concrete configs such as the COCO-1024 val_200_lvis_proxy_* set when the user is working on that exact historical run:
- configs/infer/coco_1024/val_200_lvis_proxy_merged.yaml
- configs/postop/coco_1024/val_200_lvis_proxy_merged.yaml
- configs/eval/coco_1024/val_200_lvis_proxy_bundle.yaml
Expected bundle outputs:
- <run_dir>/eval_coco_real/
- <run_dir>/eval_coco_real_strict/
- <run_dir>/eval_coco_real_strict_plausible/
- <run_dir>/proxy_eval_bundle_summary.json
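As a rough picture of how the three views hang off one scored artifact, here is a sketch of a bundle config. The field names below (scored_jsonl, views, name, gt_jsonl) are illustrative assumptions, not the repo schema; mirror configs/eval/coco_1024/val_200_lvis_proxy_bundle.yaml or another existing bundle config for the real keys:

```yaml
# Hypothetical field names for illustration only.
scored_jsonl: <run_dir>/gt_vs_pred_scored.jsonl            # one scored artifact feeds all views
views:
  - name: coco_real                                        # benchmark-aligned headline
    gt_jsonl: <coco_gt.jsonl>
  - name: coco_real_strict                                 # plus strict same-extent LVIS proxies
    gt_jsonl: <coco_gt_plus_strict_proxies.jsonl>
  - name: coco_real_strict_plausible                       # broad analysis view
    gt_jsonl: <coco_gt_plus_strict_plausible_proxies.jsonl>
```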
What To Verify
Before launch:
- infer config points to the intended gt_jsonl
- infer config uses the intended checkpoint or adapter shorthand
- root image directories resolve from config/provenance, especially when running from temp/ or sharded work dirs
- prompt controls match training when required: infer.prompt_variant, infer.object_field_order, infer.object_ordering
- coordinate surface is intentional: infer.mode, infer.pred_coord_mode, infer.bbox_format (a config sketch covering this checklist follows)
- benchmark scope is explicit:
  - dataset path
  - slice such as val200, limit=200, or full-val
  - decoding knobs and GPU launch shape when reporting timing
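A hedged sketch of the knobs this checklist covers: only the infer.* names come from this document; the top-level keys (gt_jsonl, checkpoint, image_root, limit) and their placement are assumptions, so copy an existing file under configs/infer/ rather than writing one from this sketch:

```yaml
# Key names outside infer.* are placeholders, not the repo schema.
gt_jsonl: <path/to/val_gt.jsonl>          # intended GT slice
checkpoint: <checkpoint_or_adapter>       # intended checkpoint or adapter shorthand
image_root: <canonical/image/root>        # must resolve even from temp/ or sharded work dirs
limit: 200                                # explicit slice; report it as limit=200, never as full-val
infer:
  prompt_variant: <match_training>        # prompt controls must match training when required
  object_field_order: <match_training>
  object_ordering: <match_training>
  mode: text                              # coordinate surface (see the Default Flow sketch)
  pred_coord_mode: norm1000
  bbox_format: xyxy
```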
After inference:
- <run_dir>/summary.json exists
- <run_dir>/gt_vs_pred.jsonl exists
- <run_dir>/resolved_config.json exists when using the YAML pipeline
- resolved_config.path is present next to gt_vs_pred.jsonl when downstream jobs need to recover the authoritative config
- prompt/order settings in summary.json and resolved_config.json match the config
After confidence post-op:
- <run_dir>/confidence_postop_summary.json exists
- <run_dir>/pred_confidence.jsonl exists for confidence-scored paths
- <run_dir>/gt_vs_pred_scored.jsonl exists
After non-canonical constant-score compatibility scoring:
- <run_dir>/gt_vs_pred_scored.jsonl exists
- do not expect pred_confidence.jsonl or confidence_postop_summary.json
After evaluation:
- raw metrics exist: metrics.json, per_image.json
- guarded companions exist when duplicate_control.enabled: true (see the sketch after this list): metrics_guarded.json, per_image_guarded.json, duplicate_guard_report.json
- bundle summary exists when using proxy bundle eval: <run_dir>/proxy_eval_bundle_summary.json
- each expected eval directory exists
- each eval directory contains metrics.json
- use the summary JSON as the default source for reporting cross-view metrics
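A minimal sketch of the duplicate-control knob in an eval config, assuming the dotted name above maps to a nested block; the surrounding keys (scored_jsonl, output_dir) are illustrative placeholders, not the repo schema:

```yaml
# Only metrics and duplicate_control.enabled are knobs named in this document;
# the other keys are placeholders. Start from an existing file under configs/eval/.
scored_jsonl: <run_dir>/gt_vs_pred_scored.jsonl
output_dir: <run_dir>/eval_coco_real
metrics: both                  # see Failure Modes for the COCO-proxy caveat
duplicate_control:
  enabled: true                # emits metrics_guarded.json, per_image_guarded.json, duplicate_guard_report.json
```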
For long or sharded runs:
- verify top-level merged artifacts and summaries, not just per-shard logs
- rerun only failed or missing shards when possible
- preserve canonical image roots or rewrite them explicitly before launching from scratch space
Reporting Guidance
When the user asks for performance:
- report bbox_AP, bbox_AP50, bbox_AP75, and the main F1-ish metric if available
- lead with coco_real
- clearly label strict and strict_plausible as additive proxy views
- state scope before comparing numbers: val200, limit=200, first-200, full-val, proxy view, raw-text vs coord-token, checkpoint id, and repetition penalty if relevant
- mention GT counts, kept/total prediction counts, or scorer repairs when they materially explain score shifts
- compare throughput only across compatible launch shapes; GPU count differences can make timing non-comparable even when accuracy is comparable
Failure Modes
- If metrics: both on a COCO proxy artifact seems to trigger LVIS-federated assumptions, inspect src/eval/detection.py routing and verify dataset-policy detection before trusting the output.
- If eval behavior is unclear, inspect src/eval/detection.py::EvalOptions and src/eval/detection.py::evaluate_and_save; scripts/evaluate_detection.py is a wrapper.
- If infer behavior is unclear, inspect src/infer/pipeline.py::run_pipeline and src/infer/engine.py::InferenceEngine.infer; scripts/run_infer.py is a wrapper.
- If images cannot be re-opened for visualization from derived artifacts, inspect provenance.source_jsonl_dir in the canonical visualization resource.
- If proxy-expanded GT counts look wrong, validate metadata.coordexp_proxy_supervision.object_supervision and the proxy-tier split before blaming the evaluator.
- If raw-text predictions appear to collapse after scoring, verify the confidence post-op used the numeric-text path instead of coord-token geometry alignment.
- If a non-canonical bbox-format result looks strong or weak, confirm the checkpoint was trained against that exact serialization before treating it as evidence.
Avoid
- Do not re-run inference when only evaluation views changed.
- Do not compare proxy-expanded numbers against standard COCO baselines without labeling them.
- Do not compare val200 or limit=200 against full-val without saying so.
- Do not treat guarded metrics as a replacement for raw model-output inspection; guarded artifacts are additive post-op views.
- Do not override config semantics with ad hoc shell flags unless the user explicitly asks for a one-off debug run.