---
name: llm-evaluation
author: JM Labs (Javier Montaño)
version: 1.0.0
description: >
  Model output quality assessment, hallucination detection, benchmark suites.
  [EXPLICIT] Trigger: "llm evaluation"
allowed-tools:
  - Read
  - Write
  - Glob
  - Grep
  - Bash
---
# LLM Evaluation
"Method over hacks."
## TL;DR
Assesses model output quality, detects hallucinations, and runs benchmark suites. [EXPLICIT]
## Procedure
### Step 1: Discover
- Gather evaluation inputs: prompts, model outputs, and any reference answers or requirements (sketched below)
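A minimal sketch of the gathering step, assuming evaluation cases live in a JSONL file with `prompt`, `output`, and optional `reference` fields; the path `evals/cases.jsonl` and the schema are illustrative assumptions, not something this skill prescribes.

```python
import json
from pathlib import Path

def load_eval_cases(path: str) -> list[dict]:
    """Load evaluation cases from a JSONL file.

    Assumed schema: one JSON object per line with 'prompt',
    'output', and an optional 'reference' (ground truth).
    """
    cases = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():                  # skip blank lines
            cases.append(json.loads(line))
    return cases

cases = load_eval_cases("evals/cases.jsonl")  # hypothetical location
print(f"{len(cases)} cases, {sum('reference' in c for c in cases)} with references")
```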
### Step 2: Analyze
- Evaluate candidate assessment methods per Constitution XIII/XIV (decision shape sketched below)
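Constitution XIII/XIV are not reproduced in this document, so the selection rule below (references present means reference-based scoring, otherwise rubric-based review) is an assumed stand-in that shows only the shape of the decision, not the Constitution's actual criteria.

```python
def choose_method(cases: list[dict]) -> str:
    """Pick an evaluation method from the available artifacts.

    Stand-in rule only; the real selection criteria are defined in
    Constitution XIII/XIV, which this sketch does not encode.
    """
    if not cases:
        raise ValueError("no evaluation cases to analyze")
    with_refs = sum("reference" in c for c in cases)
    if with_refs == len(cases):
        return "reference-based"  # ground truth exists: compare directly
    if with_refs == 0:
        return "rubric-based"     # no ground truth: score against a rubric
    return "mixed"                # apply each method where it fits
```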
### Step 3: Execute
- Run the chosen evaluations and tag every finding with evidence tags (harness sketched below)
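A sketch of the execute step under two loud assumptions: the hallucination check (numbers in the output that appear in neither prompt nor reference) is a deliberately crude illustrative heuristic, not this skill's mandated detector, and the `[INFERRED]` tag is hypothetical; `[EXPLICIT]` is the only evidence tag shown elsewhere in this document.

```python
import re

def score_case(case: dict) -> dict:
    """Score one case and attach an evidence tag.

    The unsupported-numbers heuristic is illustrative only; real
    hallucination detection needs entailment checks or human review.
    [INFERRED] is an assumed companion tag to [EXPLICIT].
    """
    reference = case.get("reference", "")
    source = case["prompt"] + " " + reference
    numbers = re.findall(r"\d+(?:\.\d+)?", case["output"])
    return {
        "exact_match": case["output"].strip() == reference.strip() if reference else None,
        "unsupported_numbers": [n for n in numbers if n not in source],
        "evidence": "[EXPLICIT]" if reference else "[INFERRED]",
    }

# Inline cases for self-containment; in practice use Step 1's loader.
demo = [
    {"prompt": "What is 2+2?", "output": "4", "reference": "4"},
    {"prompt": "Name a prime.", "output": "17, first catalogued in 1998."},
]
for result in (score_case(c) for c in demo):
    print(result)
```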
### Step 4: Validate
- Verify that the quality criteria below are met (a validation sketch follows that list)
## Quality Criteria
- Evidence tags applied
- Constitution-compliant
- Actionable output
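A sketch of mechanically checking the first and third criteria, assuming findings are dicts with `evidence` and `action` fields (hypothetical names); Constitution compliance is not mechanically checkable here and is left to human review.

```python
KNOWN_TAGS = {"[EXPLICIT]", "[INFERRED]"}  # [INFERRED] is assumed, not sourced

def validate_findings(findings: list[dict]) -> list[str]:
    """Return quality-criteria violations for a findings report."""
    problems = []
    for i, finding in enumerate(findings):
        if finding.get("evidence") not in KNOWN_TAGS:
            problems.append(f"finding {i}: missing or unrecognized evidence tag")
        if not finding.get("action"):
            problems.append(f"finding {i}: no actionable recommendation")
    return problems

print(validate_findings([{"evidence": "[EXPLICIT]", "action": "tighten the prompt"}]))  # []
```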
## Usage
Example invocations:
- "/llm-evaluation" — Run the full llm evaluation workflow
- "llm evaluation on this project" — Apply to current context
## Assumptions & Limits
- Assumes access to project artifacts (code, docs, configs) [EXPLICIT]
- Requires English-language output unless otherwise specified [EXPLICIT]
- Does not replace domain expert judgment for final decisions [EXPLICIT]
## Edge Cases
| Scenario | Handling |
|---|---|
| Empty or minimal input | Request clarification before proceeding |
| Conflicting requirements | Flag conflicts explicitly, propose resolution |
| Out-of-scope request | Redirect to appropriate skill or escalate |
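One way the table's handling column could be encoded, assuming the caller has already judged scope and detected any conflicting requirements; the function name and return strings are illustrative, not part of this skill.

```python
def route(request: str, in_scope: bool, conflicts: list[str]) -> str:
    """Encode the handling column of the edge-case table above."""
    if not request.strip():
        return "request clarification"  # empty or minimal input
    if conflicts:
        # flag conflicts explicitly, then propose a resolution
        return "flag conflicts: " + "; ".join(conflicts)
    if not in_scope:
        return "redirect to appropriate skill or escalate"
    return "proceed"

print(route("evaluate these outputs", in_scope=True, conflicts=[]))  # proceed
```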