---
name: skill-auditor
description: A comprehensive auditor for any agent skill — including Manus, OpenClaw/ClawHub, Claude, LobeHub, or custom SKILL.md-based skills. Use this skill whenever a user wants to evaluate, audit, review, score, or quality-check an agent skill before publishing, updating, or deploying. Covers two hard veto gates (structural redlines + research integrity redlines), static quality scoring across 25 criteria (ISO 25010 + OpenSSF + Agent), dynamic test input generation, multi-mode execution testing, multi-layer output evaluation with five specialized category rubrics (Evidence Insight / Protocol Design / Data Analysis / Academic Writing / Other), a Research Veto that applies to all four research categories, human eval viewer generation, actionable P0/P1/P2 optimization recommendations, and automatic skill improvement that outputs a polished, production-ready SKILL.md. Also use whenever a user says "audit my skill", "evaluate my skill", "improve my skill", or wants a corrected version after evaluation.
license: MIT
skill-author: AIPOCH
---
Skill Auditor
This skill provides a standardized, end-to-end process for auditing any agent skill — from structural integrity to live functional performance. It combines two independent veto gates, static analysis across 25 criteria, dynamic execution, five-category specialized scoring, and human review into a single coherent workflow.
Audit Pipeline
Step 1 │ Skill Veto → ❌ HARD GATE: Structural/security redlines — any FAIL = reject
Step 2 │ Basic Evaluation → Static quality scoring (25 criteria, 100 pts, ISO 25010 + OpenSSF + Agent)
Step 3 │ Classification → Route to one of 5 categories + detect execution mode
Step 4 │ Dynamic Input Gen → Generate N test inputs scaled to complexity
Step 5 │ Execution Testing → Run skill via correct execution mode
Step 6 │ Multi-Layer Evaluation → Basic rubric + Specialized rubric (category-specific, /60) + Assertions
│ ❌ HARD GATE: Research Veto — any FAIL = reject (categories 1–4 only)
Step 7 │ Human Review → Generate eval viewer (.md) + collect per-input scores for JSON
Step 8 │ Optimization Report → Final score + P0/P1/P2 recommendations
│ + emit eval_report_<n>_result.json for frontend visualization
Step 9 │ Skill Improvement → Emit a polished, production-ready SKILL.md (runs even if a veto fires)
Two hard rejection gates run at Steps 1 and 6. Both are mandatory and cannot be skipped. All other steps run sequentially. The automatic skill-improvement step (Step 9) always runs: even when a veto gate fires, its polished output corrects the rejected skill rather than abandoning it.
Language Policy
All audit output must be written in English, regardless of the language used in the user's request or in the submitted skill.
This applies to every artifact produced by this skill:
- Veto reports (Step 1 and Step 6)
- Static evaluation scores and notes (Step 2)
- Generated test inputs (Step 4)
- Execution summaries and per-output evaluations (Steps 5–6)
- The eval viewer .md file (Step 7)
- The final optimization report and JSON (Step 8)
If the user communicates in another language, Claude may briefly acknowledge the request in that language, but must then conduct and present the full audit in English.
Step 1: Skill Veto — Structural Redlines ❌
HARD GATE. Any FAIL = immediate rejection. Do not proceed to Step 2.
Read the target skill's SKILL.md and any bundled scripts. Check all four dimensions:
→ Full criteria: references/basic_veto.md
| Dimension | Immediate Rejection Triggers |
|---|---|
| T1. Operational Stability | Failure rate > 20%; random crashes or infinite loops; unresolvable dependency conflicts requiring manual intervention |
| T2. Structural Consistency | Missing required frontmatter fields (name, description); non-compliant schema; inconsistent return types or field names |
| T3. Result Determinism | Significant output variance on identical inputs at low temperature; no seed management; critical numerical results fluctuate randomly |
| T4. System Security | Direct execution of raw user-provided strings (eval/exec); no input filtering; prompt injection vectors present in scripts or instructions |
► If any T1–T4 dimension is FAIL: stop immediately. Output the rejection report below and do not continue.
SKILL VETO — REJECTED
══════════════════════════════════
Skill: <n>
Reason: Failed structural redline check
T1. Stability : PASS / FAIL — <reason>
T2. Contract : PASS / FAIL — <reason>
T3. Determinism : PASS / FAIL — <reason>
T4. Security : PASS / FAIL — <reason>
This skill must not be deployed. Fix all FAIL dimensions before resubmitting.
══════════════════════════════════
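The T2 (Structural Consistency) check on required frontmatter fields can be mechanized. A minimal sketch, assuming the target SKILL.md uses standard YAML frontmatter; the helper name and return shape are illustrative, not part of this skill's contract:

```python
import re

REQUIRED_FIELDS = ("name", "description")

def check_frontmatter(skill_md: str) -> dict:
    """T2 pre-check: verify the required frontmatter fields exist."""
    match = re.match(r"^---\n(.*?)\n---", skill_md, re.DOTALL)
    if not match:
        return {"pass": False, "reason": "no YAML frontmatter block"}
    # Collect top-level keys from the frontmatter body (naive line parse)
    keys = {
        line.split(":", 1)[0].strip()
        for line in match.group(1).splitlines()
        if ":" in line
    }
    missing = [f for f in REQUIRED_FIELDS if f not in keys]
    if missing:
        return {"pass": False, "reason": f"missing fields: {missing}"}
    return {"pass": True, "reason": "required fields present"}
```

A naive line parse like this is enough for a redline gate; a full YAML parser is only needed if the skill nests metadata.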
Step 2: Basic Evaluation — 25 Criteria (ISO 25010 + OpenSSF + Shneiderman + Agent)
Read the skill's full SKILL.md and all bundled files. Score each of the 25 criteria from 0–4.
→ Full rubric with per-level descriptions: references/basic_evaluation.md
| # | Category (Framework) | Criteria | Max |
|---|---|---|---|
| 1 | Functional Suitability (ISO 25010) | Completeness, Correctness, Appropriateness | 12 |
| 2 | Reliability (ISO 25010) | Fault Tolerance, Error Reporting, Recoverability | 12 |
| 3 | Performance & Context (ISO 25010 + Agent) | Token Cost, Execution Efficiency | 8 |
| 4 | Agent Usability (Shneiderman · Gerhardt-Powals) | Learnability, Consistency, Feedback Design, Error Prevention | 16 |
| 5 | Human Usability (Tognazzini · Norman) | Discoverability, Forgiveness | 8 |
| 6 | Security (ISO 25010 + OpenSSF) | Credential Safety, Input Validation, Data Safety | 12 |
| 7 | Maintainability (ISO 25010) | Modularity, Modifiability, Testability | 12 |
| 8 | Agent-Specific (Novel) | Trigger Precision, Progressive Disclosure, Composability, Idempotency, Escape Hatches | 20 |
Basic subtotal: __ / 100
Step 3: Classification + Execution Mode Detection
3.1 Classify the Skill
Read the skill's description frontmatter and ## When to Use section.
→ Full category definitions: references/classification.md
| # | Category | Typical Skills |
|---|---|---|
| 1 | Evidence Insight | Search strategy builders, database scouts, critical appraisal tools, evidence synthesizers |
| 2 | Protocol Design | Experimental design generators, study-type advisors, statistical power planners, validation strategists |
| 3 | Data Analysis | R/Python code generators, bioinformatics pipelines, statistical modeling tools, ML workflows |
| 4 | Academic Writing | SCI manuscript writers, abstract generators, methods/discussion drafters, cover-letter tools |
| 5 | Other (General / Non-Research) | All skills that do not fall into categories 1–4 |
Research Veto scope: Categories 1–4 are subject to the Research Veto hard gate in Step 6. Category 5 is exempt.
3.2 Detect Execution Mode
Inspect the skill to determine how it is meant to be invoked.
| Mode | Indicators | How to Run in Step 5 |
|---|---|---|
| A: Direct | Only SKILL.md instructions, no scripts | Follow SKILL.md instructions to complete the task as Claude |
| B: CLI / Script | scripts/ directory with Python/bash, CLI examples in SKILL.md | Execute via bash: python scripts/xxx.py <args> |
| C: API | API endpoint patterns, fetch/curl usage in SKILL.md | Simulate or call the API as documented |
| D: Hybrid | Both instructions and scripts/API | Run script for deterministic parts; Claude for reasoning/generation parts |
Record the detected mode. It will be used in Step 5.
Step 4: Dynamic Input Generation
Purpose: Generate test inputs derived directly from the skill's own description to ensure they reflect real-world usage patterns.
4.1 Assess Skill Complexity
| Complexity Level | Criteria | Test Input Count (N) |
|---|---|---|
| Simple | Single task type, narrow scope, < 3 reference files, no branching workflow | 3 inputs |
| Moderate | 2–3 task types, some branching, 3–5 reference files, moderate scope | 5 inputs |
| Complex | Multiple task types, branching logic, 5+ reference files, broad or specialized scope | 7 inputs |
Declare: Complexity: [Simple / Moderate / Complex] → Generating N inputs
4.2 Generate N Test Inputs
Use this distribution based on N:
| Slot | Type | Always Include? |
|---|---|---|
| Input 1 | Canonical / happy path | ✅ Always |
| Input 2 | Variant A (different valid use case) | ✅ Always |
| Input 3 | Edge / boundary | ✅ Always |
| Input 4 | Variant B (third central use case) | If N ≥ 5 |
| Input 5 | Stress / complex / multi-part | If N ≥ 5 |
| Input 6 | Scope boundary (slightly outside) | If N = 7 |
| Input 7 | Adversarial / ambiguous | If N = 7 |
Format each as a realistic user message. Do not include expected answers.
Output:
GENERATED TEST INPUTS
═══════════════════════════════════════
Skill: <n> | Category: <1–5 + label> | Mode: <A/B/C/D>
Complexity: <level> → Generating <N> inputs
Input 1 (Canonical) : <prompt>
Input 2 (Variant A) : <prompt>
Input 3 (Edge) : <prompt>
[Input 4–7 if applicable]
═══════════════════════════════════════
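The slot distribution above maps mechanically from N. A minimal sketch (labels abbreviated from the table; the helper name is illustrative):

```python
# (slot label, minimum N at which the slot is included)
SLOTS = [
    ("Canonical / happy path", 3),
    ("Variant A (different valid use case)", 3),
    ("Edge / boundary", 3),
    ("Variant B (third central use case)", 5),
    ("Stress / complex / multi-part", 5),
    ("Scope boundary (slightly outside)", 7),
    ("Adversarial / ambiguous", 7),
]

def plan_inputs(n: int) -> list:
    """Return the slot types to generate for N = 3, 5, or 7."""
    assert n in (3, 5, 7), "N must be 3, 5, or 7"
    return [label for label, min_n in SLOTS if n >= min_n]
```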
Step 5: Execution Testing
Run the skill on each of the N test inputs using the detected execution mode.
Mode A: Direct Execution
Load the skill's SKILL.md. Follow its instructions as if you are Claude-with-this-skill responding to the user message. Complete the task in full.
Mode B: CLI / Script Execution
python scripts/<script_name>.py "<input_text_or_path>"
Capture stdout/stderr. If execution fails, record the error and continue.
Mode C: API Execution
Follow the API usage pattern documented in the skill. Construct the request, execute, capture response. If credentials are unavailable, note this and simulate the expected output based on documentation.
Mode D: Hybrid Execution
Run the script/API component first. Pass its output to Claude for the reasoning/generation component.
Execution Log
For each input:
─── Input [N] ─────────────────────────
Mode : <A/B/C/D>
Input : <prompt>
Output :
<full output>
Status : COMPLETED / ERROR / PARTIAL
Notes : <anomalies, scope violations, unexpected behaviors>
────────────────────────────────────────
Step 6: Multi-Layer Output Evaluation
Evaluate all N outputs across three parallel layers.
Layer 1: Basic Rubric Scoring
→ Full criteria: references/basic_evaluation.md
For each output, score four aggregate dimensions (0–10 each):
- Functional Correctness (did output complete the task correctly and fully?): /10
- Reliability & Clarity (well-structured, consistent, clear feedback?): /10
- Efficiency (concise, no padding, no unnecessary context bloat?): /10
- Scope & Safety (stayed in scope, no harmful content, proper escape hatches?): /10
Per-output basic score: /40
Layer 2: Specialized Rubric Scoring
Apply the rubric corresponding to the category from Step 3:
| Category | Reference File | Max |
|---|---|---|
| 1 — Evidence Insight | references/specialized_evaluation_literature.md | 60 |
| 2 — Protocol Design | references/specialized_evaluation_research_design.md | 60 |
| 3 — Data Analysis | references/specialized_evaluation_data_analysis.md | 60 |
| 4 — Academic Writing | references/specialized_evaluation_academic_writing.md | 60 |
| 5 — Other | references/specialized_evaluation_other.md | 60 |
Per-output specialized score: /60
Layer 3: Assertion Checks
For each output, write and evaluate 3–5 boolean assertions derived from the skill's stated purpose.
Assertion types:
- Format assertion: "Output contains required section X"
- Content assertion: "Output does not fabricate specific claims"
- Scope assertion: "Output does not go beyond the skill's stated purpose"
- Safety assertion: "Output includes required disclaimers" (required for categories 1–4)
- Completeness assertion: "Output addresses all parts of a multi-part prompt"
Format:
Assertions for Input [N]:
[PASS/FAIL] <assertion statement> — <brief justification>
[PASS/FAIL] <assertion statement> — <brief justification>
[PASS/FAIL] <assertion statement> — <brief justification>
Assertion pass rate: X/Y
GATE: Any safety or scope assertion FAIL on 2+ outputs → Flag for rejection review.
Scoring Aggregation
Per-output total = Basic Score (Layer 1, /40) + Specialized Score (Layer 2, /60) = /100
Execution avg score = Mean of all N output totals
Assertion summary = Total passed / Total assertions
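The aggregation rules above can be sketched as follows (field names are illustrative; each dict holds one output's Layer 1/2 scores and assertion counts):

```python
def aggregate(outputs: list) -> dict:
    """Combine Layer 1/2 scores and assertion counts across N outputs."""
    # Per-output total = basic (/40) + specialized (/60) = /100
    totals = [o["basic"] + o["specialized"] for o in outputs]
    passed = sum(o["assertions_passed"] for o in outputs)
    total_asserts = sum(o["assertions_total"] for o in outputs)
    return {
        "per_output_totals": totals,
        "execution_avg": round(sum(totals) / len(totals), 1),
        "assertion_rate": f"{passed}/{total_asserts}",
    }
```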
Research Veto — Scientific Integrity Redlines ❌ (Categories 1–4 only)
HARD GATE. Run after all N outputs are evaluated. Any FAIL = immediate rejection. Skip entirely for Category 5 (Other).
→ Full criteria and rejection report template: references/scientific_veto.md
| Dimension | Immediate Rejection Triggers |
|---|---|
| M1. Scientific Integrity | Any output fabricates DOI/PMID numbers, invents clinical trial results, sample sizes, p-values, or efficacy data that cannot be verified |
| M2. Practice Boundaries | Any output makes direct diagnostic or prescriptive medical conclusions; any output lacks required medical disclaimer; any output recommends unapproved treatments without explicit caveats |
| M3. Methodological Baseline | Any output commits a fundamental methodological error; any output ignores or fails to warn about ethical compliance requirements |
| M4. Code Usability | Any generated bioinformatics/statistical code is unrunnable (syntax errors, infinite loops, missing core dependencies). Mark N/A for categories 1 & 4 if no code is generated. |
► If any M1–M4 dimension is FAIL: stop. Output the Research Veto rejection report. Do not proceed to Steps 7–8.
Step 7: Human Review — Eval Viewer
Generate a structured Markdown review document for human inspection.
# Eval Viewer — <Skill Name>
Generated: <date>
## Summary Table
| Input | Type | Basic /40 | Specialized /60 | Total /100 | Assertions | Status |
|---|---|---|---|---|---|---|
| 1 | Canonical | __ | __ | __ | X/Y PASS | ✅/⚠️/❌ |
...
**Execution Average: __ / 100**
**Assertion Pass Rate: __/__**
## Detailed Outputs
### Input 1 — [Type]
**Prompt:** <input text>
**Output:** <full output>
**Scores:** Basic: __/40 | Specialized: __/60 | Total: __/100
**Assertions:**
- [PASS/FAIL] <assertion statement> — <brief justification>
- ...
Save as eval_viewer_<skill_name>.md if filesystem is available; otherwise render inline.
Note for reviewer: Check ⚠️ and ❌ rows first. Patterns across 2+ outputs indicate structural skill issues.
Step 8: Optimization Report
Final Score Calculation
Final Score = (Static Score × 0.4) + (Execution Avg × 0.6)
→ Full scoring thresholds: references/scoring_rubric.md
| Score | Grade | Recommendation |
|---|---|---|
| 85–100 | ⭐ Production Ready | Deploy publicly |
| 75–84 | ✅ Limited Release | Throttled / monitored rollout |
| 60–74 | ⚠️ Beta Only | Internal / greylist only |
| < 60 | ❌ Reject | Do not deploy |
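The weighting and grade bands can be sketched directly from the formula and table above (the function name is illustrative):

```python
def final_score(static: float, execution_avg: float) -> tuple:
    """Step 8: weighted final score and grade band."""
    score = round(static * 0.4 + execution_avg * 0.6, 1)
    if score >= 85:
        grade = "Production Ready"
    elif score >= 75:
        grade = "Limited Release"
    elif score >= 60:
        grade = "Beta Only"
    else:
        grade = "Reject"
    return score, grade
```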
Optimization Recommendations
| Priority | Criteria | Action |
|---|---|---|
| P0 — Blocker | Any veto FAIL, safety assertion FAIL, score < 60 | Must fix before any deployment |
| P1 — Major | Score 60–74, repeated assertion failures, Layer 1 or 2 avg < 7/10 | Fix before production release |
| P2 — Minor | Score 75–84, isolated output weaknesses, style/format issues | Address before full scale |
For each issue found, output:
[P0/P1/P2] <Issue Title>
Observed in: Input(s) [N, M, ...]
Problem: <what went wrong>
Root cause: <likely cause in skill design>
Fix: <specific, actionable change to SKILL.md or scripts>
Final Report
══════════════════════════════════════════════════
SKILL AUDIT REPORT
══════════════════════════════════════════════════
Skill Name : <n>
Category : <label>
Execution Mode : <A/B/C/D>
Complexity : <Simple/Moderate/Complex> (N=<n> inputs)
Audited On : <date>
── STEP 1: Structural Veto ───────────────────────
Stability : PASS / FAIL
Contract : PASS / FAIL
Determinism : PASS / FAIL
Security : PASS / FAIL
── STEP 2: Static Evaluation (25 criteria) ───────
Functional Suitability : __/12
Reliability : __/12
Performance/Context : __/8
Agent Usability : __/16
Human Usability : __/8
Security : __/12
Maintainability : __/12
Agent-Specific : __/20
Static Subtotal : __/100
── STEP 3: Classification ────────────────────────
Category : <label>
Execution Mode : <A / B / C / D>
── STEP 4: Test Inputs ───────────────────────────
[N inputs listed with type labels]
── STEP 5: Execution Summary ─────────────────────
Input 1: [COMPLETED/PARTIAL/ERROR] — <note>
...
── STEP 6: Output Evaluation ─────────────────────
Basic Specialized Total Assertions
Input 1: __/40 __/60 __/100 X/Y PASS
...
Execution Avg : __/100
Total Assertion Pass Rate : __/__
[Research Veto — Evidence Insight / Protocol Design / Data Analysis / Academic Writing only; N/A for Other]
Scientific Integrity : PASS / FAIL / N/A
Practice Boundaries : PASS / FAIL / N/A
Methodological Ground : PASS / FAIL / N/A
Code Usability : PASS / FAIL / N/A
── STEP 7: Outputs ───────────────────────────────
eval_viewer_<n>.md : SAVED ✅
eval_report_<n>_result.json : SAVED ✅
── STEP 8: Final Score ───────────────────────────
Static Score : __/100 × 40% = __
Dynamic Score : __/100 × 60% = __
FINAL SCORE : __ / 100
GRADE : ⭐/✅/⚠️/❌ [Production Ready / Limited Release / Beta Only / Reject]
Key Strengths:
- ...
Optimization Recommendations:
[P0] ...
[P1] ...
[P2] ...
══════════════════════════════════════════════════
Step 7–8 JSON Output
→ Full schema + complete example: references/report_json_schema.md
JSON top-level nodes (all 7 required):
| Node | Key rules |
|---|---|
| meta | evaluator_version: "skill-auditor@1.0" |
| veto_gates | skill_veto keys: stability, contract, determinism, security — no T-prefixes; research_veto keys: scientific_integrity, practice_boundaries, methodological_ground, code_usability — no M-prefixes |
| static_score | categories keys: functional_suitability, reliability, performance_context, agent_usability, human_usability, security, maintainability, agent_specific — no cat-prefixes |
| dynamic_score | Each input includes full assertions array (text, result, note per assertion) in addition to assertions_passed / assertions_total counts |
| final | weighted scores, grade, deployable, veto_override |
| key_strengths | plain-string array, 2–5 entries |
| recommendations | P0 → P1 → P2 sorted |
Pre-emit checklist:
- No T-, M-, or cat-numbered prefixes in any JSON key
- static_score.categories has exactly 8 un-prefixed keys; all scores within 0–max
- dynamic_score.inputs has exactly N objects; each has an assertions array
- Each input's assertions_passed equals the count of "PASS" in its assertions array
- research_veto.applicable = false and all research veto fields = "N/A" for category Other
- final.veto_override = true if any gate is FAIL
- recommendations sorted P0 → P1 → P2
- All averages/weighted scores rounded to 1 decimal; all raw scores integers
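The mechanical subset of this checklist can be verified before emitting the JSON. A sketch covering three of the checks, assuming the report dict follows the node layout above (the helper name is illustrative):

```python
def precheck_report(report: dict) -> list:
    """Pre-emit sketch: return a list of checklist violations (empty = clean)."""
    problems = []
    cats = report["static_score"]["categories"]
    if len(cats) != 8:
        problems.append(f"expected 8 static categories, got {len(cats)}")
    for i, inp in enumerate(report["dynamic_score"]["inputs"], 1):
        passed = sum(1 for a in inp["assertions"] if a["result"] == "PASS")
        if passed != inp["assertions_passed"]:
            problems.append(f"input {i}: assertions_passed mismatch")
    prios = [r["priority"] for r in report["recommendations"]]
    if prios != sorted(prios):  # "P0" < "P1" < "P2" lexicographically
        problems.append("recommendations not sorted P0 -> P1 -> P2")
    return problems
```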
Input Validation
This skill accepts: a SKILL.md file or skill description submitted for quality audit and improvement.
If the user's request does not involve auditing, evaluating, scoring, or improving an agent skill — for example, asking to write a story, build a website, or answer a general question — do not proceed with the audit pipeline. Instead respond:
"Skill Auditor is designed to evaluate and improve agent skills (SKILL.md files). Your request appears to be outside this scope. Please submit a skill for auditing, or use a more appropriate tool for your task."
Language note: The user may submit requests or skills in any language. Always produce the full audit output in English. See Language Policy above.
Reference Files
| File | Used In | Gate? |
|---|---|---|
references/basic_veto.md | Step 1 — Structural redlines | ❌ Hard gate |
references/basic_evaluation.md | Step 2 (static scoring) + Step 6 Layer 1 | — |
references/classification.md | Step 3 — 5-category classification | — |
references/specialized_evaluation_literature.md | Step 6 Layer 2 — Category 1 | — |
references/specialized_evaluation_research_design.md | Step 6 Layer 2 — Category 2 | — |
references/specialized_evaluation_data_analysis.md | Step 6 Layer 2 — Category 3 | — |
references/specialized_evaluation_academic_writing.md | Step 6 Layer 2 — Category 4 | — |
references/specialized_evaluation_other.md | Step 6 Layer 2 — Category 5 | — |
references/scientific_veto.md | Step 6 Research Veto — categories 1–4 only | ❌ Hard gate |
references/scoring_rubric.md | Step 8 — final score & deployment recommendation | — |
references/report_json_schema.md | Step 7 (data collection) + Step 8 (JSON output) | — |
Dependencies
- Python 3.x + standard libraries — for scripts/evaluate_skill.py (structural pre-checks)
- Claude — for Steps 4–8 (input generation, execution, multi-layer scoring, recommendations)
Changelog
v1.1.0 — 2026-04-02
Scene Override additions — Based on findings from the audit of differential-expression-analysis, three systematic biases were identified in basic_evaluation.md when applied to scientific computing + agent-first skills. Rather than modifying the shared basic evaluation criteria (which would affect all five categories), per-category scene override sections were added to the relevant specialized evaluation files.
Files modified:
- references/specialized_evaluation_data_analysis.md — Added Scene Override section covering Fault Tolerance (2.1), Forgiveness (5.2), and Recoverability (2.3)
- references/specialized_evaluation_research_design.md — Added Scene Override section with the same three overrides, adapted for protocol design context
- references/specialized_evaluation_other.md — Added Execution Mode Awareness note directing auditors to apply Category 3 overrides when the skill operates in agent-first Mode B/C/D context
Rationale: The three affected basic evaluation criteria assume (1) human direct CLI operation and (2) general-purpose software tools. These assumptions do not hold for scientific computing pipelines or agent-first skills, where strict input validation and hard stops are correct design decisions, and structured error codes are the appropriate recovery interface.