---
name: harden
description: >
  Audits any substantial deliverable — documents, grant sections, strategies,
  protocols, skills, configs, roadmaps, plans — against domain-specific quality
  criteria before delivery. Searches the web for best practices, runs a dual-pass
  thematic and structural audit with severity classification, fixes all issues,
  and iterates until convergence. Activates when producing deliverables exceeding
  4 paragraphs or 20 lines, when the user says "harden", "audit", "stress test",
  or "review before delivering", or when re-auditing a modified deliverable for
  regressions. Does not activate for conversational responses, brainstorming,
  web search summaries, or rough drafts.
compatibility: >
  claude.ai, Claude Code, Codex CLI, Gemini CLI (requires web search for the
  research step; degrades gracefully without it)
license: MIT
---
# Harden
MANDATORY: Re-read this entire file before every audit pass. Do not rely on earlier readings within the same conversation. Context window drift causes skipped steps.
## Purpose
Ensures substantial deliverables are structurally sound, internally consistent, and aligned with domain best practices before delivery. Replaces ad hoc quality checks with a repeatable methodology.
## How It Works
- Research — Searches the web for domain-specific best practices before auditing (not just internal assumptions)
- Dual-pass audit — Checks both ideas (are they sound?) and structure (do paths/references/formatting work?)
- Severity classification — 🔴 Critical / 🟡 Significant / 🟢 Minor with specific locations
- Fix — Implements all critical and significant fixes (not just reports)
- Converge — Re-audits until only minor issues remain (max 3 passes)
## The Process
### Step 1: Research
Search for current best practices specific to the deliverable's domain.
Scope guidance:
- Minimum 2 searches for focused domains (e.g., a skill file → Anthropic skill docs)
- 3-5 searches for broad or unfamiliar domains (e.g., a grant aims page → NIH guidelines, successful examples, common reviewer complaints)
- Always check for official/canonical sources first, then practitioner insights
- Fetch the canonical source. If search results include the single most authoritative document for this domain (official guide, spec, standard), fetch and read it — do not rely on search snippets. Snippets capture ~5% of a document's content and miss testing, edge cases, and patterns. Limit to the first ~5000 tokens if the source is very long; prioritize sections on requirements, validation, and common mistakes.
- Synthesize findings into actionable bullets that inform the audit — not a reading list
If web search is unavailable: State this limitation explicitly. Proceed using training knowledge but flag that the audit lacks external validation.
### Step 2: Comparative Audit
Run TWO distinct passes within this step. Both are mandatory.
Pass A — Thematic audit (ideas level):
- Does the deliverable achieve its stated purpose?
- Are the ideas sound, complete, and non-contradictory?
- Does it align with research findings from Step 1?
- Are there gaps, unstated assumptions, or logical errors?
- Would the target audience (reviewer, agent, collaborator) find it clear?
Pass B — Structural audit (artifact level):
- Do all paths, filenames, and references resolve?
- Are names, terms, and conventions consistent throughout?
- Do cross-references point to real sections/files?
- Is formatting correct for the target format (markdown, YAML, code)?
- Are there orphaned sections, redundant content, or dangling references?
- If executable (a plan, a config): would a literal reader succeed?
Pass C — Regression check (only when re-hardening a modified deliverable):
Trigger detection: Apply Pass C when the deliverable was previously completed or delivered and is now being revised — indicated by the user referencing a prior version, requesting modifications to existing work, or re-submitting after changes. Examples: "update the roadmap I wrote earlier", "fix the issues you found", or any edit to an artifact that was previously declared done.
- What previously worked that this modification could break?
- Are existing cross-references still valid after the change?
- Do acceptance criteria from the original still hold?
- Has scope crept beyond the intended modification?
Severity classification for every issue found:
- 🔴 Critical — Will cause failure, incorrect behavior, or fundamental misunderstanding. Must fix.
- 🟡 Significant — Degrades quality, violates best practices, causes confusion, or wastes resources (tokens, time). Must fix.
- 🟢 Minor — Cosmetic, low-impact, stylistic. Fix if easy, skip if not.
Required output structure:
- What passes — Sections/aspects reviewed and confirmed correct. This proves the audit actually examined these areas rather than skipping them.
- What's missing — Gaps, absent sections, unstated assumptions.
- What's wrong — Errors, inconsistencies, violations. Each with severity and specific location (line number, section name, or quote).
- Prioritized fix list — All issues ordered 🔴 → 🟡 → 🟢 with specific remediation for each.
### Step 3: Update
Execute all 🔴 and 🟡 fixes. Do not cherry-pick. Do not defer significant issues. Fix 🟢 issues if they're adjacent to other changes.
After fixes, run a mechanical verification (see the sketch after this list):
- If the deliverable contains paths: grep/check all paths resolve
- If it contains names/terms: verify consistency
- If it references sections: confirm they exist
- If it has acceptance criteria: verify the artifact satisfies them
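A minimal sketch of such a check for a markdown deliverable, assuming it lives on disk (the filename, regexes, and anchor-slug rule below are illustrative, not part of this skill):

```python
# Mechanical verification sketch: confirm referenced paths exist and internal
# cross-references point at real headings. Filename and regexes are illustrative.
import re
from pathlib import Path

def verify(deliverable: Path) -> list[str]:
    text = deliverable.read_text(encoding="utf-8")
    problems = []

    # Relative paths in markdown links should resolve on disk.
    for match in re.finditer(r"\[[^\]]*\]\((?!https?://|#)([^)]+)\)", text):
        target = deliverable.parent / match.group(1)
        if not target.exists():
            problems.append(f"missing path: {match.group(1)}")

    # Internal anchors (#section) should point at real headings.
    headings = {
        re.sub(r"[^a-z0-9 -]", "", h.lower()).strip().replace(" ", "-")
        for h in re.findall(r"^#+\s+(.*)$", text, flags=re.M)
    }
    for anchor in re.findall(r"\]\(#([^)]+)\)", text):
        if anchor.lower() not in headings:
            problems.append(f"dangling cross-reference: #{anchor}")

    return problems

if __name__ == "__main__":
    for issue in verify(Path("DELIVERABLE.md")):
        print("FAIL:", issue)
```

Term consistency and acceptance criteria usually still need judgment; the point is that anything a machine can check, a machine should check.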
### Step 4: Diminishing Returns Check
Evaluate what this pass found:
- Any 🔴 critical issues? → Run another pass (return to Step 2)
- Any structural 🟡 issues? → Run another pass (return to Step 2)
- Only 🟢 minor issues? → Stop. Deliver.
Limits: 2 passes is typical; 3 is the maximum. If the audit is still finding critical issues after 3 passes, flag to the user that the deliverable may need fundamental restructuring rather than incremental fixes.
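The loop can be stated compactly. A sketch in Python, where `run_audit` and `apply_fixes` stand in for Steps 2 and 3 (in practice those are performed by the model, not a script):

```python
# Convergence sketch: re-audit until only minor issues remain, capped at 3 passes.
# Issue, run_audit, and apply_fixes are stand-ins for the model's own work.
from dataclasses import dataclass

@dataclass
class Issue:
    severity: str     # "critical" | "significant" | "minor"
    structural: bool  # True if found in Pass B
    note: str

MAX_PASSES = 3

def harden(deliverable, run_audit, apply_fixes):
    for _ in range(MAX_PASSES):
        issues = run_audit(deliverable)      # Step 2: dual-pass audit
        apply_fixes(deliverable, issues)     # Step 3: fix all 🔴 and 🟡
        needs_another_pass = any(
            i.severity == "critical"
            or (i.severity == "significant" and i.structural)
            for i in issues
        )
        if not needs_another_pass:
            return deliverable               # only minor issues remain: deliver
    raise RuntimeError("still finding serious issues after 3 passes; "
                       "the deliverable may need fundamental restructuring")
```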
Re-hardening trigger: When external standards change (new Anthropic docs, updated NIH guidelines, framework version bumps), previously hardened deliverables should be re-audited against the new standards. The user triggers this manually; the skill treats it as a fresh hardening with Pass C enabled.
## Domain Quality Criteria
When auditing, identify the relevant quality criteria for the deliverable type. For domains not listed below, generate criteria by asking: "What would an expert reviewer check before approving this?"
### Skills (Claude Code / Claude Chat)
- Frontmatter: name (lowercase, hyphens, <=64 chars; gerund form is conventional but not required), description (third person, non-empty, all triggers included; Anthropic Help Center states <=200 chars, ecosystem convention <=1024 chars — test auto-invocation if exceeding 200)
- If the skill has environment dependencies (web search, filesystem, MCP, specific platform), declare them in the `compatibility` frontmatter field AND mention the primary dependency in the description
- Body under 500 lines, context-efficient (no content Claude already knows)
- Progressive disclosure where applicable
- Trigger validation: ask Claude "when would you use [skill name]?" and compare its answer against intended triggers. Check for under-triggering (missing expected phrases) and over-triggering (matching unrelated queries)
- Instructions are specific and actionable, not buried or ambiguous; critical steps appear early in the document
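Several of the frontmatter and length constraints above are mechanically checkable. A minimal sketch, assuming a `---`-delimited YAML frontmatter block and PyYAML available (both are assumptions about the environment, not requirements of this skill):

```python
# Frontmatter and length checks for a SKILL.md-style file. Limits mirror the
# criteria above; the file path and use of PyYAML are assumptions.
import re
from pathlib import Path

import yaml  # pip install pyyaml

def check_skill_file(path: Path) -> list[str]:
    text = path.read_text(encoding="utf-8")
    match = re.match(r"^---\n(.*?)\n---\n", text, flags=re.S)
    if not match:
        return ["no ----delimited frontmatter block found"]

    meta = yaml.safe_load(match.group(1)) or {}
    problems = []

    name = meta.get("name") or ""
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name):
        problems.append("name must be lowercase words joined by hyphens")
    if len(name) > 64:
        problems.append("name exceeds 64 characters")

    description = (meta.get("description") or "").strip()
    if not description:
        problems.append("description is empty")
    elif len(description) > 1024:
        problems.append("description exceeds 1024 characters")
    elif len(description) > 200:
        problems.append("description exceeds 200 characters; test auto-invocation")

    body_lines = text[match.end():].splitlines()
    if len(body_lines) > 500:
        problems.append("body exceeds 500 lines")

    return problems
```

Third-person voice, trigger coverage, and auto-invocation behavior still have to be judged by reading and by asking the model, not by script.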
### Setup/Configuration Documents
- All paths consistent and valid, instructions imperative
- Acceptance criteria explicit and mechanically verifiable
- Self-documenting with graceful fallback for missing prerequisites
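To illustrate "mechanically verifiable": an acceptance criterion should reduce to a check that passes or fails without interpretation. A hedged example (the file name and required keys are invented for illustration):

```python
# Example acceptance check for a setup document that promises a working config.
# The path and required keys are invented for illustration.
import json
from pathlib import Path

def acceptance_check() -> bool:
    config = Path("app/config.json")
    if not config.exists():
        return False                        # criterion: the config file was created
    settings = json.loads(config.read_text(encoding="utf-8"))
    required = {"api_url", "timeout_seconds"}
    return required.issubset(settings)      # criterion: required keys are present

print("PASS" if acceptance_check() else "FAIL")
```

"The setup works" is not checkable by a literal reader; "this script prints PASS" is.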
### Grant Sections
- Significance (clear gap), innovation (distinct from prior art), approach (feasible, risks mitigated), alignment (budget matches scope)
- Reviewer-friendly: scannable, no undefined jargon
### Strategic Plans / Roadmaps
- Outcomes measurable, dependencies explicit, timeline realistic
- Risks identified with mitigations, prioritization rationale stated
### Protocols / SOPs
- Steps unambiguous and executable by stated audience
- Materials specified with versions, decision points identified
- Expected results stated for verification
### Implementation Plans / Execution Runbooks
- Every summary list (e.g., "Bootstrapping Steps") must have 1:1 correspondence with its detailed execution section. No step in one without a match in the other.
- For each step, verify preconditions are satisfied by prior steps (an ordering-check sketch follows this list). Flag: config created before its target exists, files referenced before creation, tools invoked before dependencies are available.
- Every CLI command, slash command, skill name, and tool referenced must actually exist. Verify against the known environment — not assumptions or recall. Flag speculative or hallucinated commands.
- Hook/automation rules must be achievable with their mechanism type (command hooks = grep/regex, prompt hooks = LLM yes/no, agent hooks = multi-turn). Flag rules requiring capabilities beyond their type.
- Every task must include a concrete verification method: a command with expected output, a file/state existence check, or an observable result. "Verify it works" and "run tests" without specifics are insufficient — a literal executor must be able to confirm success without follow-up.
- Verification methods must be falsifiable — they must be able to fail. "Check that the file exists" is falsifiable. "Review the output" is not.
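A minimal sketch of the ordering check referenced above, assuming each step has been annotated with what it creates and what it requires (the step schema and example plan are invented for illustration; a real plan would need parsing or manual annotation first):

```python
# Walk plan steps in order, tracking what exists, and flag anything used before
# it is created. The step schema and example plan are invented for illustration.
def check_ordering(steps: list[dict]) -> list[str]:
    available: set[str] = set()
    problems = []
    for number, step in enumerate(steps, start=1):
        for needed in step.get("requires", []):
            if needed not in available:
                problems.append(
                    f"step {number} ({step['name']}) uses '{needed}' before it exists"
                )
        available.update(step.get("creates", []))
    return problems

plan = [
    {"name": "write settings.json", "creates": ["settings.json"], "requires": []},
    {"name": "register hook", "creates": ["hook"], "requires": ["settings.json", "hooks dir"]},
    {"name": "create hooks dir", "creates": ["hooks dir"], "requires": []},
]
print(check_ordering(plan))  # flags: step 2 uses 'hooks dir' before it exists
```

The same annotation style makes per-task verification falsifiable: each step names the check that proves it ran.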
## Anti-Patterns
Performative skepticism: Adding self-critical commentary ("you might want to strengthen this") without running the actual audit process. Sounds rigorous, catches nothing structural.
Thematic-only auditing: Checking that ideas are sound while ignoring paths, names, formatting, and consistency. How 14 issues survived a "validated" deliverable in prior experience.
Structural-only auditing: The inverse — checking paths and formatting while missing that the ideas are wrong or incomplete.
Severity inflation: Marking everything 🔴 to appear thorough. Reserve critical for actual failure modes.
Audit without research: Skipping Step 1 and auditing against internal assumptions rather than external standards. The research step imports criteria that wouldn't be generated otherwise.
Snippet-depth research: Finding the canonical source but reading only search result excerpts instead of fetching the full document. Captures technical requirements but misses testing methodology, edge cases, and behavioral patterns. How 4 relevant quality criteria were missed despite the source appearing twice in search results.
Skipping "what passes": Only reporting failures without confirming what was reviewed and found correct. Hides gaps in audit coverage.
Static-only plan auditing: Checking that references resolve and formatting is correct while treating a sequential plan as a static document. Plans are executable — each step has preconditions. Auditing a plan without simulating execution order misses ordering violations, summary↔detail drift, and nonexistent tool references. How 7 issues (4 moderate) survived 4 audit passes on an implementation plan.
Verification-free planning: Approving an implementation plan where tasks lack concrete verification methods. Every task produces a result; every result can be checked. Plans without per-task verification shift the burden to the executor to invent checks — which means they often don't.