name: diagnose-clickhouse-clusters
description: Diagnose ClickHouse cluster health and provide concrete remediation.
Tool Usage Rules
- Call `collect_cluster_status` before drawing any conclusions about current cluster health.
- For RCA questions, call `collect_rca_evidence` directly when the symptom and target are already clear. Use `collect_cluster_status` first only when you need current health context, severity/outliers, or help choosing the RCA symptom/scope.
- Use only supported Phase 1 RCA symptoms: `high_part_count` and `unknown`.
- For bounded-time questions, use `status_analysis_mode="windowed"` and reuse the same time window in follow-up calls.
- If the user asks for a chart, use the `visualization` skill. Do not emit chart specs directly from this skill.
- Do not invent custom health-check SQL. Use tool outputs as the source of truth.
Workflow (MANDATORY)
- Determine whether the user asks for status only, or root cause ("why", "root cause", "reason", "caused by", "explain").
- For RCA questions, pick one supported canonical symptom key based on user wording, explicit target details, and, when needed, status findings.
- Explain from tool output only: top candidates, support score, evidence lists, gaps, and prioritized actions.
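The first workflow step can be illustrated with a small routing sketch. It assumes that matching the trigger phrases quoted above with a case-insensitive regex is good enough. The name `question_kind` and the regex approach are hypothetical and not part of the skill or its tools.

```python
import re

# Trigger phrases quoted in the workflow step above. Matching them with a
# regex is an illustrative assumption, not the skill's actual routing logic.
RCA_PATTERN = re.compile(
    r"\b(why|root cause|reason|caused by|explain)\b", re.IGNORECASE
)


def question_kind(question: str) -> str:
    """Return "rca" if the question contains an RCA trigger phrase, else "status"."""
    return "rca" if RCA_PATTERN.search(question) else "status"
```

A question classified as `"rca"` would then proceed to choosing one supported symptom key; everything else is treated as a status-only request.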
Severity Thresholds (Guidance)
- CRITICAL: replication lag > 300s, disk usage > 90%
- WARNING: replication lag > 60s, disk usage > 80%
- OK: metrics within normal ranges
Do not hardcode parts thresholds in responses. Use the thresholds and severities returned by `collect_cluster_status`.
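As a rough illustration of the replication-lag and disk-usage guidance above, the following sketch maps two metrics to a severity label. It assumes that breaching either threshold is enough to raise the level (the guidance does not say whether the conditions combine with "or" or "and"). The function name `classify` is hypothetical. Parts thresholds are deliberately left out, since those must come from `collect_cluster_status`.

```python
def classify(replication_lag_s: float, disk_usage_pct: float) -> str:
    """Map replication lag (seconds) and disk usage (percent) to a severity label.

    Assumption: either metric exceeding its threshold triggers the level.
    """
    if replication_lag_s > 300 or disk_usage_pct > 90:
        return "CRITICAL"
    if replication_lag_s > 60 or disk_usage_pct > 80:
        return "WARNING"
    return "OK"
```

Because the comparisons are strict, a lag of exactly 60s or disk usage of exactly 80% still counts as OK under this reading.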
Output Format (MANDATORY)
Use one of these two formats:
A) Status-only question
- Summary table: Always print a table title line exactly before the table: `### Summary`.

  | Status | Nodes with Issues | Checks Run | Timestamp |
  |---|---|---|---|
  | 🟢 OK / 🟠 WARNING / 🔴 CRITICAL | N | categories | ISO8601 |

- Findings by category: Always print a table title line exactly before the table: `### Findings by Category`. Use a markdown table (not bullets) with one row per category. Required columns:

  | Category | Status | Key Metrics | Top Outlier / Scope | Notes |
  |---|---|---|---|---|
  | parts / errors / replication / ... | 🟢 OK / 🟠 WARNING / 🔴 CRITICAL | concise metric values with thresholds | node/table if present, else `-` | one short phrase |

  Table rules:
  - Include all categories returned by `collect_cluster_status` in stable order.
  - Status must include both emoji and text (for example `🟠 WARNING`), never emoji-only.
  - Markdown table cells do not reliably support line breaks in this UI. Do not try to render multi-line bullets in a cell.
  - In `Key Metrics`, put the 1-2 most important metrics only (single-line, semicolon-separated if needed).
  - Put additional metrics in `Notes` as compact key/value items (single-line).
  - Put numeric values first (for example `max_parts_per_table=533 (>500)`), avoid prose-heavy sentences.
  - Always wrap database/table identifiers in backticks (for example `db.table` or `db`) in all table cells.
  - If category has sub-findings (for example top errors), keep them in `Notes` as compact comma-separated items.
  - If no outlier exists, set `Top Outlier / Scope` to `-`.

- Recommendations (max 3 items; each item = title + why + concrete SQL/command if needed).
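The table rules for status-only answers can be sketched as a small row renderer. The function names and argument shapes here are hypothetical; in particular, the sketch assumes the outlier is a database/table identifier and that metrics arrive as preformatted strings such as `max_parts_per_table=533 (>500)`.

```python
STATUS_EMOJI = {"OK": "🟢", "WARNING": "🟠", "CRITICAL": "🔴"}


def status_label(severity: str) -> str:
    """Render emoji plus text, never emoji-only."""
    return f"{STATUS_EMOJI[severity]} {severity}"


def findings_row(category, severity, key_metrics, outlier=None, notes=""):
    """Build one single-line markdown row for the Findings by Category table."""
    metrics = "; ".join(key_metrics[:2])          # 1-2 most important metrics only
    scope = f"`{outlier}`" if outlier else "-"    # backtick identifiers, "-" if none
    cells = [category, status_label(severity), metrics, scope, notes]
    return "| " + " | ".join(cells) + " |"
```

For example, `findings_row("parts", "WARNING", ["max_parts_per_table=533 (>500)"], "db.table", "merges lagging")` yields a single table row with the status rendered as `🟠 WARNING` and the identifier wrapped in backticks.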
B) RCA question ("why", "cause", "reason", "explain")
Use compact structure only:
- RCA Verdict: one sentence, max 30 words.
- Top Candidates: markdown table with max 3 rows: `cause | support_score | evidence`. In `evidence`, render up to 3 `evidence_for` items prefixed with `✓` and up to 2 `evidence_against` items prefixed with `✗`, separated by `<br/>`. When `excluded_candidates` is non-empty, include at least one excluded reason as a `✗` item for the most relevant row. Evidence fidelity rules:
  - Use only `candidate.evidence_for` and `candidate.evidence_against` from `collect_rca_evidence` for that row.
  - Do not pull extra lines from top-level `observations`, other candidates, or status output into the evidence cell.
  - Do not restate raw metrics unless they already appear inside `candidate.evidence_for` or `candidate.evidence_against`.
  - Preserve the candidate/tool counts: if helpful, you may mention `indicators_matched/indicators_checked`, but never imply more matched checks than the tool returned.
- Possible Actions: max 3 numbered items, sorted by impact. Formatting rule: print the line `3. **Possible Actions**`, then a blank line, then an indented nested numbered list using exactly `1.`, `2.`, `3.`. Do not continue the outer top-level numbering for action items.
- Gaps / Next Checks: max 2 bullets. Formatting rule: print the line `4. **Gaps / Next Checks**`, then a blank line, then indented bullets using exactly `-`.
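The evidence-cell rule for the Top Candidates table can be sketched as follows. This assumes each candidate is a dict whose `evidence_for` and `evidence_against` values are lists of strings. The real shape of the `collect_rca_evidence` output is not specified here, and `evidence_cell` is a hypothetical helper name.

```python
def evidence_cell(candidate: dict) -> str:
    """Render one candidate's evidence as a single table cell.

    Uses only this candidate's own lists: up to 3 supporting items,
    then up to 2 opposing items, joined with <br/>.
    """
    supporting = [f"✓ {item}" for item in candidate.get("evidence_for", [])[:3]]
    opposing = [f"✗ {item}" for item in candidate.get("evidence_against", [])[:2]]
    return "<br/>".join(supporting + opposing)
```

Because the function reads from a single candidate only, evidence lines cannot leak in from other candidates or from top-level observations.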
RCA brevity limits:
- Keep total RCA response under 220 words (excluding SQL command blocks).
- Do not add long background/theory paragraphs.
- Use direct statements and numeric evidence.
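One way to check the brevity limit is to count words after removing fenced blocks. The sketch below assumes that SQL command blocks are written as triple-backtick fences. The helper names are hypothetical.

```python
import re

# Build the fence marker programmatically so this example nests cleanly.
FENCE_MARK = "`" * 3
FENCED_BLOCK = re.compile(
    re.escape(FENCE_MARK) + r".*?" + re.escape(FENCE_MARK), re.DOTALL
)


def rca_word_count(response: str) -> int:
    """Count whitespace-separated words, ignoring fenced code blocks."""
    return len(FENCED_BLOCK.sub(" ", response).split())


def within_rca_limit(response: str, limit: int = 220) -> bool:
    """True when the response stays under the word limit."""
    return rca_word_count(response) < limit
```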
Critical Rules
- ALWAYS call `collect_cluster_status` before giving any opinion on current health.
- Use `status_analysis_mode="windowed"` when user asks for a bounded time window or historical context.
- For RCA questions, MUST call `collect_rca_evidence`. `collect_cluster_status` is optional unless current health context is needed.
- Do NOT state root causes without RCA evidence output.
- If `gaps[]` is non-empty, explicitly state what evidence is missing.
- If all candidates have `support_score < 0.3`, state that the RCA is inconclusive and use candidate `next_checks` plus `gaps` to explain what to inspect next.
- If best candidate is weak (`0.30-0.39`), present it as a possibility with caveats and emphasize candidate `next_checks`.
- Never fabricate or merge evidence lines across candidates. Candidate rows must be traceable directly to that candidate's `evidence_for` and `evidence_against`.
- If `collect_rca_evidence.related_symptoms` is non-empty, include a line `Related symptoms:` and list them.
- When follow-up questions omit time range, reuse the most recent explicit time window/range from prior turns.
- Never assume schema or table names; use only what tools return.
- Do not invent custom health-check SQL; use tool outputs as source of truth.
- Be concise and focus on remediation, not theory.
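The `support_score` rules above can be summarized in a short sketch. Two points are assumptions rather than stated rules: scores between 0.39 and 0.40 are treated as weak, and an empty candidate list is treated as inconclusive. The label `"supported"` and the function name `rca_confidence` are hypothetical.

```python
from typing import Sequence


def rca_confidence(support_scores: Sequence[float]) -> str:
    """Classify how strongly the best candidate is supported."""
    if not support_scores or max(support_scores) < 0.3:
        return "inconclusive"   # explain next_checks plus gaps
    if max(support_scores) < 0.4:
        return "weak"           # present as a possibility with caveats
    return "supported"
```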