name: holmesgpt-skill
description: Guide for implementing HolmesGPT - an AI agent for troubleshooting cloud-native environments. Use when investigating Kubernetes issues, analyzing alerts from Prometheus/AlertManager/PagerDuty, performing root cause analysis, configuring HolmesGPT installations (CLI/Helm/Docker), setting up AI providers (OpenAI/Anthropic/Azure), creating custom toolsets, or integrating with observability platforms (Grafana, Loki, Tempo, DataDog).
HolmesGPT Skill
AI-powered troubleshooting for Kubernetes and cloud-native environments.
Overview
HolmesGPT is a CNCF Sandbox project that connects AI models with live
observability data to investigate infrastructure problems, find root
causes, and suggest remediations. It operates with read-only access
and respects RBAC permissions, making it safe for production environments.
Quick Reference
| Topic | Reference |
|-------|-----------|
| Installation | references/installation.md |
| Configuration | references/configuration.md |
| Data Sources | references/data-sources.md |
| Commands | references/commands.md |
| Troubleshooting | references/troubleshooting.md |
| HTTP API | references/http-api.md |
| Integrations | references/integrations.md |
Key Features
Root Cause Analysis: Investigates alerts and cluster issues
# custom-toolset.yaml
toolsets:
  my-custom-tool:
    description: "Custom diagnostic tool"
    tools:
      - name: check_service_health
        description: "Check health of a specific service"
        command: |
          curl -s http://{{ service_name }}.{{ namespace }}.svc.cluster.local/health
        parameters:
          - name: service_name
            description: "Name of the service"
          - name: namespace
            description: "Kubernetes namespace"
Use with: holmes ask "check health" -t custom-toolset.yaml
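To make the link between the `{{ ... }}` placeholders in `command` and the declared parameters concrete, here is a minimal sketch that expands the template with a regular expression. It is an illustration only: `render_command` is a hypothetical helper, not part of HolmesGPT, and the tool's actual template rendering may behave differently (for example around quoting or missing values). The values `payments` and `prod` are made up.

```python
import re

# Command template copied from the custom toolset above.
TEMPLATE = "curl -s http://{{ service_name }}.{{ namespace }}.svc.cluster.local/health"

# Matches "{{ name }}" with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_command(template: str, params: dict) -> str:
    """Hypothetical helper: replace each {{ name }} marker with params[name]."""

    def substitute(match):
        name = match.group(1)
        if name not in params:
            # Fail loudly rather than emit a half-rendered command.
            raise KeyError(f"missing parameter: {name}")
        return params[name]

    return _PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    # Example values are assumptions for illustration.
    print(render_command(TEMPLATE, {"service_name": "payments", "namespace": "prod"}))
```

Raising on a missing parameter keeps a partially filled URL from ever being executed, which is usually the safer default for diagnostic commands.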
Kubernetes Annotations for Integration
# Add to Services/Deployments for HolmesGPT context
metadata:
  annotations:
    holmesgpt.dev/runbook: |
      This service handles payment processing.
      Common issues: database connectivity, API rate limits.
      Check: kubectl logs -l app=payment-service
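If the annotation has to be added to a workload that is already deployed, one option is a JSON merge patch. The sketch below only builds the patch body; `runbook_patch` is a hypothetical helper, and the annotation key is taken from the snippet above rather than verified against HolmesGPT itself.

```python
import json

# Runbook text from the annotation example above.
RUNBOOK = (
    "This service handles payment processing.\n"
    "Common issues: database connectivity, API rate limits.\n"
    "Check: kubectl logs -l app=payment-service\n"
)


def runbook_patch(text: str) -> str:
    """Hypothetical helper: JSON merge patch that sets the runbook annotation."""
    patch = {"metadata": {"annotations": {"holmesgpt.dev/runbook": text}}}
    return json.dumps(patch)


if __name__ == "__main__":
    # The printed JSON can be supplied to `kubectl patch ... --type merge -p`.
    print(runbook_patch(RUNBOOK))
```

A merge patch touches only the listed annotation, so existing annotations on the object are left in place.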
Environment Variables Reference
| Variable | Description | Default |
|----------|-------------|---------|
| HOLMES_CONFIG_PATH | Config file path | ~/.holmes/config.yaml |
| HOLMES_LOG_LEVEL | Log verbosity | INFO |
| PROMETHEUS_URL | Prometheus server URL | - |
| GITHUB_TOKEN | GitHub API token | - |
| DATADOG_API_KEY | DataDog API key | - |
| CONFLUENCE_BASE_URL | Confluence URL | - |
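As a rough pre-flight check, the variables listed in the environment variable reference above can be resolved against their documented defaults before starting an investigation. This is a sketch under stated assumptions: `resolve_settings` and `unset_variables` are hypothetical helpers, not HolmesGPT functions, and the example Prometheus URL is made up.

```python
import os

# Names and defaults from the environment variable reference above;
# None means no default is documented.
DEFAULTS = {
    "HOLMES_CONFIG_PATH": "~/.holmes/config.yaml",
    "HOLMES_LOG_LEVEL": "INFO",
    "PROMETHEUS_URL": None,
    "GITHUB_TOKEN": None,
    "DATADOG_API_KEY": None,
    "CONFLUENCE_BASE_URL": None,
}


def resolve_settings(env=None):
    """Hypothetical helper: read each variable, falling back to its default."""
    source = os.environ if env is None else env
    return {name: source.get(name, default) for name, default in DEFAULTS.items()}


def unset_variables(settings):
    """Return the sorted names of variables that have neither a value nor a default."""
    return sorted(name for name, value in settings.items() if not value)


if __name__ == "__main__":
    settings = resolve_settings()
    print("log level:", settings["HOLMES_LOG_LEVEL"])
    print("not configured:", ", ".join(unset_variables(settings)) or "none")
```

Passing a plain dictionary instead of `os.environ` (for example `resolve_settings({"PROMETHEUS_URL": "http://prometheus:9090"})`) makes the check easy to exercise without touching the real environment.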
Best Practices
Use Specific Queries: Include namespace, deployment name, symptoms
Start with Claude Sonnet 4.0/4.5: Best accuracy for complex investigations
Enable Relevant Toolsets: Only enable what you need to reduce noise
Use Interactive Mode: For complex multi-step investigations
Set Up Runbooks: Provide context for known alert types
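The first practice, specific queries, can be sketched as a small helper that folds the namespace, deployment name, and symptoms into a single question for `holmes ask`. The helper name `build_ask_command`, the question wording, and the example values are assumptions for illustration.

```python
import shlex


def build_ask_command(namespace, deployment, symptoms):
    """Hypothetical helper: build a `holmes ask` argument list with a specific question."""
    question = (
        f"Why is deployment {deployment} in namespace {namespace} unhealthy? "
        f"Symptoms: {'; '.join(symptoms)}"
    )
    return ["holmes", "ask", question]


if __name__ == "__main__":
    cmd = build_ask_command("prod", "payment-service", ["pods restarting", "HTTP 503 responses"])
    # shlex.join quotes the question so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Keeping the command as an argument list avoids shell-quoting problems if it is later handed to `subprocess.run`.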