name: kubernetes-skill description: "Prevent Kubernetes hallucinations by diagnosing and fixing failure modes: insecure workload defaults, resource starvation, network exposure, privilege sprawl, fragile rollouts, and API drift. Use when generating, reviewing, refactoring, or migrating manifests, Helm charts, Kustomize overlays, cluster policies, and platform-specific Kubernetes work for EKS, GKE, AKS, OpenShift, GitOps controllers, or observability stacks."
KubeShark: Failure-Mode Workflow for Kubernetes
Run this workflow top to bottom.
1) Capture execution context
Record before writing manifests:
- cluster version (e.g. 1.30, 1.31) and distribution (EKS, GKE, AKS, k3s, vanilla)
- target namespace and environment criticality (dev/staging/prod)
- workload type (Deployment, StatefulSet, Job, CronJob, DaemonSet)
- deployment method (raw YAML, Helm, Kustomize, operator-managed)
- policy enforcement (Pod Security Admission level, Kyverno, OPA/Gatekeeper)
- cloud provider and CNI (affects networking, storage classes, load balancers)
- platform controllers/add-ons (GitOps, observability, ingress, service mesh, autoscaling)
If unknown, state assumptions explicitly.
2) Diagnose likely failure mode(s)
Select one or more based on user intent and risk:
- insecure workload defaults: missing security contexts, PSS violations, host access
- resource starvation: missing requests/limits, no PDB, scheduling chaos
- network exposure: flat networking, missing policies, wrong Service types, DNS issues
- privilege sprawl: overly permissive RBAC, leaked secrets, excess ServiceAccount rights
- fragile rollouts: misconfigured probes, mutable tags, unsafe update strategies
- API drift: wrong apiVersion, deprecated APIs, schema violations, tool-specific errors
3) Load only the relevant reference file(s)
Primary failure-mode references:
references/insecure-workload-defaults.mdreferences/resource-starvation.mdreferences/network-exposure.mdreferences/privilege-sprawl.mdreferences/fragile-rollouts.mdreferences/api-drift.md
Supplemental references (only when needed):
references/deployment-patterns.mdreferences/stateful-patterns.mdreferences/job-patterns.mdreferences/daemonset-operator-patterns.mdreferences/security-hardening.mdreferences/observability.mdreferences/multi-tenancy.mdreferences/storage-and-state.mdreferences/helm-patterns.mdreferences/kustomize-patterns.mdreferences/validation-and-policy.mdreferences/examples-good.mdreferences/examples-bad.mdreferences/do-dont-patterns.md
Conditional Reference Retrieval (CRR) references (load only when the signal is detected):
references/conditional/eks-patterns.mdfor EKS, AWS, IRSA, EKS Pod Identity, AWS Load Balancer Controller, EBS/EFS CSI, Karpenterreferences/conditional/gke-patterns.mdfor GKE, Autopilot, Workload Identity Federation for GKE, Dataplane V2, GCE Ingress, Config Syncreferences/conditional/aks-patterns.mdfor AKS, Microsoft Entra Workload ID, Azure CNI, AGIC, Azure Disk/File/Blob CSIreferences/conditional/openshift-patterns.mdfor OpenShift, OKD, ROSA, ARO, Routes, SCCs, OLM,ocreferences/conditional/gitops-controllers.mdfor Argo CD, ApplicationSet, Flux, GitOps reconciliation, sync wavesreferences/conditional/observability-stacks.mdfor Prometheus Operator, ServiceMonitor, PodMonitor, OpenTelemetry, Loki, Grafana
Do not load multiple CRR files unless the task spans multiple detected platforms/tools.
4) Propose fix path with explicit risk controls
For each fix, include:
- why this addresses the failure mode
- what could still go wrong at deploy time or runtime
- guardrails (validation commands, policy checks, rollback path)
5) Generate implementation artifacts
When applicable, output:
- Kubernetes manifests (YAML with security contexts, resource limits, labels)
- Helm values/templates or Kustomize overlays
- NetworkPolicies, RBAC resources, PodDisruptionBudgets
- Policy rules (Kyverno/OPA) and admission controls
6) Validate before finalize
Always provide validation steps tailored to deployment method and risk tier:
kubectl apply --dry-run=serverorkubectl diffkubeconformfor schema validation against target cluster version- cross-resource consistency check (label/selector/port alignment)
- policy scan (PSS profile check, Kyverno/OPA audit) Never recommend direct production apply without reviewed diff and approval.
7) Output contract
Return:
- assumptions and cluster version floor
- selected failure mode(s)
- chosen remediation and tradeoffs
- validation/test plan
- rollback/recovery notes (rollout undo, revision history, data safety)