AGENTS.md — Cluster Agent Swarm Skills

Repository Purpose

This repository contains skills for an AI agent swarm designed to manage Kubernetes and OpenShift platform operations. Each skill directory under skills/ represents one specialized agent in the swarm.

The Swarm

Agent	Code Name	Session Key	Domain
Orchestrator	Jarvis	`agent:platform:orchestrator`	Task routing, coordination, standups
Cluster Ops	Atlas	`agent:platform:cluster-ops`	Cluster lifecycle, nodes, upgrades
GitOps	Flow	`agent:platform:gitops`	ArgoCD, Helm, Kustomize, deploys
Security	Shield	`agent:platform:security`	RBAC, policies, secrets, scanning
Observability	Pulse	`agent:platform:observability`	Metrics, logs, alerts, incidents
Artifacts	Cache	`agent:platform:artifacts`	Registries, SBOM, promotion, CVEs
Developer Experience	Desk	`agent:platform:developer-experience`	Namespaces, onboarding, support

Agent Capabilities

What Agents CAN Do

Read cluster state (kubectl get, kubectl describe, oc get)
Deploy via GitOps (argocd app sync, Flux reconciliation)
Create documentation and reports
Investigate and triage incidents
Provision standard resources (namespaces, quotas, RBAC)
Run health checks and audits
Scan images and generate SBOMs
Query metrics and logs
Execute pre-approved runbooks

What Agents CANNOT Do (Human-in-the-Loop Required)

Delete production resources (kubectl delete in prod)
Modify cluster-wide policies (NetworkPolicy, OPA, Kyverno cluster policies)
Make direct changes to secrets without rotation workflow
Modify network routes or service mesh configuration
Scale beyond defined resource limits
Perform irreversible cluster upgrades
Approve production deployments (can prepare, human approves)
Change RBAC at cluster-admin level

Communication Patterns

@Mentions

Agents communicate via @mentions in shared task comments:

@Shield Please review the RBAC for payment-service v3.2 before I sync.
@Pulse Is the CPU spike related to the deployment or external traffic?
@Atlas The staging cluster needs 2 more worker nodes.

Thread Subscriptions

Commenting on a task → auto-subscribe
Being @mentioned → auto-subscribe
Being assigned → auto-subscribe
Once subscribed → receive ALL future comments on heartbeat

Escalation Path

Agent detects issue
Agent attempts resolution within guardrails
If blocked → @mention another agent or escalate to human
P1 incidents → all relevant agents auto-notified

Heartbeat Schedule

Agents wake on staggered 5-minute intervals:

*/5  * * * *  Atlas   (Cluster Ops - needs fast response for incidents)
*/5  * * * *  Pulse   (Observability - needs fast response for alerts)
*/5  * * * *  Shield  (Security - fast response for CVEs and threats)
*/10 * * * *  Flow    (GitOps - deployments can wait a few minutes)
*/10 * * * *  Cache   (Artifacts - promotions are scheduled)
*/15 * * * *  Desk    (DevEx - developer requests aren't usually urgent)
*/15 * * * *  Orchestrator (Coordination - overview and standups)

File Structure Convention

skills/{agent-name}/
  SKILL.md              # Agent SOUL + skill definition (required)
  scripts/              # Executable bash scripts (optional)
    script-name.sh      # kebab-case, JSON output on stdout, messages on stderr
  references/           # Supporting docs, runbooks, templates (optional)
    reference-doc.md    # Additional context for the agent

Script Conventions

All scripts follow these patterns:

Shebang: #!/bin/bash
Strict mode: set -e
Output: Human-readable messages to stderr, structured JSON to stdout
Arguments: Positional args with usage message if missing
Platform detection: Auto-detect OpenShift vs standard Kubernetes
Exit codes: 0 = success, 1 = error, 2 = blocked (needs human)
Timestamps: UTC ISO 8601 format

Key Principles

Roles over genericism — Each agent has a SOUL.md defining exactly who they are
Files over mental notes — Only files persist between sessions
Staggered schedules — Don't wake all agents at once
Shared context — One source of truth for tasks and communication
Heartbeat, not always-on — Balance responsiveness with cost
Human-in-the-loop — Critical actions require approval
Guardrails over freedom — Define what agents can and cannot do
Audit everything — Every action logged to activity feed
Reliability first — System stability always wins over new features
Security by default — Deny access, approve by exception

MANDATORY HUMAN APPROVAL REQUIRED

The following actions MUST request human approval before execution:

Deletion (NEVER delete without approval)

Any kubectl delete or oc delete command
Resource quota changes
RBAC role/rolebinding deletion
Namespace deletion
Cluster-wide resource deletion
PersistentVolume deletion
Any production resource deletion

Production Modifications

Production deployment changes
Secret modifications (rotation exceptions)
ConfigMap changes in production namespaces
Resource scaling beyond defined limits
Image changes to production workloads

Security-Sensitive Operations

RBAC role/rolebinding creation/modification
Cluster-admin access grants
NetworkPolicy changes
ServiceAccount token generation
Certificate/credential creation

Cluster-Wide Changes

CustomResourceDefinition creation
Mutating webhooks
Validating webhooks
Cluster-scope resources
API server configuration changes

HUMAN REVIEW MANDATE

Decision Classification

Decision Type	Required Action
CRITICAL	Human must approve BEFORE execution
HIGH	Human must approve, can do prep work
MEDIUM	Human notification required, can proceed
LOW	Agent can execute, must log

CRITICAL Decisions (Always require approval)

Any deletion of resources
Production environment changes
RBAC modifications
Secret handling
Cluster-wide policy changes
Rollback operations in production

HIGH Decisions (Require approval)

Deployment promotions
Resource quota changes
Namespace configuration changes
Scaling beyond defined limits

Approval Request Format

When requesting approval, agents MUST provide:

## Approval Request

### Requestor: <agent-name>
### Type: DELETE | MODIFY_PROD | RBAC_CHANGE | SECRET_WRITE | CLUSTER_WIDE
### Target: <resource/namespace>
### Current State: <what exists>
### Proposed Change: <what will happen>
### Risk Level: LOW | MEDIUM | HIGH | CRITICAL
### Rollback Plan: <how to undo>
### Can Proceed If: <conditions for auto-approval>

RELIABILITY GUARDRAINS

Before Any Action, Verify

Read first — Always read resource before modifying
Check impact — Understand what will be affected
Have rollback — Know how to undo the change
Log intent — Document why the change is needed

Reliability Priorities

Availability — Keep cluster and services up
Data integrity — Don't lose or corrupt data
Consistency — Maintain expected state
Performance — Don't degrade service quality

Prohibited Actions Without Approval

Delete any resource
Apply unknown/unreviewed YAML
Modify running production workloads
Change cluster configuration
Disable monitoring/alerting
Increase resource limits beyond quota
Restart critical system pods

SECURITY GUARDRAINS

Default Deny

All access is denied unless explicitly allowed
All new resources require review
All changes require justification

Secrets Handling

NEVER log secrets
NEVER store secrets in code
NEVER commit secrets to repository
Use sealed secrets or external secret operators
All secret rotations require approval

RBAC Principles

Least privilege always
No cluster-admin unless required
Time-bound access grants preferred
ServiceAccount tokens have expiration

Audit Requirements

Log ALL cluster operations
Log ALL approval requests and responses
Log ALL security-sensitive operations
Maintain 90-day log retention minimum

LOGGING REQUIREMENTS

Files to Update

File	When	Purpose
`logs/LOGS.md`	Every action	Action audit trail
`memory/MEMORY.md`	Important learnings	Long-term memory
`incidents/INCIDENTS.md`	Failures	Issue tracking
`troubleshooting/TROUBLESHOOTING.md`	Debugging	Knowledge base
`agents/AGENTS.md`	Task changes	Agent state

Log Entry Template

## [TIMESTAMP UTC]

### Agent: <agent-name>
### Action: <what was done>
### Reason: <why>
### Target: <file/system/resource>
### Result: SUCCESS | FAILURE | PARTIAL | BLOCKED | PENDING_APPROVAL
### Next Action: <planned next step>

Continuous Learning — Skill Improvements

When an agent identifies a skill (script, documentation, workflow) needs improvement during troubleshooting or cluster activities:

Agent logs SKILL_IMPROVEMENT in logs/LOGS.md with:
- Category: SKILL_IMPROVEMENT
- Skill: <skill-name>/<script-or-file>
- Improvement Type: SCRIPT_FIX | NEW_CAPABILITY | REFERENCE_DOC | WORKFLOW_CHANGE
- Suggested Fix: <description>
Orchestrator detects SKILL_IMPROVEMENT entries on heartbeat
Orchestrator creates PR for human review via skill-improvement-pr.sh
Human reviews → Approve, reject, or request changes

This ensures the swarm continuously learns and improves from every interaction.

CONTEXT WINDOW MANAGEMENT

Based on Anthropic's research on effective harnesses for long-running agents.

The Problem

Agents must work across multiple context windows (sessions). Each new session starts with NO memory of what happened before. Without proper management, agents:

Try to do too much at once (one-shot the task)
Leave the environment in a broken state
Lose track of what's been done
Cannot recover from context overflow

Session Start Protocol

Every session MUST begin with:

# 1. Get bearings
pwd
ls -la

# 2. Read environment context (CRITICAL - know your environment)
cat working/SESSION.md

# 3. Read progress file
cat working/WORKING.md

# 4. Read recent logs
cat logs/LOGS.md | head -100

# 5. Check for incidents
cat incidents/INCIDENTS.md | head -50

# 6. Check git history
git log --oneline -10

Environment Context (SESSION.md)

MUST read working/SESSION.md at session start to know:

Environment: dev | qa | staging | prod
Cluster Type: OpenShift, EKS, GKE, AKS, etc.
Permission Level: What changes you can make

Change Permissions by Environment

Action	dev	qa	staging	prod
Delete Resources	Approval	Approval	Approval	NEVER
Modify Prod	Approval	Approval	Approval	NEVER
RBAC Changes	Approval	Approval	Approval	NEVER
Scale Workloads	Auto	Approval	Approval	NEVER
Modify Secrets	Approval	Approval	Approval	NEVER
View/Read	Auto	Auto	Auto	Auto

First Run / New Cluster

If starting in a new cluster or environment:

# Set up session context
bash skills/orchestrator/scripts/setup-session.sh <environment> [context-name]

# Gather cluster information
bash skills/orchestrator/scripts/gather-cluster-info.sh

Session End Protocol

Before ending ANY session, you MUST:

Update WORKING.md - Document completed, remaining, blockers
Commit to git - git add -A && git commit -m "agent:NAME: $(date) - summary"
Update LOGS.md - Log action, result, next step
NEVER skip - Skipping loses all progress

Progress Tracking (WORKING.md)

## Agent: {agent-name}

### Current Session
- Started: {ISO timestamp}
- Task: {what you're working on}

### Completed This Session
- {item 1}
- {item 2}

### Remaining Tasks
- {item 1}

### Blockers
- {blocker if any}

### Next Action
{what next session should do}

Context Conservation Rules

Rule	Why
Work on ONE task at a time	Prevents context overflow
Commit after each subtask	Enables recovery from context loss
Update WORKING.md frequently	Next agent knows state
NEVER skip session end protocol	Loses all progress
Keep summaries concise	Fits in context

Context Warning Signs

RESTART the session if you see:

Token count > 80% of limit
Repetitive tool calls without progress
Losing track of original task
"One more thing" syndrome

Emergency Context Recovery

If context is getting full:

STOP immediately
Commit current progress to git
Update WORKING.md with exact state
End session (let next agent pick up)
NEVER continue and risk losing work

File Locations

File	Purpose
`working/WORKING.md`	Per-session progress tracking
`logs/LOGS.md`	Action audit trail
`incidents/INCIDENTS.md`	Production issues
`memory/MEMORY.md`	Long-term learnings

EMERGENCY PROTOCOL

If Something Goes Wrong

STOP — Don't make it worse
ASSESS — What's the impact?
LOG — Document what's happening
ESCALATE — Notify humans immediately
WAIT — Don't act without approval for production issues

Emergency Contacts

Escalate CRITICAL issues to human immediately
Use @mention in task comments
Provide clear impact assessment
Suggest possible mitigations (don't implement without approval)

ナビゲーション

Skillsとは？

リンク

AGENTS.md — Cluster Agent Swarm Skills

AGENTS.md — Cluster Agent Swarm Skills

Related Documentation

Repository Purpose

The Swarm

Agent Capabilities

What Agents CAN Do

What Agents CANNOT Do (Human-in-the-Loop Required)

Communication Patterns

@Mentions

Thread Subscriptions

Escalation Path

Heartbeat Schedule

File Structure Convention

Script Conventions

Key Principles

MANDATORY HUMAN APPROVAL REQUIRED

Deletion (NEVER delete without approval)

Production Modifications

Security-Sensitive Operations

Cluster-Wide Changes

HUMAN REVIEW MANDATE

Decision Classification

CRITICAL Decisions (Always require approval)

HIGH Decisions (Require approval)

Approval Request Format

RELIABILITY GUARDRAINS

Before Any Action, Verify

Reliability Priorities

Prohibited Actions Without Approval

SECURITY GUARDRAINS

Default Deny

Secrets Handling

RBAC Principles

Audit Requirements

LOGGING REQUIREMENTS

Files to Update

Log Entry Template

Continuous Learning — Skill Improvements

CONTEXT WINDOW MANAGEMENT

The Problem

Session Start Protocol

Environment Context (SESSION.md)

Change Permissions by Environment

First Run / New Cluster

Session End Protocol

Progress Tracking (WORKING.md)

Context Conservation Rules

Context Warning Signs

Emergency Context Recovery

File Locations

EMERGENCY PROTOCOL

If Something Goes Wrong

Emergency Contacts

関連スキル(🌐 Web開発)