Agent Fleet Status
Check the health of multiple machines via SSH. Get a unified dashboard for distributed systems.
Description
Agent Fleet Status monitors the health of a fleet of machines — servers, Mac Minis, cloud instances, Raspberry Pis, or any SSH-accessible host. It checks uptime, disk, memory, CPU, running processes, cron jobs, and service health, then outputs a unified dashboard. Built for operators running distributed agent swarms, home labs, or multi-server deployments.
Activation
This skill activates when:
- The user asks about the status of their servers or machines
- The user wants to check if a service is running on a remote host
- The user mentions fleet health, server status, or system monitoring
- The user asks "are my agents running?" or "what's the status of my machines?"
Trigger phrases: "fleet status", "check my servers", "machine health", "are my agents running", "server dashboard", "system status", "check uptime", "what's running on"
Health Check Dimensions
1. Connectivity
- Can we reach the host via SSH?
- Latency (ms)
- Last successful connection
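The connectivity check can be sketched as a small timing wrapper. This is a minimal illustration, not the skill's actual implementation: `time_ms` is a hypothetical helper name, and it assumes GNU `date` (the `%N` nanosecond format is not available in the stock macOS `date`).

```shell
# time_ms CMD [ARGS...]: run a command, print elapsed wall-clock milliseconds,
# and return the command's own exit status. Assumes GNU date for %N.
time_ms() {
  local start end status
  start=$(date +%s%N)
  "$@" >/dev/null 2>&1
  status=$?
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
  return "$status"
}

# Usage (user@host is a placeholder):
#   if ms=$(time_ms ssh -o ConnectTimeout=5 -o BatchMode=yes user@host true); then
#     echo "| SSH | Connected (${ms}ms) | OK |"
#   else
#     echo "| SSH | Unreachable | OFFLINE |"
#   fi
```

The measured time covers the full SSH handshake plus the remote command, so it is an upper bound on network latency rather than a ping-style round trip.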
2. System Resources
| Check | Warning | Critical |
|---|---|---|
| Disk usage | >80% | >95% |
| Memory usage | >85% | >95% |
| CPU load (5min avg, as % of CPU core count) | >70% | >90% |
| Uptime | <1 day (recent reboot) | N/A |
| Swap usage | >50% | >80% |
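The thresholds in the table could be applied with a helper like the one below. Both function names are hypothetical. Because a load average is not itself a percentage, `load_pct` assumes one reasonable reading of the CPU row: the 5-minute load divided by the number of cores.

```shell
# classify VALUE WARN CRIT: print OK, WARN, or CRIT for an integer percentage.
# Comparisons are strict, matching the ">80%" style used in the table.
classify() {
  local value=$1 warn=$2 crit=$3
  if [ "$value" -gt "$crit" ]; then
    echo CRIT
  elif [ "$value" -gt "$warn" ]; then
    echo WARN
  else
    echo OK
  fi
}

# load_pct LOAD CORES: express a load average as a rounded percentage of cores.
# Interpreting the CPU thresholds this way is an assumption, not stated in the table.
load_pct() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%d\n", load / cores * 100 + 0.5 }'
}

# Usage: classify 82 80 95    (disk at 82% against the 80/95 thresholds)
#        classify "$(load_pct 3.6 4)" 70 90
```

The core count can be read with `nproc` on Linux or `sysctl -n hw.ncpu` on macOS.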
3. Process Health
- Are expected processes running? (by name or PID file)
- Zombie processes count
- Top 5 processes by CPU/memory
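As a sketch of the zombie count, the helper below reads one process state per line, in the form printed by `ps -A -o stat=`, and counts the states that begin with `Z`. The name `count_zombies` is hypothetical.

```shell
# count_zombies: read process state codes on stdin (one per line) and print
# how many start with Z, the state ps reports for zombie processes.
count_zombies() {
  awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Usage (user@host is a placeholder):
#   ssh user@host "ps -A -o stat=" | count_zombies
```

Matching on the state column alone avoids counting unrelated lines that merely contain a capital Z, such as a command name or the header row.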
4. Service Health
- HTTP endpoints responding? (status code + latency)
- Port checks (is the port open and accepting connections?)
- Cron job status (last run time, exit codes from logs)
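The endpoint and port checks might look like the sketch below. The function names are hypothetical, and the code assumes `curl`, `bash`, and the coreutils `timeout` command are available on the machine running the check; the URL, host, and port in the usage lines are placeholders.

```shell
# check_http URL: print "<status code> <total seconds>" for a single request.
# A status of 000 typically means no HTTP response arrived within 5 seconds.
check_http() {
  curl -s -o /dev/null -m 5 -w '%{http_code} %{time_total}\n' "$1"
}

# port_open HOST PORT: succeed if a TCP connection opens within 3 seconds.
# Relies on bash's /dev/tcp redirection, so it invokes bash explicitly.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage:
#   check_http "http://host:PORT/health"
#   port_open host PORT && echo "port open" || echo "port closed"
```

Both checks run from the monitoring machine, so they test reachability over the network rather than from inside the host.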
5. Agent-Specific
- OpenClaw gateway status
- Ollama model loaded
- Custom agent processes
- Log file freshness (is the agent producing output?)
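Log freshness can be tested for a single file with `find -mmin`. The helper name `log_is_fresh` and the log path in the usage line are hypothetical.

```shell
# log_is_fresh FILE MINUTES: succeed if FILE was modified within the last
# MINUTES minutes. -maxdepth 0 restricts find to the named path itself.
log_is_fresh() {
  local file=$1 minutes=$2
  [ -n "$(find "$file" -maxdepth 0 -mmin "-$minutes" 2>/dev/null)" ]
}

# Usage, on the monitored host (the path is a placeholder):
#   log_is_fresh /var/log/my-agent.log 60 || echo "WARN: no agent output in 60 minutes"
```

A missing file is treated the same as a stale one, which is usually the desired behavior for an agent that should be writing logs.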
Instructions
When asked to check fleet status, generate commands and/or output in this format:
# Fleet Status Dashboard
## Checked at: [timestamp]
---
### [HOSTNAME] — [IP] — [STATUS: OK / WARN / CRITICAL / OFFLINE]
| Check | Value | Status |
|-------|-------|--------|
| SSH | Connected (XXms) | OK |
| Uptime | XX days | OK |
| Disk | XX% of XXG used | [OK/WARN/CRIT] |
| Memory | XX% of XXG used | [OK/WARN/CRIT] |
| CPU Load | X.XX (5m avg) | [OK/WARN/CRIT] |
| Swap | XX% used | [OK/WARN/CRIT] |
**Running Services**:
- [service]: PID XXXX, up Xh, CPU X%, MEM X%
- [service]: PID XXXX, up Xh, CPU X%, MEM X%
**Cron Jobs**: XX active, last run [time], [X failures in 24h]
**Alerts**:
- [Any warnings or critical issues]
---
### Fleet Summary
| Host | Status | Disk | Mem | CPU | Services | Alerts |
|------|--------|------|-----|-----|----------|--------|
| [host1] | OK | 45% | 62% | 0.8 | 5/5 | 0 |
| [host2] | WARN | 82% | 71% | 1.2 | 4/5 | 1 |
| [host3] | OFFLINE | — | — | — | — | SSH FAIL |
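One way to produce the Fleet Summary table is to collect one tab-separated record per host and render them together. This is a sketch: `render_summary` is a hypothetical name, and the seven-field record layout is an assumption chosen to mirror the table columns.

```shell
# render_summary: read tab-separated records on stdin, one per host, with the
# fields host, status, disk, mem, cpu, services, alerts; print a markdown table.
render_summary() {
  echo '| Host | Status | Disk | Mem | CPU | Services | Alerts |'
  echo '|------|--------|------|-----|-----|----------|--------|'
  awk -F '\t' '{ printf "| %s | %s | %s | %s | %s | %s | %s |\n", $1, $2, $3, $4, $5, $6, $7 }'
}

# Usage:
#   printf 'host1\tOK\t45%%\t62%%\t0.8\t5/5\t0\n' | render_summary
```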
SSH Commands Used
The skill generates and executes these commands per host:
# Connectivity
ssh -o ConnectTimeout=5 -o BatchMode=yes user@host "echo ok"
# System resources
ssh user@host "
uptime;
df -h / | tail -1;
free -m | grep Mem;
cat /proc/loadavg;
swapon --show --bytes 2>/dev/null
"
# Process health
ssh user@host "
ps aux --sort=-%mem | head -6;
ps -A -o stat= | grep -c '^Z' # zombie count: match the state column only
"
# Service checks
ssh user@host "
pgrep -la ollama;
pgrep -la openclaw;
pgrep -la node;
systemctl is-active [service] 2>/dev/null
"
# Cron status
ssh user@host "crontab -l 2>/dev/null | grep -Ev '^[[:space:]]*(#|\$)' | wc -l"
# Log freshness
ssh user@host "find /var/log -name '*.log' -mmin -60 | head -5"
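The raw output of the resource commands still has to be reduced to percentages for the dashboard. The sketch below shows one way to do that for the `df` and `free` lines; the function names are hypothetical, and the column positions assume the usual `df -h` and procps `free -m` layouts.

```shell
# disk_pct: read the data line from `df -h / | tail -1` and print the
# use percentage (fifth column) without the % sign.
disk_pct() {
  awk '{ gsub(/%/, "", $5); print $5 }'
}

# mem_pct: read the "Mem:" line from `free -m` and print used/total as a
# rounded percentage (column 2 is total, column 3 is used).
mem_pct() {
  awk '$2 > 0 { printf "%d\n", $3 / $2 * 100 + 0.5 }'
}

# Usage (user@host is a placeholder):
#   ssh user@host "df -h / | tail -1" | disk_pct
#   ssh user@host "free -m | grep Mem" | mem_pct
```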
For macOS hosts, adjust commands:
# macOS disk
ssh user@host "df -h / | tail -1"
# macOS memory (no free command)
ssh user@host "vm_stat | head -5; sysctl hw.memsize"
# macOS processes
ssh user@host "ps aux -r | head -6"
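Since macOS has no `free`, a memory percentage has to be derived from the `vm_stat` page counts and `hw.memsize`. The sketch below is a rough approximation under a stated assumption: it treats free, inactive, and speculative pages as available and everything else as used. The name `macos_mem_pct` is hypothetical.

```shell
# macos_mem_pct: read the combined output of `vm_stat | head -5` and
# `sysctl hw.memsize` on stdin and print an approximate used-memory percentage.
# Assumption: free + inactive + speculative pages count as available.
macos_mem_pct() {
  awk '
    /page size of/ { for (i = 1; i < NF; i++) if ($i == "of") page = $(i + 1) }
    /^Pages (free|inactive|speculative):/ { avail += $NF }
    /^hw.memsize:/ { total = $2 }
    END {
      if (page > 0 && total > 0)
        printf "%d\n", 100 - (avail * page / total * 100) + 0.5
    }
  '
}

# Usage (user@host is a placeholder):
#   ssh user@host "vm_stat | head -5; sysctl hw.memsize" | macos_mem_pct
```

Activity Monitor computes memory pressure differently, so expect this figure to differ from what the macOS UI shows.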
Fleet Configuration
Define your fleet as a simple list:
fleet:
  - name: Omni
    host: localhost
    type: linux
    expected_services: [ollama, node]
  - name: BMO
    host: 192.168.1.98
    user: operator
    type: macos
    expected_services: [openclaw, ollama, node, n8n]
  - name: OCI
    host: 192.168.1.92
    user: macmini
    type: macos
    expected_services: [openclaw]
  - name: SailorsBot1
    host: 192.168.1.99
    user: operator
    type: macos
    expected_services: [repflow]
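Each host's `expected_services` list can be compared against what is actually running. In the sketch below, `missing_services` is a hypothetical helper; it assumes the running process names arrive on stdin, one per line, and the expected names are passed as arguments.

```shell
# missing_services EXPECTED...: read running process names on stdin (one per
# line) and print each expected name that is absent. Matching is exact.
missing_services() {
  local running svc
  running=$(cat)
  for svc in "$@"; do
    printf '%s\n' "$running" | grep -qxF -- "$svc" || echo "$svc"
  done
}

# Usage for the BMO entry (ps -A -o comm= prints one command name per line;
# macOS may print full paths, so some normalization could be needed):
#   ssh operator@192.168.1.98 "ps -A -o comm=" | missing_services openclaw ollama node n8n
```

The number of names printed, subtracted from the number expected, gives the "4/5" style figure in the Services column.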
Alerting Logic
Severity escalation:
- INFO: All systems green. No action needed.
- WARN: One or more hosts have warnings (disk >80%, high memory, service restart detected). Review within 24 hours.
- CRITICAL: A host has critical resource usage or a key service is down. Act within 1 hour.
- OFFLINE: A host is unreachable via SSH. Investigate immediately — could be network, crash, or power.
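The escalation order above can be reduced to a single fleet-level severity by taking the worst host status. This is a sketch: `fleet_severity` is a hypothetical name, and ranking OFFLINE above CRITICAL is inferred from the response times listed (immediately versus within 1 hour).

```shell
# fleet_severity STATUS...: given per-host statuses (OK, WARN, CRITICAL,
# OFFLINE), print the fleet-level severity. All OK maps to INFO.
fleet_severity() {
  local worst=INFO status
  for status in "$@"; do
    case "$status" in
      OFFLINE)
        worst=OFFLINE
        ;;
      CRITICAL)
        if [ "$worst" != OFFLINE ]; then worst=CRITICAL; fi
        ;;
      WARN)
        if [ "$worst" = INFO ]; then worst=WARN; fi
        ;;
    esac
  done
  echo "$worst"
}

# Usage: fleet_severity OK WARN OFFLINE
```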
Example
Input
Check status of BMO (192.168.1.98, operator, macOS) and OCI (192.168.1.92, macmini, macOS).
Output
(Generates SSH commands, executes them, and produces the dashboard table showing both machines' disk, memory, CPU, running services, cron job count, and any alerts.)
Built by KOINO Capital.