---
name: regression-e2e
description: Triage frame for end-to-end (E2E) test, smoke test, regression, and fault-injection validation failures. Use when the user reports an E2E/smoke/regression run is broken, a fault-injection flow fails mid-pipeline, a previously-green test is now flaky, or asks to "triage", "debug the E2E", "why is the regression failing", or "help me figure out what broke the smoke test". Produces a short hypothesis list and next inspection surfaces rather than a full runbook. Trigger words — e2e, end-to-end, regression, smoke test, fault injection validation, flaky test, test broke.
---
# Regression / E2E Triage
See also: the `aegisctl` skill for general CLI composition (NDJSON streaming, name-not-id filters). Specific `aegisctl …` invocations below are illustrative — `aegisctl <noun> [verb] --help` is the source of truth and supersedes anything that drifts here.
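A minimal discovery sketch for staying aligned with that source of truth (the nouns and verbs shown are taken from examples later in this file and may drift — trust whatever `--help` actually lists):

```bash
# Sketch only: walk the help tree instead of trusting memorized invocations.
aegisctl --help                  # top-level nouns
aegisctl inject --help           # verbs under a noun
aegisctl inject guided --help    # authoritative flags for a specific verb
```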
Provide a short triage frame for E2E failures. Focus on likely problem classes and next inspection directions rather than detailed runbooks.
## Triage Order
- Confirm the test contract still matches product behavior.
  - Stale expected terminal events, outdated fixtures, renamed statuses, old snapshots.
- Confirm the environment is actually ready.
  - Service health, cluster dependencies, seeded config, storage, queues, required ports.
- Confirm version alignment across the stack.
  - Local CLI / test harness, deployed backend, worker images, schemas, container tags — look for drift.
- Confirm auth and config wiring.
  - Tokens, permissions, endpoints, namespaces, feature flags, project/tenant resolution.
- Confirm rerun semantics.
  - Dedupe, idempotency, cleanup, leftover state that blocks identical reruns.
- Confirm the data path completes.
  - Fault creation, task execution, artifact generation, uploads, persistence, downstream reads.
- Confirm observability is sufficient.
  - Follow one trace / job / execution end-to-end across logs, events, and DB state (see the trace-following sketch after this list).
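For the last two checks, a sketch of following a single execution across the data path. The ClickHouse table, Redis key pattern, and deployment name are assumptions to adapt, not this project's confirmed layout:

```bash
# Sketch: follow one execution end-to-end. Table, key-pattern, and
# deployment names below are assumptions — substitute your own.
TRACE_ID=...   # from the submit response or the failing test's log

# 1. Did spans land in ClickHouse? (otel_traces is the OTLP exporter's
#    default table name; the `otel` database is an assumption)
clickhouse-client -q "SELECT count() FROM otel.otel_traces WHERE TraceId = '${TRACE_ID}'"

# 2. Is the task still queued in Redis, or was it consumed?
redis-cli --scan --pattern "*${TRACE_ID}*"

# 3. What did the worker log at the stage where the trace stopped?
kubectl logs -n exp deploy/runtime-worker --since=1h | grep "${TRACE_ID}"
```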
## Common Problem Buckets
- Stale contract — product is healthy, but the test asserts an old event, payload, or sequence.
- Environment drift — docs, fixtures, or assumptions no longer match the real local or CI environment.
- Version drift — wrong binary/image is running, or a remote pull overrides the intended local build.
- Auth or config mismatch — the flow starts, but later steps fail because required creds, flags, or routing are missing.
- Timing / race conditions — async consumers, startup latency, or eventual consistency make the test flaky.
- Replay / dedupe behavior — a new run is suppressed because an older valid run or cached state still exists.
- Missing result data — workflow appears complete, but artifacts, uploads, or DB rows are incomplete/absent.
- Weak observability — user-facing error is generic; the real blocker is only visible in trace or runtime logs.
- Selector/label mismatch from operators — app-label-based selectors return a partial set of services because operator-generated pods (Coherence, Kafka, Postgres controllers) carry different labels than plain Deployments. Pattern to watch: "everything installs healthy, but the selector only sees N of M services." (See the label-check sketch after this list.)
- Bootstrap-edge gaps — seed path populates some stores (DB, configmap, etc.) but not the one the runtime actually reads on first boot. Symptom: the data is "there" somewhere, but the process that needs it gets `not found` the first time and works after a manual reconciliation.
- Duplicate-submission suppression — `aegisctl regression` silently refuses to re-submit the same (app, namespace, chaos_type, duration) within a cooldown; the symptom is `duplicate submission suppressed (batches [0])`. Vary `spec[].app` or `duration` to force a fresh hash.
- App name filtered by `serviceendpoints` — backend pre-filters live pod labels against the `serviceendpoints` metadata store before returning them to guided-resolve. A pod WITH the label (e.g. `currency` in otel-demo) still errors `app ... not found; available apps: accounting, ad, cart, ...` if it isn't in the endpoint map. Pick an app that IS in the "available apps" list the error prints.
- Empty parquet with `.parquet` suffix required — `RCABENCH_OPTIONAL_EMPTY_PARQUETS` values match on the full filename INCLUDING `.parquet`. Leaving the extension off silently fails to opt in.
- Namespace landed on the wrong slot — submit asked for `<sys>14` but the actual fault ran in `<sys>0`. On a current backend this should fail loudly instead; if it doesn't, the running binary is too old. Rebuild and redeploy. As a workaround, re-submit via `aegisctl inject guided --apply --auto` and let the server pick a slot.
- Single HTTP injection rejected as a duplicate — submit returns a `service 'X' at positions 0 and 0` warning and zero items. Self-loops (e.g. `GET /` against the same app) used to trip this. On a current backend it's gone; if it appears, the backend predates the fix or the same regression has resurfaced. Workaround: change the route or method to break the self-loop.
- Auto-allocate fails with "pool exhausted" — every namespace in the system's pool is either locked or empty. Two clean fixes: pass `--allow-bootstrap` so the server extends the pool with a fresh slot, or pre-deploy a workload via `aegisctl inject guided --install --namespace <sys>N` to give an existing slot something to inject against.
- Asked for a specific namespace, got "not found in current configuration" — that namespace isn't registered in the system's pool yet. The easiest fix is to re-submit through `aegisctl inject guided --apply --namespace <sys>N` (current backends register it on submit). If that doesn't help, the namespace was created out-of-band; expand the system's count via the systems admin API and retry.
- Silent trace drop, one system only — one benchmark (typically `ts*` or `tea*`) shows zero spans in ClickHouse for the last hour while peer systems flow normally. The root cause is almost always the OTel deployment-collector HPA pegged at `maxReplicas` with `memory_limiter` returning `UNAVAILABLE: data refused due to high memory usage`. Java agents (TT, teastore) don't retry on UNAVAILABLE, so spans are dropped on the floor; Go/Node SDKs use BatchSpanProcessor plus retry, so they squeeze through. Check `kubectl -n monitoring get hpa opentelemetry-kube-stack-deployment-collector`: if `REPLICAS = MAXPODS` and memory > 100% of target, raise `autoscaler.maxReplicas` and `resources.limits.memory` in `AegisLab/manifests/byte-cluster/otel-kube-stack.values.yaml` (see the HPA check after this list).
- Worker rollout collides with submission — restarting `runtime-worker`/`api-gateway` while traces are mid-pipeline silently orphans them (no error log, Redis queues empty, the trace just stops at whatever stage the worker was handling). Pattern: traces hang at the same stage cluster-wide right after a rollout. Rule of thumb: only redeploy aegis between rounds, after `terminals_round<N>.tsv` is fully reaped.
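For the operator label-mismatch bucket, a sketch that makes the partial-selector problem visible. The two label keys shown are common conventions, not confirmed for every operator in play:

```bash
# Sketch: expose the label mismatch behind "selector only sees N of M".
NS=otel-demo   # example namespace

# What an app-label-based selector sees (plain Deployments usually carry `app`):
kubectl -n "$NS" get pods -l app --show-labels

# What operator-managed pods often carry instead:
kubectl -n "$NS" get pods -l app.kubernetes.io/name --show-labels

# Dump every pod's label set to diff the keys actually in use:
kubectl -n "$NS" get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels}{"\n"}{end}'
```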
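For the silent-trace-drop bucket, a sketch of the confirmation steps before touching limits. The collector pod label selector is an assumption based on the OpenTelemetry operator's usual labels:

```bash
# Sketch: confirm the collector HPA is actually pegged.
kubectl -n monitoring get hpa opentelemetry-kube-stack-deployment-collector
# Pegged = REPLICAS equals MAXPODS and the memory metric sits above its target.

# Corroborate with memory_limiter refusals in the collector logs
# (label selector is an assumption for operator-managed collectors):
kubectl -n monitoring logs -l app.kubernetes.io/component=opentelemetry-collector \
  --tail=200 | grep -i "data refused due to high memory usage"

# If both hold, raise autoscaler.maxReplicas and resources.limits.memory in
# AegisLab/manifests/byte-cluster/otel-kube-stack.values.yaml and redeploy.
```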
## Output Shape
- Name the most likely blocker classes first.
- Separate contract problems from product bugs, environment bugs, and tooling bugs.
- Point to the next inspection surface instead of writing a full playbook.
- Prefer concise hypotheses like "check X because Y" over exhaustive detail.
## Project-specific inspection surfaces (aegis)
When triaging in this repo, common concrete surfaces worth naming early:
- `aegis/docs/troubleshooting/{datapack-schema,app-label-key,benchmark-integration-playbook}.md` — consolidated E2E pitfalls.
- `aegis/docs/deployment/kind/otel-collector-{cfg,rbac,externalname}.yaml` — kind-profile collector manifests (k8sattributes + upsert + 3-signal pipelines).
- `pkg/guidedcli` — guided-first inject pipeline; `/translate` and `GET /metadata` are 410.
- etcd `injection.system.*` — runtime source of truth for injection config (not YAML).
- ClickHouse OTLP traces + Redis task keys — the usual "did the data path complete" inspection.
- Per-stage failure triage: `datapack.build.failed` → inspect the job pod logs (`kubectl logs -n exp <task-uuid>-xxxx`) for `UNKNOWN_TABLE` (collector missing a pipeline), `Parquet file has no data rows: X.parquet` (wrong env-var filename or missing resource enrichment), or `No such file or directory: /data/drain_template/*.bin` (initDrainTemplate disabled but a detector algo still wired in). See the log-scan sketch below.
- `--app-label-key` must match the workload's actual label key (otel-demo uses `app.kubernetes.io/name`, not the default `app`). A mismatch surfaces as "app not found; available apps: <subset>" at backend submit.
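A sketch of that per-stage log scan. It assumes exactly one pod name matches the task UUID, per the `<task-uuid>-xxxx` convention above:

```bash
# Sketch: scan a failed datapack.build job pod for its three known signatures.
TASK_UUID=...   # from the failed task record

POD=$(kubectl -n exp get pods -o name | grep "${TASK_UUID}" | head -n1)
kubectl -n exp logs "${POD}" --tail=1000 \
  | grep -E 'UNKNOWN_TABLE|Parquet file has no data rows|No such file or directory: /data/drain_template'
```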