---
name: regression-e2e
description: Triage frame for end-to-end (E2E) test, smoke test, regression, and fault-injection validation failures. Use when the user reports an E2E/smoke/regression run is broken, a fault-injection flow fails mid-pipeline, a previously-green test is now flaky, or asks to "triage", "debug the E2E", "why is the regression failing", or "help me figure out what broke the smoke test". Produces a short hypothesis list and next inspection surfaces rather than a full runbook. Trigger words — e2e, end-to-end, regression, smoke test, fault injection validation, flaky test, test broke.
---
# Regression / E2E Triage
See also: the `aegisctl` skill for general CLI composition (NDJSON streaming, name-not-id filters). Specific `aegisctl …` invocations below are illustrative — `aegisctl <noun> [verb] --help` is the source of truth and supersedes anything that drifts here.
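A minimal discovery sketch for staying aligned with that source of truth (the nouns and verbs shown are taken from examples later in this file and may drift — trust whatever `--help` actually lists):

```bash
# Sketch only: walk the help tree instead of trusting memorized invocations.
aegisctl --help                  # top-level nouns
aegisctl inject --help           # verbs under a noun
aegisctl inject guided --help    # authoritative flags for a specific verb
```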
Provide a short triage frame for E2E failures. Focus on likely problem classes and next inspection directions rather than detailed runbooks.
## Triage Order
- Confirm the test contract still matches product behavior.
  - Stale expected terminal events, outdated fixtures, renamed statuses, old snapshots.
- Confirm the environment is actually ready.
  - Service health, cluster dependencies, seeded config, storage, queues, required ports.
- Confirm version alignment across the stack.
  - Local CLI / test harness, deployed backend, worker images, schemas, container tags — look for drift.
- Confirm auth and config wiring.
  - Tokens, permissions, endpoints, namespaces, feature flags, project/tenant resolution.
- Confirm rerun semantics.
  - Dedupe, idempotency, cleanup, leftover state that blocks identical reruns.
- Confirm the data path completes.
  - Fault creation, task execution, artifact generation, uploads, persistence, downstream reads.
- Confirm observability is sufficient.
  - Follow one trace / job / execution end-to-end across logs, events, and DB state (see the trace-following sketch after this list).
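For the last two checks, a sketch of following a single execution across the data path. The ClickHouse table, Redis key pattern, and deployment name are assumptions to adapt, not this project's confirmed layout:

```bash
# Sketch: follow one execution end-to-end. Table, key-pattern, and
# deployment names below are assumptions — substitute your own.
TRACE_ID=...   # from the submit response or the failing test's log

# 1. Did spans land in ClickHouse? (otel_traces is the OTLP exporter's
#    default table name; the `otel` database is an assumption)
clickhouse-client -q "SELECT count() FROM otel.otel_traces WHERE TraceId = '${TRACE_ID}'"

# 2. Is the task still queued in Redis, or was it consumed?
redis-cli --scan --pattern "*${TRACE_ID}*"

# 3. What did the worker log at the stage where the trace stopped?
kubectl logs -n exp deploy/runtime-worker --since=1h | grep "${TRACE_ID}"
```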
## Common Problem Buckets
- Stale contract — product is healthy, but the test asserts an old event, payload, or sequence.
- Environment drift — docs, fixtures, or assumptions no longer match the real local or CI environment.
- Version drift — wrong binary/image is running, or a remote pull overrides the intended local build.
- Auth or config mismatch — the flow starts, but later steps fail because required creds, flags, or routing are missing.
- Timing / race conditions — async consumers, startup latency, or eventual consistency make the test flaky.
- Replay / dedupe behavior — a new run is suppressed because an older valid run or cached state still exists.
- Missing result data — workflow appears complete, but artifacts, uploads, or DB rows are incomplete/absent.
- Weak observability — user-facing error is generic; the real blocker is only visible in trace or runtime logs.
- Selector/label mismatch from operators — app-label-based selectors return a partial set of services because operator-generated pods (Coherence, Kafka, Postgres controllers) carry different labels than plain Deployments. Pattern to watch: "everything installs healthy, but the selector only sees N of M services." (See the label-check sketch after this list.)
- Bootstrap-edge gaps — seed path populates some stores (DB, configmap, etc.) but not the one the runtime actually reads on first boot. Symptom: the data is "there" somewhere, but the process that needs it gets `not found` the first time and works after a manual reconciliation.
- Duplicate-submission suppression — `aegisctl regression` silently refuses to re-submit the same (app, namespace, chaos_type, duration) within a cooldown; the symptom is `duplicate submission suppressed (batches [0])`. Vary `spec[].app` or `duration` to force a fresh hash.
- App name filtered by `serviceendpoints` — backend pre-filters live pod labels against the `serviceendpoints` metadata store before returning them to guided-resolve. A pod WITH the label (e.g. `currency` in otel-demo) still errors `app ... not found; available apps: accounting, ad, cart, ...` if it isn't in the endpoint map. Pick an app that IS in the "available apps" list the error prints.
- Empty parquet with `.parquet` suffix required — `RCABENCH_OPTIONAL_EMPTY_PARQUETS` values match on the full filename INCLUDING `.parquet`. Leaving the extension off silently fails to opt in.
- Namespace landed on the wrong slot — submit asked for `<sys>14` but the actual fault ran in `<sys>0`. On a current backend this should fail loudly instead; if it doesn't, the running binary is too old. Rebuild and redeploy. As a workaround, re-submit via `aegisctl inject guided --apply --auto` and let the server pick a slot.
- Single HTTP injection rejected as a duplicate — submit returns a `service 'X' at positions 0 and 0` warning and zero items. Self-loops (e.g. `GET /` against the same app) used to trip this. On a current backend it's gone; if it appears, the backend predates the fix or the same regression has resurfaced. Workaround: change the route or method to break the self-loop.
- Auto-allocate fails with "pool exhausted" — every namespace in the system's pool is either locked or empty. Two clean fixes: pass `--allow-bootstrap` so the server extends the pool with a fresh slot, or pre-deploy a workload via `aegisctl inject guided --install --namespace <sys>N` to give an existing slot something to inject against.
- Asked for a specific namespace, got "not found in current configuration" — that namespace isn't registered in the system's pool yet. The easiest fix is to re-submit through `aegisctl inject guided --apply --namespace <sys>N` (current backends register it on submit). If that doesn't help, the namespace was created out-of-band; expand the system's count via the systems admin API and retry.
- Silent trace drop, one system only — one benchmark (typically `ts*` or `tea*`) shows zero spans in ClickHouse for the last hour while peer systems flow normally. The root cause is almost always the OTel deployment-collector HPA pegged at `maxReplicas` with `memory_limiter` returning `UNAVAILABLE: data refused due to high memory usage`. Java agents (TT, teastore) don't retry on UNAVAILABLE, so spans are dropped on the floor; Go/Node SDKs use BatchSpanProcessor plus retry, so they squeeze through. Check `kubectl -n monitoring get hpa opentelemetry-kube-stack-deployment-collector`: if `REPLICAS = MAXPODS` and memory > 100% of target, raise `autoscaler.maxReplicas` and `resources.limits.memory` in `AegisLab/manifests/byte-cluster/otel-kube-stack.values.yaml` (see the HPA check after this list).
- Worker rollout collides with submission — restarting `runtime-worker`/`api-gateway` while traces are mid-pipeline silently orphans them (no error log, Redis queues empty, the trace just stops at whatever stage the worker was handling). Pattern: traces hang at the same stage cluster-wide right after a rollout. Rule of thumb: only redeploy aegis between rounds, after `terminals_round<N>.tsv` is fully reaped.
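For the operator label-mismatch bucket, a sketch that makes the partial-selector problem visible. The two label keys shown are common conventions, not confirmed for every operator in play:

```bash
# Sketch: expose the label mismatch behind "selector only sees N of M".
NS=otel-demo   # example namespace

# What an app-label-based selector sees (plain Deployments usually carry `app`):
kubectl -n "$NS" get pods -l app --show-labels

# What operator-managed pods often carry instead:
kubectl -n "$NS" get pods -l app.kubernetes.io/name --show-labels

# Dump every pod's label set to diff the keys actually in use:
kubectl -n "$NS" get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels}{"\n"}{end}'
```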
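For the silent-trace-drop bucket, a sketch of the confirmation steps before touching limits. The collector pod label selector is an assumption based on the OpenTelemetry operator's usual labels:

```bash
# Sketch: confirm the collector HPA is actually pegged.
kubectl -n monitoring get hpa opentelemetry-kube-stack-deployment-collector
# Pegged = REPLICAS equals MAXPODS and the memory metric sits above its target.

# Corroborate with memory_limiter refusals in the collector logs
# (label selector is an assumption for operator-managed collectors):
kubectl -n monitoring logs -l app.kubernetes.io/component=opentelemetry-collector \
  --tail=200 | grep -i "data refused due to high memory usage"

# If both hold, raise autoscaler.maxReplicas and resources.limits.memory in
# AegisLab/manifests/byte-cluster/otel-kube-stack.values.yaml and redeploy.
```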
## Output Shape
- Name the most likely blocker classes first.
- Separate contract problems from product bugs, environment bugs, and tooling bugs.
- Point to the next inspection surface instead of writing a full playbook.
- Prefer concise hypotheses like "check X because Y" over exhaustive detail.
## Project-specific inspection surfaces (aegis)
When triaging in this repo, common concrete surfaces worth naming early:
- `aegis/docs/troubleshooting/{datapack-schema,app-label-key,benchmark-integration-playbook}.md` — consolidated E2E pitfalls.
- `aegis/docs/deployment/kind/otel-collector-{cfg,rbac,externalname}.yaml` — kind-profile collector manifests (k8sattributes + upsert + 3-signal pipelines).
- `pkg/guidedcli` — guided-first inject pipeline; `/translate` and `GET /metadata` are 410.
- etcd `injection.system.*` — runtime source of truth for injection config (not YAML).
- ClickHouse OTLP traces + Redis task keys — the usual "did the data path complete" inspection.
- Per-stage failure triage: `datapack.build.failed` → inspect the job pod logs (`kubectl logs -n exp <task-uuid>-xxxx`) for `UNKNOWN_TABLE` (collector missing a pipeline), `Parquet file has no data rows: X.parquet` (wrong env-var filename or missing resource enrichment), or `No such file or directory: /data/drain_template/*.bin` (initDrainTemplate disabled but a detector algo still wired in). See the log-scan sketch below.
- `--app-label-key` must match the workload's actual label key (otel-demo uses `app.kubernetes.io/name`, not the default `app`). A mismatch surfaces as "app not found; available apps: <subset>" at backend submit.
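A sketch of that per-stage log scan. It assumes exactly one pod name matches the task UUID, per the `<task-uuid>-xxxx` convention above:

```bash
# Sketch: scan a failed datapack.build job pod for its three known signatures.
TASK_UUID=...   # from the failed task record

POD=$(kubectl -n exp get pods -o name | grep "${TASK_UUID}" | head -n1)
kubectl -n exp logs "${POD}" --tail=1000 \
  | grep -E 'UNKNOWN_TABLE|Parquet file has no data rows|No such file or directory: /data/drain_template'
```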