name: investigate-stuck-messages
description: Investigate stuck messages in relayer queue. Use when alerts mention "queue length > 0", to diagnose why messages are stuck, or to get message IDs for denylisting.
Investigate Stuck Messages
Query the relayer API to investigate stuck messages, their retry counts, and error reasons.
When to Use
Alert-based triggers:
- Alert: "Known app context relayer queue length > 0 for 40m"
- Any alert mentioning stuck messages in prepare queue
- High retry counts for specific app contexts
User request triggers:
- "Why are messages stuck for [app_context]?"
- "Investigate stuck messages on [chain]"
- "What's causing the queue alert?"
- Pasting a Grafana alert URL
Input Parameters
Option 1: Grafana Alert URL (recommended)
/investigate-stuck-messages https://abacusworks.grafana.net/alerting/grafana/cdg1ro5hi4vswb/view?tab=instances
Option 2: Manual specification
/investigate-stuck-messages app_context=EZETH/renzo-prod remote=linea
| Parameter | Required | Default | Description |
|---|---|---|---|
| `alert_url` | No | - | Grafana alert URL (extracts app_context/remote from firing instances) |
| `app_context` | No* | - | The app context (e.g., EZETH/renzo-prod, oUSDT/production) |
| `remote` | No* | - | Destination chain name (e.g., linea, ethereum, arbitrum) |
| `environment` | No | mainnet3 | Deployment environment |
*Either alert_url OR both app_context and remote must be provided.
Workflow
Step 1: Parse Input and Extract Alert Instances
If Grafana alert URL provided:
- Extract the alert UID from the URL (e.g., `cdg1ro5hi4vswb` from `.../alerting/grafana/cdg1ro5hi4vswb/view`).
- Query Prometheus directly for firing instances using `mcp__grafana__query_prometheus`:

      sum by (app_context, remote)(
        max_over_time(
          hyperlane_submitter_queue_length{
            queue_name="prepare_queue",
            app_context!~"Unknown|merkly_eth|merkly_erc20|helloworld|velo_message_module",
            hyperlane_context!~"rc|vanguard0|vanguard1|vanguard2|vanguard3|vanguard4|vanguard5",
            operation_status!~"Retry\\(ApplicationReport\\(.*\\)\\)|FirstPrepareAttempt",
            hyperlane_deployment="mainnet3",
          }[2m]
        )
      ) > 0

- Extract the `app_context` and `remote` labels from each result.
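The UID extraction can be sketched with `sed`. This is a minimal illustration; the function name `extract_alert_uid` is hypothetical, and it assumes the UID is the path segment immediately after `/alerting/grafana/`.

```shell
# Hypothetical helper: pull the alert UID out of a Grafana alert URL.
# Assumes the UID is the path segment right after /alerting/grafana/.
extract_alert_uid() {
  printf '%s\n' "$1" | sed -n 's|.*/alerting/grafana/\([^/?]*\).*|\1|p'
}

extract_alert_uid 'https://abacusworks.grafana.net/alerting/grafana/cdg1ro5hi4vswb/view?tab=instances'
# prints: cdg1ro5hi4vswb
```

If the URL does not contain that path, the function prints nothing, which can be treated as "fall back to manual specification".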
If manual app_context/remote provided:
Use the provided values directly.
Step 2: Setup Port-Forward to Relayer
Check if port 9090 is already in use:
lsof -i :9090
If not in use, start port-forward in background:
kubectl port-forward omniscient-relayer-hyperlane-agent-relayer-0 9090 -n mainnet3 &
Wait a few seconds for the port-forward to establish.
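The check-then-start sequence above could be wrapped in one function. This is a sketch only: `ensure_port_forward` is a hypothetical name, and the 3-second wait is an assumed value standing in for "a few seconds".

```shell
# Sketch: start the port-forward only if nothing is already using port 9090.
ensure_port_forward() {
  local port=9090
  if lsof -i ":${port}" >/dev/null 2>&1; then
    echo "port ${port} already in use; reusing existing listener"
    return 0
  fi
  # Run the tunnel in the background, as in the manual step.
  kubectl port-forward omniscient-relayer-hyperlane-agent-relayer-0 "${port}" -n mainnet3 >/dev/null 2>&1 &
  sleep 3  # assumed delay; adjust if the tunnel is slow to establish
  echo "started port-forward on ${port} (pid $!)"
}

# Usage:
#   ensure_port_forward
```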
Step 3: Get Domain IDs for Chains
Look up domain IDs from the registry:
cat node_modules/.pnpm/@hyperlane-xyz+registry@*/node_modules/@hyperlane-xyz/registry/dist/chains/<chain>/metadata.json | jq '.domainId'
Common domain IDs:
- ethereum: 1
- optimism: 10
- arbitrum: 42161
- polygon: 137
- base: 8453
- unichain: 130
- avalanche: 43114
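For the chains listed above, a small lookup avoids reading the registry each time. The sketch below hard-codes only the IDs from that list; `domain_id_for` is a hypothetical helper, and any other chain should still go through the registry `metadata.json` lookup.

```shell
# Hypothetical lookup for the common domain IDs listed above.
# Returns non-zero for chains that must be looked up in the registry.
domain_id_for() {
  case "$1" in
    ethereum)  echo 1 ;;
    optimism)  echo 10 ;;
    arbitrum)  echo 42161 ;;
    polygon)   echo 137 ;;
    base)      echo 8453 ;;
    unichain)  echo 130 ;;
    avalanche) echo 43114 ;;
    *) echo "unknown chain '$1': look up domainId in the registry" >&2; return 1 ;;
  esac
}

domain_id_for arbitrum
# prints: 42161
```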
Step 4: Query Relayer API
For each destination chain, query the relayer API:
curl -s 'http://localhost:9090/list_operations?destination_domain=<DOMAIN_ID>' > /tmp/<chain>.json
The response contains operations with:
- `id`: Message ID (H256)
- `operation.message.sender`: Sender address
- `operation.message.recipient`: Recipient address
- `operation.num_retries`: Number of retries (higher = more stuck)
- `operation.status`: Error status (e.g., `{"Retry": "ErrorEstimatingGas"}`)
- `operation.message.origin`: Origin domain ID
- `operation.message.destination`: Destination domain ID
- `operation.app_context`: App context name
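To get a quick overview of a saved response, the fields above can be flattened with `jq`. This sketch assumes the endpoint returns a top-level JSON array and that `num_retries` is numeric; neither is confirmed here, so adjust the filter if the real shape differs. `summarize_operations` is a hypothetical name.

```shell
# Sketch: print id, retries, status, and app_context as TSV, most-retried first.
# Assumes the file holds a top-level JSON array of operations.
summarize_operations() {
  jq -r 'sort_by(-.operation.num_retries) | .[]
         | [.id, .operation.num_retries, (.operation.status | tostring), .operation.app_context]
         | @tsv' "$1"
}

# Usage:
#   summarize_operations /tmp/<chain>.json
```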
Step 5: Filter Messages by App Context
Look up the app_context in rust/main/app-contexts/mainnet_config.json:
jq '.metricAppContexts[] | select(.name == "<APP_CONTEXT>")' rust/main/app-contexts/mainnet_config.json
Filter API results to only include messages where:
- `operation.message.recipient` matches one of the `recipientAddress` values for that destination domain
Important: Addresses are padded to 32 bytes (H256 format).
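When comparing a 20-byte address against the API output, it helps to normalize it first. The sketch below assumes EVM-style left zero-padding and lowercase hex; `to_h256` is a hypothetical helper, and the casing the API actually returns is not confirmed here.

```shell
# Sketch: left-pad a hex address to 32 bytes (H256) and lowercase it.
# Assumes EVM-style addresses, which are zero-padded on the left.
to_h256() {
  local hex="${1#0x}"
  hex=$(printf '%s' "$hex" | tr 'A-F' 'a-f')
  # Pad with spaces to 64 characters, then turn the padding into zeros.
  printf '0x%64s\n' "$hex" | tr ' ' '0'
}

# Usage (assumes a top-level array and 0x-prefixed recipient strings):
#   jq --arg r "$(to_h256 <ADDRESS>)" \
#     '[.[] | select((.operation.message.recipient | ascii_downcase) == $r)]' /tmp/<chain>.json
```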
Step 6: Query GCP Logs for Actual Errors
Calculate log freshness based on retry count:
The relayer uses exponential backoff (see calculate_msg_backoff in rust/main/agents/relayer/src/msg/pending_message.rs):
| Retries | Backoff/retry | Cumulative Time | Freshness Flag |
|---|---|---|---|
| 1-4 | 5s-1min | ~2min | --freshness=1h |
| 5-24 | 3min | ~1h | --freshness=3h |
| 25-39 | 5-26min | ~5h | --freshness=12h |
| 40-49 | 30min-1h | ~12h | --freshness=24h |
| 50-60 | 2-22h | ~35h | --freshness=3d |
| 60+ | 22h+ | 35h+ | --freshness=7d |
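The table can be encoded as a small function so the flag is chosen consistently. This is a sketch: `freshness_for_retries` is a hypothetical name, and because the table lists 60 in both of its last two rows, the code treats exactly 60 retries as part of the 50-60 row.

```shell
# Sketch: map a retry count to the --freshness value from the table above.
# Exactly 60 retries is treated as the 50-60 row (an assumption).
freshness_for_retries() {
  local n="$1"
  if   [ "$n" -le 4 ];  then echo 1h
  elif [ "$n" -le 24 ]; then echo 3h
  elif [ "$n" -le 39 ]; then echo 12h
  elif [ "$n" -le 49 ]; then echo 24h
  elif [ "$n" -le 60 ]; then echo 3d
  else echo 7d
  fi
}

freshness_for_retries 47
# prints: 24h
```

The result can be passed directly as `--freshness="$(freshness_for_retries "$retries")"`.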
For each message ID, query GCP logs with calculated freshness:
gcloud logging read 'resource.type=k8s_container AND resource.labels.namespace_name=mainnet3 AND resource.labels.pod_name:omniscient-relayer AND jsonPayload.span.id:<MESSAGE_ID> AND jsonPayload.fields.error:*' --project=abacus-labs-dev --limit=1 --format='value(jsonPayload.fields.error)' --freshness=<CALCULATED_FRESHNESS>
Extract the human-readable error from the response using sed (macOS compatible):
echo "$raw_error" | sed -n 's/.*execution reverted: \([^"]*\)".*/\1/p' | head -1
Common error patterns:
- `"execution reverted: Nonce already used"` → "Nonce already used"
- `"execution reverted: panic: arithmetic underflow"` → "Arithmetic underflow"
Note: Do not use grep -P as it's not available on macOS.
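The `sed` extraction alone leaves the `panic: ` prefix and original casing in place. The sketch below adds those two normalization steps using only `sed` and `awk`, which are available on macOS; `extract_revert_reason` is a hypothetical name, and the quoted input format is assumed from the patterns listed above.

```shell
# Sketch: extract the revert reason, drop a leading "panic: ", capitalize it.
# Uses only sed/awk features available on both macOS and Linux.
extract_revert_reason() {
  printf '%s\n' "$1" \
    | sed -n 's/.*execution reverted: \([^"]*\)".*/\1/p' \
    | head -1 \
    | sed 's/^panic: //' \
    | awk '{ print toupper(substr($0, 1, 1)) substr($0, 2) }'
}

extract_revert_reason '"execution reverted: panic: arithmetic underflow"'
# prints: Arithmetic underflow
```

If the log line contains no `execution reverted:` text, the function prints nothing, so the Error column can fall back to the raw log message.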
Step 7: Present Investigation Results
Output a detailed summary table with full message IDs and both error sources:
## Investigation Results for [APP_CONTEXT]
### Summary
- Total stuck messages: X
- Destinations affected: [list]
- Reprepare reasons: ErrorEstimatingGas (N), CouldNotFetchMetadata (M)
### Messages
| Message ID | Retries | Reprepare Reason | Error | Origin |
|------------|---------|------------------|-----------|--------|
| `0xaa18ebc1c79345e6d24984a0b9a5ab66c968d128d46b2357b641e56e71b8d30c` | 47 | ErrorEstimatingGas | Nonce already used | optimism |
| `0xd6aeef7c092a88aa23ad53227aeb834ae731d059b3ce749db8451e761f3f15ac` | 47 | ErrorEstimatingGas | Nonce already used | arbitrum |
**Important**: Always show the full 66-character message ID (0x + 64 hex chars). Do not truncate.
### Error Analysis
[Explain based on the actual log errors found]
### Next Steps
To denylist these messages, run:
/denylist-stuck-messages <message_ids> app_context=APP_CONTEXT
Column definitions:
- Reprepare Reason: From `operation.status` in relayer API (e.g., ErrorEstimatingGas, CouldNotFetchMetadata)
- Error: Actual revert reason from GCP logs (e.g., "Nonce already used", "Arithmetic underflow")
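The "Reprepare reasons" line in the summary is a tally of the Reprepare Reason column. One way to produce it is sketched below; `count_reasons` is a hypothetical helper that reads one reason per line on stdin and assumes reasons contain no whitespace.

```shell
# Sketch: tally reprepare reasons, most frequent first.
# Reads one reason per line; assumes reasons contain no whitespace.
count_reasons() {
  sort | uniq -c | sort -rn | awk '{ printf "%s (%d)\n", $2, $1 }'
}

printf '%s\n' ErrorEstimatingGas CouldNotFetchMetadata ErrorEstimatingGas | count_reasons
# prints:
#   ErrorEstimatingGas (2)
#   CouldNotFetchMetadata (1)
```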
Step 8: Output Denylist Command
At the end of the investigation results, output the full denylist command:
### Next Steps
To denylist, run:
/denylist-stuck-messages 0xaa18ebc1c79345e6d24984a0b9a5ab66c968d128d46b2357b641e56e71b8d30c 0xd6aeef7c092a88aa23ad53227aeb834ae731d059b3ce749db8451e761f3f15ac app_context=APP_CONTEXT
Always use full message IDs, never truncated.
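A guard against truncated IDs can be sketched as follows. `build_denylist_command` is a hypothetical helper; it only formats the command text shown above and refuses any ID that is not `0x` followed by 64 hex characters.

```shell
# Sketch: build the denylist command, rejecting truncated message IDs.
# Usage: build_denylist_command <app_context> <message_id>...
build_denylist_command() {
  local app_context="$1"
  shift
  [ "$#" -gt 0 ] || { echo "no message IDs given" >&2; return 1; }
  local id
  for id in "$@"; do
    if ! printf '%s\n' "$id" | grep -Eq '^0x[0-9a-fA-F]{64}$'; then
      echo "refusing truncated or malformed message ID: $id" >&2
      return 1
    fi
  done
  echo "/denylist-stuck-messages $* app_context=${app_context}"
}

# Usage:
#   build_denylist_command EZETH/renzo-prod <message_id_1> <message_id_2>
```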
Error Status Reference
| Status | Meaning | Action |
|---|---|---|
| `ErrorEstimatingGas` | Gas estimation failed (contract revert) | Usually denylist - contract won't accept |
| `CouldNotFetchMetadata` | Can't get ISM metadata | Check validators, may resolve itself |
| `ApplicationReport(...)` | App-specific error | Check the specific error message |
| `GasPaymentNotFound` | No IGP payment | May need manual relay with gas |
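For scripted triage, the table can be mirrored in a `case` statement. This is only a restatement of the table as code; `suggested_action` is a hypothetical name, and the fallback branch for unlisted statuses is an assumption.

```shell
# Sketch: map a reprepare reason to the suggested action from the table above.
suggested_action() {
  case "$1" in
    ErrorEstimatingGas)    echo "Usually denylist; the contract will not accept the message" ;;
    CouldNotFetchMetadata) echo "Check validators; may resolve itself" ;;
    ApplicationReport*)    echo "Check the specific error message" ;;
    GasPaymentNotFound)    echo "May need manual relay with gas" ;;
    *)                     echo "Unlisted status; check the GCP logs" ;;  # assumed fallback
  esac
}

suggested_action 'ApplicationReport(example)'
# prints: Check the specific error message
```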
Error Handling
- Port-forward fails: Check kubectl context: `kubectl config current-context`
- No messages found: Queue may have cleared; alert may be stale
- API returns error: Check relayer pod: `kubectl get pods -n mainnet3 | grep relayer`
- App context not found: May be new/custom; ask user for sender/recipient addresses
Prerequisites
- `kubectl` configured with access to mainnet cluster
- Grafana MCP server connected (for alert URL parsing)
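The workflow also shells out to `lsof`, `curl`, `jq`, `gcloud`, and `sed`. A quick preflight check could look like the sketch below; `missing_tools` is a hypothetical helper that prints each named tool not found on `PATH`.

```shell
# Sketch: print every named command-line tool that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Tools used by the steps in this document; empty output means all are present.
missing_tools kubectl lsof curl jq gcloud sed
```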