Voice AI Agent Skill Guide
This document teaches any AI coding assistant how to build a voice-enabled agent using Amazon Nova 2 Sonic, Strands Agents SDK, and Amazon Bedrock AgentCore Runtime. It is distilled from the insurance claims FNOL agent in this repository but applies to any domain where a voice interface submits data through an existing API.
The guide is tool-agnostic — it works with Claude Code, Cursor, Kiro, Cline, Windsurf, or any assistant that can read a markdown file.
Quick Start
Prerequisites
- AWS account with Bedrock model access for
amazon.nova-2-sonic-v1:0 - Node.js 22.x, AWS CDK 2.235+, Python 3.12+, Docker Desktop
- AWS CLI configured with credentials
Deploy and Verify
git clone https://github.com/aws-samples/serverless-eda-insurance-claims-processing.git
cd serverless-eda-insurance-claims-processing
npm install
npm run deploy # Deploys all stacks including VoiceFnolStack
After deployment, the CDK output includes the AgentCore WebSocket endpoint ARN. The React frontend connects to this endpoint with SigV4-signed WebSocket URLs.
To verify the agent is running, check the AgentCore Runtime status in the AWS console under Amazon Bedrock > AgentCore > Runtimes.
Architecture at a Glance
Browser (React)
| SigV4-signed WebSocket
v
AgentCore Runtime (managed container hosting)
| Docker container (Python 3.13, ARM64)
v
Strands BidiAgent + Nova 2 Sonic
| Tool calls
v
Customer API (GET) + FNOL API (POST)
| |
v v
DynamoDB EventBridge --> SQS --> Lambda --> IoT Core MQTT
(read policy) (Claim.Requested) (Claim.Accepted/Rejected)
The voice agent is a new entry point into an existing event-driven backend. Everything downstream of the FNOL API — fraud detection, settlement, notification — runs unchanged. The agent integrates at the API boundary, not the event bus.
For the full blog post, see: Extending an event-driven insurance claims application with Voice AI
Core Concepts
BidiAgent and Bidirectional Streaming
Strands BidiAgent manages a bidirectional audio stream between the client and Nova 2 Sonic. It accepts async callables for input and output (websocket.receive_json / websocket.send_json), wires them into its internal event loop, and dispatches tool calls as they arise. The agent is initialized once and reused across sessions — model loading happens at startup, not per connection.
Nova 2 Sonic: Speech-to-Speech in a Single Pass
Nova 2 Sonic is not a wrapper around separate ASR and TTS services. The model performs speech understanding, reasoning, tool calling, and speech generation in a single bidirectional stream — raw PCM audio in, raw PCM audio out. Tone, hesitation, and emphasis reach the model directly. Barge-in detection is built into the model server-side. Polyglot voices (e.g., "tiffany") support mid-sentence language switching.
AgentCore Runtime: Serverless Container Hosting
AgentCore Runtime hosts the agent container behind a single WebSocket endpoint. It handles authentication (SigV4), session routing, lifecycle management (5-min idle timeout, 1-hour max), and observability (CloudWatch Logs, X-Ray). Pay-as-you-go pricing charges only for active processing — I/O wait (waiting for Nova 2 Sonic or API responses) incurs no compute charge.
Tool-Based Design
Each agent capability maps to a @tool-decorated function with a bounded responsibility. Tools call existing APIs via SigV4-signed HTTP requests. The agent holds no direct knowledge of databases, event buses, or downstream processing. This makes each tool independently testable.
Build Your Own Voice Agent
Step 1: Define the Agent
Create the agent with a BidiNovaSonicModel and a set of tools. The agent is a singleton — initialize once, reuse across WebSocket sessions.
from strands.experimental.bidi.models.nova_sonic import BidiNovaSonicModel
from strands.experimental.bidi.agent import BidiAgent
def create_agent():
model = BidiNovaSonicModel(
model_id="amazon.nova-2-sonic-v1:0",
client_config={"region": os.environ["AWS_REGION"]},
provider_config={
"audio": {
"input_rate": 16000, # 16kHz from browser microphone
"output_rate": 24000, # 24kHz Nova Sonic synthesis
"format": "pcm",
"voice": "tiffany" # Polyglot voice with code-switching
}
}
)
return BidiAgent(
model=model,
tools=[your_tool_1, your_tool_2, stop_conversation],
system_prompt=SYSTEM_PROMPT
)
System prompt rules:
- Keep it conversational (3-5 sentences for the core persona)
- Use gender-appropriate pronouns for the selected voice
- Provide one-shot conversation examples for complex flows
- Do not use imperatives like "You must call tool X" — let the model decide tool timing
- Include safety-first guidance if the domain requires it (e.g., emergency assessment before data collection)
Reference: lib/services/voice-fnol-agent/app/agent.py
Step 2: Build Tools
Tools are Python functions decorated with @tool. Each tool returns a dictionary.
Basic tool:
from strands.tools import tool
@tool
async def your_lookup_tool(query: str) -> dict:
"""Retrieve data based on query."""
# Call your API here
return {"success": True, "data": result}
Tool with context (for user identity):
Use @tool(context=True) when the tool needs the caller's identity or session data. The invocation_state dictionary passed to agent.run() flows into every context-enabled tool.
from strands import tool, ToolContext
@tool(context=True)
async def get_customer_info(tool_context: ToolContext) -> dict:
"""Retrieve customer information using authenticated identity."""
cognito_id = tool_context.invocation_state['cognito_identity_id']
# Use cognito_id to call your API
return {"success": True, "customer": data}
Tool with inputSchema (critical for Nova Sonic):
Nova 2 Sonic constructs tool calls from audio, not text. It needs explicit field-level schemas with types and descriptions to map speech to structured parameters. Without inputSchema, the model cannot reliably map "it happened on Route 9 in Phoenix" to a nested location object.
@tool(
inputSchema={
"type": "object",
"properties": {
"incident": {
"type": "object",
"description": "Incident details",
"properties": {
"location": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"},
"state": {"type": "string", "description": "State abbreviation"},
"road": {"type": "string", "description": "Street or road name"}
},
"required": ["city", "state", "road"]
},
"description": {
"type": "string",
"description": "What happened and damage description"
}
}
}
},
"required": ["incident"]
}
)
async def submit_data(incident: dict) -> dict:
"""Submit structured data to your API."""
# POST to your endpoint
return {"success": True}
SigV4 helper for AWS API calls:
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
def get_sigv4_headers(url, method, region, body=""):
session = boto3.Session()
credentials = session.get_credentials()
request = AWSRequest(method=method, url=url, data=body,
headers={"Content-Type": "application/json",
"Host": url.split("/")[2]})
SigV4Auth(credentials, "execute-api", region).add_auth(request)
return dict(request.headers)
Reference: lib/services/voice-fnol-agent/app/tools/
Step 3: Wire the WebSocket Handler
The BedrockAgentCoreApp class from the bedrock_agentcore package handles the WebSocket lifecycle. The handler wires the agent to the connection.
from bedrock_agentcore import BedrockAgentCoreApp, RequestContext
from app.agent import get_agent
app = BedrockAgentCoreApp()
# Agent is a singleton — initialized once, reused across sessions
agent = get_agent()
@app.websocket
async def websocket_handler(websocket, context: RequestContext):
# Extract custom headers (AgentCore lowercases them)
cognito_identity_id = context.request_headers.get(
'x-amzn-bedrock-agentcore-runtime-custom-cognitoidentityid')
await websocket.accept()
try:
await agent.run(
inputs=[websocket.receive_json],
outputs=[websocket.send_json],
invocation_state={'cognito_identity_id': cognito_identity_id}
)
except WebSocketDisconnect as e:
if getattr(e, 'code', None) != 1000:
logger.warning(f"Unexpected disconnect: {e}")
finally:
await agent.stop()
Key points:
inputsandoutputsaccept any async callable — Strands wires them into bidirectional streaminginvocation_stateflows to every@tool(context=True)decorated tool- Always call
agent.stop()infinallyto clean up resources
Reference: lib/services/voice-fnol-agent/app/app_agentcore.py
Step 4: CDK Infrastructure
Two CDK resources define the agent deployment: CfnRuntime (the agent) and CfnRuntimeEndpoint (the WebSocket endpoint).
import * as bedrockagentcore from "aws-cdk-lib/aws-bedrockagentcore";
import * as ecr_assets from "aws-cdk-lib/aws-ecr-assets";
// IAM role — trust principal MUST be bedrock-agentcore.amazonaws.com
const agentRole = new iam.Role(this, "AgentRole", {
assumedBy: new iam.ServicePrincipal("bedrock-agentcore.amazonaws.com", {
conditions: {
StringEquals: { "aws:SourceAccount": account },
ArnLike: { "aws:SourceArn": `arn:aws:bedrock-agentcore:${region}:${account}:*` }
}
})
});
// Docker image — AgentCore REQUIRES ARM64
const dockerImage = new ecr_assets.DockerImageAsset(this, "AgentImage", {
directory: path.join(__dirname, "../"),
platform: ecr_assets.Platform.LINUX_ARM64,
});
// AgentCore Runtime
const agentRuntime = new bedrockagentcore.CfnRuntime(this, "AgentRuntime", {
agentRuntimeName: "my_voice_agent",
roleArn: agentRole.roleArn,
networkConfiguration: { networkMode: "PUBLIC" },
agentRuntimeArtifact: {
containerConfiguration: { containerUri: dockerImage.imageUri }
},
lifecycleConfiguration: {
idleRuntimeSessionTimeout: 300, // 5 min idle timeout
maxLifetime: 3600 // 1 hour max
},
requestHeaderConfiguration: {
requestHeaderAllowlist: [
"X-Amzn-Bedrock-AgentCore-Runtime-Custom-CognitoIdentityId"
]
}
});
// AgentCore Runtime Endpoint
const agentEndpoint = new bedrockagentcore.CfnRuntimeEndpoint(this, "AgentEndpoint", {
name: "my_voice_agent_endpoint",
agentRuntimeId: agentRuntime.ref,
});
agentEndpoint.addDependency(agentRuntime);
Key CDK notes:
- Import is
aws-cdk-lib/aws-bedrockagentcore(notaws-cdk-lib/aws-bedrock) - Trust principal is
bedrock-agentcore.amazonaws.com(notbedrock.amazonaws.com) requestHeaderAllowlistheaders must be prefixed withX-Amzn-Bedrock-AgentCore-Runtime-Custom-- Without
requestHeaderAllowlist, custom headers are stripped silently at the AgentCore boundary - Grant
ecr:GetAuthorizationTokenon*andecr:BatchGetImageon the repository
Reference: lib/services/voice-fnol-agent/infra/voice-fnol-service.ts
Step 5: Frontend Audio
The frontend opens a SigV4-presigned WebSocket connection and streams PCM audio bidirectionally.
SigV4 presigned WebSocket URL:
const signer = new SignatureV4({
credentials: {
accessKeyId: credentials.accessKeyId,
secretAccessKey: credentials.secretAccessKey,
sessionToken: credentials.sessionToken
},
region: "us-east-1",
service: "bedrock-agentcore", // NOT "bedrock"
sha256: Sha256,
});
const signedRequest = await signer.presign(request, { expiresIn: 300 });
Audio capture (16kHz PCM in):
- Use
navigator.mediaDevices.getUserMedia()withsampleRate: 16000,channelCount: 1 - Enable
echoCancellationandnoiseSuppression - Convert Float32Array to Int16Array (PCM16) before sending
Audio playback (24kHz PCM out):
- Create
AudioContextat 24kHz - Schedule chunks at
nextPlayTimeto prevent gaps — do not callsource.start()without a time parameter - Track
activeSourcesarray for barge-in cancellation
Barge-in handling:
- Listen for
bidi_interruptionmessage type on the WebSocket - On interruption: stop all active audio sources, clear the queue, reset
nextPlayTimetoaudioContext.currentTime
Reference: react-claims/src/utils.js and react-claims/src/components/
Common Pitfalls
1. Missing inputSchema on tools
Symptom: Nova Sonic fails to call the tool or sends malformed parameters.
Cause: Text-based LLMs infer parameter structure from docstrings; a speech model cannot.
Fix: Define explicit inputSchema with types and descriptions on every tool that accepts structured parameters.
2. Creating agent per WebSocket connection
Symptom: High latency on every new connection, excessive memory usage.
Cause: Model loading and tool registration happen inside the handler instead of at startup.
Fix: Initialize the agent once at module level (agent = create_agent()), reuse across sessions.
3. Audio playback gaps
Symptom: Choppy, stuttering audio output.
Cause: Calling source.start() without scheduling — each chunk plays immediately instead of after the previous one finishes.
Fix: Track nextPlayTime and schedule each chunk: source.start(nextPlayTime); nextPlayTime += buffer.duration;
4. Ignoring bidi_interruption events
Symptom: Agent audio continues playing after the customer starts speaking.
Cause: Frontend does not listen for bidi_interruption messages from Nova Sonic.
Fix: On interruption, stop all active audio sources, clear the queue, and reset nextPlayTime.
5. Wrong IAM trust principal
Symptom: AgentCore fails to assume the IAM role; Runtime creation fails.
Cause: Trust policy uses bedrock.amazonaws.com instead of bedrock-agentcore.amazonaws.com.
Fix: Set the service principal to bedrock-agentcore.amazonaws.com with SourceAccount and SourceArn conditions.
6. Custom headers stripped silently
Symptom: tool_context.invocation_state has no user identity; get_customer_info fails.
Cause: The header is not listed in requestHeaderAllowlist on the CfnRuntime resource.
Fix: Add the header name to requestHeaderAllowlist. Headers must be prefixed with X-Amzn-Bedrock-AgentCore-Runtime-Custom-.
7. Missing invocation_state in agent.run()
Symptom: @tool(context=True) tools receive empty context, cannot access user identity.
Cause: agent.run() is called without the invocation_state parameter.
Fix: Pass invocation_state={"key": value} to agent.run().
Adapting This Pattern
To build a voice agent for a different domain:
-
Keep the skeleton. The
BidiAgent→BedrockAgentCoreApp→CfnRuntime→CfnRuntimeEndpointpattern is domain-independent. -
Replace the tools. Remove
get_customer_info,submit_to_fnol_api, etc. Add tools for your domain — each should call one API endpoint and return a dictionary. -
Rewrite the system prompt. Describe your agent's persona, conversation flow, and safety considerations. Keep it conversational (3-5 sentences for the core persona, then specific guidance).
-
Define inputSchemas. For every tool that accepts structured data, write an explicit JSON schema matching your API contract.
-
Update CDK environment variables. Replace
FNOL_API_ENDPOINTandCUSTOMER_API_ENDPOINTwith your API endpoints. Update IAM policies to grantexecute-api:Invokeon your specific API Gateway resources. -
Adjust frontend. Update the WebSocket connection URL and any custom headers. The audio capture/playback code is reusable as-is.
The voice infrastructure (AgentCore, Nova Sonic, SigV4 auth, audio streaming) stays identical. Only the tools, system prompt, and API endpoints change.
Using This File With Your Coding Assistant
Claude Code
Reference this file directly in your prompt, or add to your project's .claude/ instructions:
# In your conversation
@agent-skills/VOICE_AGENT_SKILL.md build me a voice agent for appointment scheduling
Cursor
Use @agent-skills/VOICE_AGENT_SKILL.md in chat, or copy the file into .cursor/rules/ for automatic inclusion.
Kiro
Copy to .kiro/steering/voice-agent.md and add frontmatter:
---
inclusion: auto
description: Voice AI agent development guide
tags: [voice-ai, nova-sonic, strands, agentcore]
---
Cline
Use @file reference in chat: @agent-skills/VOICE_AGENT_SKILL.md. Or add to .clinerules for automatic context.
Windsurf
Open as a tab and @-reference in Cascade. Or add to Windsurf Rules for automatic inclusion.
Generic / Manual
Paste the contents into your assistant's context window, or point it at the file path if it supports file reading.