Voice AI Agent Skill Guide

This document teaches any AI coding assistant how to build a voice-enabled agent using Amazon Nova 2 Sonic, Strands Agents SDK, and Amazon Bedrock AgentCore Runtime. It is distilled from the insurance claims FNOL agent in this repository but applies to any domain where a voice interface submits data through an existing API.

The guide is tool-agnostic — it works with Claude Code, Cursor, Kiro, Cline, Windsurf, or any assistant that can read a markdown file.

Quick Start

Prerequisites

AWS account with Bedrock model access for amazon.nova-2-sonic-v1:0
Node.js 22.x, AWS CDK 2.235+, Python 3.12+, Docker Desktop
AWS CLI configured with credentials

Deploy and Verify

git clone https://github.com/aws-samples/serverless-eda-insurance-claims-processing.git
cd serverless-eda-insurance-claims-processing
npm install
npm run deploy           # Deploys all stacks including VoiceFnolStack

After deployment, the CDK output includes the AgentCore WebSocket endpoint ARN. The React frontend connects to this endpoint with SigV4-signed WebSocket URLs.

To verify the agent is running, check the AgentCore Runtime status in the AWS console under Amazon Bedrock > AgentCore > Runtimes.

Architecture at a Glance

Browser (React)
  |  SigV4-signed WebSocket
  v
AgentCore Runtime (managed container hosting)
  |  Docker container (Python 3.13, ARM64)
  v
Strands BidiAgent + Nova 2 Sonic
  |  Tool calls
  v
Customer API (GET)  +  FNOL API (POST)
  |                      |
  v                      v
DynamoDB             EventBridge --> SQS --> Lambda --> IoT Core MQTT
(read policy)        (Claim.Requested)                  (Claim.Accepted/Rejected)

The voice agent is a new entry point into an existing event-driven backend. Everything downstream of the FNOL API — fraud detection, settlement, notification — runs unchanged. The agent integrates at the API boundary, not the event bus.

For the full blog post, see: Extending an event-driven insurance claims application with Voice AI

Core Concepts

BidiAgent and Bidirectional Streaming

Strands BidiAgent manages a bidirectional audio stream between the client and Nova 2 Sonic. It accepts async callables for input and output (websocket.receive_json / websocket.send_json), wires them into its internal event loop, and dispatches tool calls as they arise. The agent is initialized once and reused across sessions — model loading happens at startup, not per connection.
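To make "async callables for input and output" concrete, the sketch below wires two coroutine functions into a toy event loop using only asyncio. FakeSocket and echo_loop are hypothetical stand-ins invented for this illustration; they are not part of Strands, and the real BidiAgent loop does far more (model streaming, tool dispatch).

```python
import asyncio
from collections import deque


class FakeSocket:
    """Hypothetical stand-in for a WebSocket, backed by in-memory buffers."""

    def __init__(self, incoming):
        self._incoming = deque(incoming)
        self.sent = []

    async def receive_json(self):
        # Simulate a disconnect once the client has nothing more to send.
        if not self._incoming:
            raise ConnectionError("client disconnected")
        return self._incoming.popleft()

    async def send_json(self, message):
        self.sent.append(message)


async def echo_loop(receive, send):
    """Toy loop: pull events from `receive`, push responses to `send`."""
    try:
        while True:
            event = await receive()
            await send({"type": "echo", "payload": event})
    except ConnectionError:
        pass  # input source closed, so the session ends


def run_demo():
    socket = FakeSocket([{"audio": "chunk-1"}, {"audio": "chunk-2"}])
    # Bound methods are passed as plain callables, the same shape as
    # websocket.receive_json / websocket.send_json in the handler.
    asyncio.run(echo_loop(socket.receive_json, socket.send_json))
    return socket.sent
```

The point of the design is that the loop never sees the socket object itself, only two awaitable functions, which is why any transport that can supply them would fit.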

Nova 2 Sonic: Speech-to-Speech in a Single Pass

Nova 2 Sonic is not a wrapper around separate ASR and TTS services. The model performs speech understanding, reasoning, tool calling, and speech generation in a single bidirectional stream — raw PCM audio in, raw PCM audio out. Tone, hesitation, and emphasis reach the model directly. Barge-in detection is built into the model server-side. Polyglot voices (e.g., "tiffany") support mid-sentence language switching.

AgentCore Runtime: Serverless Container Hosting

AgentCore Runtime hosts the agent container behind a single WebSocket endpoint. It handles authentication (SigV4), session routing, lifecycle management (5-min idle timeout, 1-hour max), and observability (CloudWatch Logs, X-Ray). Pay-as-you-go pricing charges only for active processing — I/O wait (waiting for Nova 2 Sonic or API responses) incurs no compute charge.
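The two lifecycle limits interact: whichever is reached first ends the session. The sketch below only illustrates that arithmetic with the values quoted above; session_deadline is a hypothetical helper and does not describe how the service implements its timers.

```python
IDLE_TIMEOUT_S = 300    # 5-minute idle timeout from the text
MAX_LIFETIME_S = 3600   # 1-hour maximum session lifetime from the text


def session_deadline(started_at: float, last_activity_at: float) -> float:
    """Return the earliest time (in seconds) at which either limit is hit."""
    idle_deadline = last_activity_at + IDLE_TIMEOUT_S
    lifetime_deadline = started_at + MAX_LIFETIME_S
    return min(idle_deadline, lifetime_deadline)


# A session that started at t=0 and was last active at t=100 is bounded by the idle timeout.
print(session_deadline(0, 100))    # 400
# One that is still active at t=3500 is bounded by the maximum lifetime.
print(session_deadline(0, 3500))   # 3600
```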

Tool-Based Design

Each agent capability maps to a @tool-decorated function with a bounded responsibility. Tools call existing APIs via SigV4-signed HTTP requests. The agent holds no direct knowledge of databases, event buses, or downstream processing. This makes each tool independently testable.

Build Your Own Voice Agent

Step 1: Define the Agent

Create the agent with a BidiNovaSonicModel and a set of tools. The agent is a singleton — initialize once, reuse across WebSocket sessions.

import os

from strands.experimental.bidi.models.nova_sonic import BidiNovaSonicModel
from strands.experimental.bidi.agent import BidiAgent

# SYSTEM_PROMPT and the tool functions are defined elsewhere in the module

def create_agent():
    model = BidiNovaSonicModel(
        model_id="amazon.nova-2-sonic-v1:0",
        client_config={"region": os.environ["AWS_REGION"]},
        provider_config={
            "audio": {
                "input_rate": 16000,   # 16kHz from browser microphone
                "output_rate": 24000,  # 24kHz Nova Sonic synthesis
                "format": "pcm",
                "voice": "tiffany"     # Polyglot voice with code-switching
            }
        }
    )
    return BidiAgent(
        model=model,
        tools=[your_tool_1, your_tool_2, stop_conversation],
        system_prompt=SYSTEM_PROMPT
    )

System prompt rules:

Keep it conversational (3-5 sentences for the core persona)
Use gender-appropriate pronouns for the selected voice
Provide one-shot conversation examples for complex flows
Do not use imperatives like "You must call tool X" — let the model decide tool timing
Include safety-first guidance if the domain requires it (e.g., emergency assessment before data collection)
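As an illustration of these rules, here is one way a prompt could be assembled. The persona ("Maya"), the dental-clinic domain, and all wording are hypothetical and chosen only for the example; they do not come from the repository.

```python
# Core persona: four conversational sentences, no tool-calling imperatives.
PERSONA = (
    "You are Maya, a warm and patient assistant who helps callers book "
    "dental appointments. You speak in short, natural sentences. "
    "You ask one question at a time and confirm details back to the caller. "
    "When referring to yourself in the third person, use she/her."
)

# Safety-first guidance comes before any data collection.
SAFETY = (
    "If the caller describes severe pain, bleeding, or an injury, first ask "
    "whether they need urgent care before collecting booking details."
)

# One-shot example of the expected conversational flow.
EXAMPLE = (
    "Example conversation:\n"
    "Caller: I need a cleaning next week.\n"
    "Maya: Happy to help. Which day works best for you?\n"
    "Caller: Tuesday morning.\n"
    "Maya: Tuesday morning it is. Can I get your full name?"
)

SYSTEM_PROMPT = "\n\n".join([PERSONA, SAFETY, EXAMPLE])
```

Keeping the persona, safety guidance, and example as separate constants makes it easier to swap one part when adapting the agent to another domain.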

Reference: lib/services/voice-fnol-agent/app/agent.py

Step 2: Build Tools

Tools are Python functions decorated with @tool. Each tool returns a dictionary.

Basic tool:

from strands.tools import tool

@tool
async def your_lookup_tool(query: str) -> dict:
    """Retrieve data based on query."""
    # Call your API here; `result` is a placeholder for its response
    result = {"query": query}
    return {"success": True, "data": result}

Tool with context (for user identity):

Use @tool(context=True) when the tool needs the caller's identity or session data. The invocation_state dictionary passed to agent.run() flows into every context-enabled tool.

from strands import tool, ToolContext

@tool(context=True)
async def get_customer_info(tool_context: ToolContext) -> dict:
    """Retrieve customer information using authenticated identity."""
    cognito_id = tool_context.invocation_state['cognito_identity_id']
    # Use cognito_id to call your API; `data` is a placeholder for its response
    data = {"cognito_identity_id": cognito_id}
    return {"success": True, "customer": data}

Tool with inputSchema (critical for Nova Sonic):

Nova 2 Sonic constructs tool calls from audio, not text. It needs explicit field-level schemas with types and descriptions to map speech to structured parameters. Without inputSchema, the model cannot reliably map "it happened on Route 9 in Phoenix" to a nested location object.

@tool(
    inputSchema={
        "type": "object",
        "properties": {
            "incident": {
                "type": "object",
                "description": "Incident details",
                "properties": {
                    "location": {
                        "type": "object",
                        "properties": {
                            "city": {"type": "string", "description": "City name"},
                            "state": {"type": "string", "description": "State abbreviation"},
                            "road": {"type": "string", "description": "Street or road name"}
                        },
                        "required": ["city", "state", "road"]
                    },
                    "description": {
                        "type": "string",
                        "description": "What happened and damage description"
                    }
                }
            }
        },
        "required": ["incident"]
    }
)
async def submit_data(incident: dict) -> dict:
    """Submit structured data to your API."""
    # POST to your endpoint
    return {"success": True}
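A tool may still receive incomplete arguments when the caller never said a required detail. The sketch below walks a schema of the shape shown above and lists the required fields that are absent, so the tool could report them instead of posting a partial payload. missing_required and the trimmed SCHEMA copy are hypothetical helpers for this illustration, not part of Strands or the repository.

```python
def missing_required(schema: dict, value: dict, path: str = "") -> list[str]:
    """Return dotted paths of required fields absent from `value`."""
    missing = []
    for name in schema.get("required", []):
        if name not in value:
            missing.append(f"{path}{name}")
    # Recurse into nested objects that are present in the value.
    for name, subschema in schema.get("properties", {}).items():
        if subschema.get("type") == "object" and isinstance(value.get(name), dict):
            missing += missing_required(subschema, value[name], f"{path}{name}.")
    return missing


# Trimmed copy of the schema above (descriptions omitted for brevity).
SCHEMA = {
    "type": "object",
    "properties": {
        "incident": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"},
                        "state": {"type": "string"},
                        "road": {"type": "string"},
                    },
                    "required": ["city", "state", "road"],
                },
                "description": {"type": "string"},
            },
        }
    },
    "required": ["incident"],
}

# Arguments the model might produce for "it happened on Route 9 in Phoenix":
# the state was never spoken, so it is absent.
spoken = {"incident": {"location": {"city": "Phoenix", "road": "Route 9"}}}
print(missing_required(SCHEMA, spoken))  # ['incident.location.state']
```

Returning the missing paths in the tool result (for example under an "error" key) would give the model something specific to ask the caller about; that convention is an assumption, not something the source prescribes.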

SigV4 helper for AWS API calls:

from urllib.parse import urlparse

import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

def get_sigv4_headers(url, method, region, body=""):
    """Return SigV4-signed headers for an API Gateway (execute-api) request."""
    # Credentials come from the default chain (the runtime's IAM role when deployed)
    session = boto3.Session()
    credentials = session.get_credentials()
    request = AWSRequest(method=method, url=url, data=body,
                         headers={"Content-Type": "application/json",
                                  "Host": urlparse(url).netloc})
    SigV4Auth(credentials, "execute-api", region).add_auth(request)
    return dict(request.headers)
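One detail worth noting when using a helper like this: SigV4 signs a SHA-256 hash of the request body, so the string passed as body must be byte-for-byte the string that is sent. The sketch below shows why serializing once matters; serialize_body and payload_hash are hypothetical helpers, and the payload is made up for illustration.

```python
import hashlib
import json


def serialize_body(payload: dict) -> str:
    """Serialize once; reuse this exact string for signing and sending."""
    return json.dumps(payload, separators=(",", ":"), sort_keys=True)


def payload_hash(body: str) -> str:
    """SHA-256 hex digest of the body, the value SigV4 folds into the signature."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


payload = {"incident": {"description": "rear-ended at a stop light"}}
body = serialize_body(payload)

# Re-serializing with different whitespace changes the bytes, so a signature
# computed over `body` would not match a request that sends `reserialized`.
reserialized = json.dumps(payload, indent=2)
print(payload_hash(body) == payload_hash(reserialized))  # False
```

In practice this means building the body string first, passing it to the signing helper, and handing the same variable to the HTTP client rather than letting the client re-encode the dictionary.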

Reference: lib/services/voice-fnol-agent/app/tools/

Step 3: Wire the WebSocket Handler

The BedrockAgentCoreApp class from the bedrock_agentcore package handles the WebSocket lifecycle. The handler wires the agent to the connection.

import logging

from bedrock_agentcore import BedrockAgentCoreApp, RequestContext
from starlette.websockets import WebSocketDisconnect

from app.agent import get_agent

logger = logging.getLogger(__name__)

app = BedrockAgentCoreApp()

# Agent is a singleton: initialized once, reused across sessions
agent = get_agent()

@app.websocket
async def websocket_handler(websocket, context: RequestContext):
    # Extract custom headers (AgentCore lowercases them)
    cognito_identity_id = context.request_headers.get(
        'x-amzn-bedrock-agentcore-runtime-custom-cognitoidentityid')

    await websocket.accept()
    try:
        await agent.run(
            inputs=[websocket.receive_json],
            outputs=[websocket.send_json],
            invocation_state={'cognito_identity_id': cognito_identity_id}
        )
    except WebSocketDisconnect as e:
        # Code 1000 is a normal closure; anything else is worth logging
        if getattr(e, 'code', None) != 1000:
            logger.warning(f"Unexpected disconnect: {e}")
    finally:
        await agent.stop()

Key points:

inputs and outputs accept any async callable — Strands wires them into bidirectional streaming
invocation_state flows to every @tool(context=True) decorated tool
Always call agent.stop() in finally to clean up resources
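Because the handler depends on the exact lowercased header name, a small lookup helper can make that dependency explicit and tolerate a missing header map. This is a minimal sketch; get_custom_header is a hypothetical helper, and the assumption that the header mapping may be None or differently cased is mine, not the source's.

```python
CUSTOM_PREFIX = "x-amzn-bedrock-agentcore-runtime-custom-"


def get_custom_header(headers, name: str, default=None):
    """Look up a custom header by its short name, ignoring case."""
    wanted = CUSTOM_PREFIX + name.lower()
    for key, value in (headers or {}).items():
        if key.lower() == wanted:
            return value
    return default


# Works whether the runtime delivers the name lowercased or in its original casing.
headers = {"x-amzn-bedrock-agentcore-runtime-custom-cognitoidentityid": "us-east-1:abc"}
print(get_custom_header(headers, "CognitoIdentityId"))  # us-east-1:abc
```

A handler could use the default value to detect a missing identity early and close the connection before starting the agent, rather than failing later inside a tool.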

Reference: lib/services/voice-fnol-agent/app/app_agentcore.py

Step 4: CDK Infrastructure

Two CDK resources define the agent deployment: CfnRuntime (the agent) and CfnRuntimeEndpoint (the WebSocket endpoint).

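import * as path from "path";
import * as iam from "aws-cdk-lib/aws-iam";
import * as bedrockagentcore from "aws-cdk-lib/aws-bedrockagentcore";
import * as ecr_assets from "aws-cdk-lib/aws-ecr-assets";

// `account` and `region` are assumed to come from the enclosing stack (for example this.account and this.region)
// IAM role: trust principal MUST be bedrock-agentcore.amazonaws.com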
const agentRole = new iam.Role(this, "AgentRole", {
  assumedBy: new iam.ServicePrincipal("bedrock-agentcore.amazonaws.com", {
    conditions: {
      StringEquals: { "aws:SourceAccount": account },
      ArnLike: { "aws:SourceArn": `arn:aws:bedrock-agentcore:${region}:${account}:*` }
    }
  })
});

// Docker image — AgentCore REQUIRES ARM64
const dockerImage = new ecr_assets.DockerImageAsset(this, "AgentImage", {
  directory: path.join(__dirname, "../"),
  platform: ecr_assets.Platform.LINUX_ARM64,
});

// AgentCore Runtime
const agentRuntime = new bedrockagentcore.CfnRuntime(this, "AgentRuntime", {
  agentRuntimeName: "my_voice_agent",
  roleArn: agentRole.roleArn,
  networkConfiguration: { networkMode: "PUBLIC" },
  agentRuntimeArtifact: {
    containerConfiguration: { containerUri: dockerImage.imageUri }
  },
  lifecycleConfiguration: {
    idleRuntimeSessionTimeout: 300,  // 5 min idle timeout
    maxLifetime: 3600                // 1 hour max
  },
  requestHeaderConfiguration: {
    requestHeaderAllowlist: [
      "X-Amzn-Bedrock-AgentCore-Runtime-Custom-CognitoIdentityId"
    ]
  }
});

// AgentCore Runtime Endpoint
const agentEndpoint = new bedrockagentcore.CfnRuntimeEndpoint(this, "AgentEndpoint", {
  name: "my_voice_agent_endpoint",
  agentRuntimeId: agentRuntime.ref,
});
agentEndpoint.addDependency(agentRuntime);

Key CDK notes:

Import is aws-cdk-lib/aws-bedrockagentcore (not aws-cdk-lib/aws-bedrock)
Trust principal is bedrock-agentcore.amazonaws.com (not bedrock.amazonaws.com)
requestHeaderAllowlist headers must be prefixed with X-Amzn-Bedrock-AgentCore-Runtime-Custom-
Without requestHeaderAllowlist, custom headers are stripped silently at the AgentCore boundary
Grant ecr:GetAuthorizationToken on * and ecr:BatchGetImage on the repository
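For reference, the ECR grants in the last note correspond to an IAM policy document shaped roughly like the one below, written here as a Python dictionary. ecr_pull_policy and the repository ARN are hypothetical; the sketch mirrors only the two actions named above, and image pulls may also need ecr:GetDownloadUrlForLayer depending on your setup.

```python
import json


def ecr_pull_policy(repository_arn: str) -> dict:
    """IAM policy document mirroring the two ECR grants listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Auth tokens are account-wide, so this action is granted on "*".
                "Effect": "Allow",
                "Action": "ecr:GetAuthorizationToken",
                "Resource": "*",
            },
            {
                # Image reads are scoped to the single repository.
                "Effect": "Allow",
                "Action": "ecr:BatchGetImage",
                "Resource": repository_arn,
            },
        ],
    }


# Hypothetical repository ARN for illustration only.
policy = ecr_pull_policy("arn:aws:ecr:us-east-1:123456789012:repository/my-voice-agent")
print(json.dumps(policy, indent=2))
```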

Reference: lib/services/voice-fnol-agent/infra/voice-fnol-service.ts

Step 5: Frontend Audio

The frontend opens a SigV4-presigned WebSocket connection and streams PCM audio bidirectionally.

SigV4 presigned WebSocket URL:

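import { SignatureV4 } from "@smithy/signature-v4";
import { Sha256 } from "@aws-crypto/sha256-js";

// `credentials` and `request` are assumed to be prepared earlier:
// temporary AWS credentials and an HTTP request object for the runtime endpoint
const signer = new SignatureV4({
  credentials: {
    accessKeyId: credentials.accessKeyId,
    secretAccessKey: credentials.secretAccessKey,
    sessionToken: credentials.sessionToken
  },
  region: "us-east-1",
  service: "bedrock-agentcore",   // NOT "bedrock"
  sha256: Sha256,
});
const signedRequest = await signer.presign(request, { expiresIn: 300 });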

Audio capture (16kHz PCM in):

Use navigator.mediaDevices.getUserMedia() with sampleRate: 16000, channelCount: 1
Enable echoCancellation and noiseSuppression
Convert Float32Array to Int16Array (PCM16) before sending
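The Float32-to-PCM16 conversion in the last step is simple arithmetic: clamp each sample to [-1.0, 1.0], then scale to the signed 16-bit range. The frontend does this in JavaScript; the sketch below shows the same arithmetic in Python so it can be checked in isolation. float_to_pcm16 is a hypothetical name, and little-endian byte order is assumed.

```python
import array
import sys


def float_to_pcm16(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    pcm = array.array("h")
    for sample in samples:
        clamped = max(-1.0, min(1.0, sample))
        # Asymmetric scaling because int16 spans -32768..32767.
        scale = 0x8000 if clamped < 0 else 0x7FFF
        pcm.append(int(clamped * scale))
    if sys.byteorder == "big":
        pcm.byteswap()  # force little-endian output on big-endian hosts
    return pcm.tobytes()


# 100 ms of silence at 16 kHz mono is 1600 samples, or 3200 bytes.
print(len(float_to_pcm16([0.0] * 1600)))  # 3200
```

Clamping before scaling matters because microphone processing can occasionally produce samples slightly outside the nominal range, which would otherwise overflow the 16-bit integer.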

Audio playback (24kHz PCM out):

Create AudioContext at 24kHz
Schedule chunks at nextPlayTime to prevent gaps — do not call source.start() without a time parameter
Track activeSources array for barge-in cancellation

Barge-in handling:

Listen for bidi_interruption message type on the WebSocket
On interruption: stop all active audio sources, clear the queue, reset nextPlayTime to audioContext.currentTime
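The playback and barge-in bookkeeping above can be modeled without any audio API. The sketch below is a toy model in Python of the nextPlayTime logic, not the frontend's actual code; PlaybackScheduler and its method names are hypothetical, and chunks are assumed to be 16-bit mono PCM at 24 kHz.

```python
class PlaybackScheduler:
    """Toy model of the nextPlayTime bookkeeping described above."""

    def __init__(self, sample_rate: int = 24000):
        self.sample_rate = sample_rate
        self.next_play_time = 0.0
        self.active_sources = []  # (start_time, duration) of queued chunks

    def schedule(self, pcm_bytes: bytes, now: float) -> float:
        """Queue a 16-bit mono chunk and return its start time in seconds."""
        duration = len(pcm_bytes) / 2 / self.sample_rate
        # Never schedule in the past; otherwise start right after the previous chunk.
        start = max(self.next_play_time, now)
        self.active_sources.append((start, duration))
        self.next_play_time = start + duration
        return start

    def interrupt(self, now: float) -> None:
        """Barge-in: drop everything queued and restart the clock at `now`."""
        self.active_sources.clear()
        self.next_play_time = now
```

With 4800-byte chunks (0.1 s each), a second chunk arriving at t=0.01 is scheduled for t=0.1 rather than played immediately, which is what prevents overlap and gaps. After interrupt(now), the next chunk starts at the current time instead of queuing behind audio that was discarded.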

Reference: react-claims/src/utils.js and react-claims/src/components/

Common Pitfalls

1. Missing inputSchema on tools

Symptom: Nova Sonic fails to call the tool or sends malformed parameters.
Cause: Text-based LLMs infer parameter structure from docstrings; a speech model cannot.
Fix: Define explicit inputSchema with types and descriptions on every tool that accepts structured parameters.

2. Creating agent per WebSocket connection

Symptom: High latency on every new connection, excessive memory usage.
Cause: Model loading and tool registration happen inside the handler instead of at startup.
Fix: Initialize the agent once at module level (agent = create_agent()), reuse across sessions.
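Step 1 defines create_agent() while the handler in Step 3 imports get_agent(). One plausible way to connect the two is a cached accessor, sketched below. This is an assumption about how such a wrapper could look, not the repository's code, and create_agent here is a placeholder that returns a dummy object.

```python
from functools import lru_cache


def create_agent():
    """Placeholder for the real factory from Step 1 (expensive setup)."""
    return object()


@lru_cache(maxsize=1)
def get_agent():
    """Build the agent on first use and return the same instance afterwards."""
    return create_agent()


# Every caller receives the same instance, so setup cost is paid once.
print(get_agent() is get_agent())  # True
```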

3. Audio playback gaps

Symptom: Choppy, stuttering audio output.
Cause: Calling source.start() without scheduling, so each chunk plays immediately instead of after the previous one finishes.
Fix: Track nextPlayTime and schedule each chunk: source.start(nextPlayTime); nextPlayTime += buffer.duration;

4. Ignoring bidi_interruption events

Symptom: Agent audio continues playing after the customer starts speaking.
Cause: Frontend does not listen for bidi_interruption messages from Nova Sonic.
Fix: On interruption, stop all active audio sources, clear the queue, and reset nextPlayTime.

5. Wrong IAM trust principal

Symptom: AgentCore fails to assume the IAM role; Runtime creation fails.
Cause: Trust policy uses bedrock.amazonaws.com instead of bedrock-agentcore.amazonaws.com.
Fix: Set the service principal to bedrock-agentcore.amazonaws.com with SourceAccount and SourceArn conditions.

6. Custom headers stripped silently

Symptom: tool_context.invocation_state has no user identity; get_customer_info fails.
Cause: The header is not listed in requestHeaderAllowlist on the CfnRuntime resource.
Fix: Add the header name to requestHeaderAllowlist. Headers must be prefixed with X-Amzn-Bedrock-AgentCore-Runtime-Custom-.

7. Missing invocation_state in agent.run()

Symptom: @tool(context=True) tools receive empty context, cannot access user identity.
Cause: agent.run() is called without the invocation_state parameter.
Fix: Pass invocation_state={"key": value} to agent.run().

Adapting This Pattern

To build a voice agent for a different domain:

Keep the skeleton. The BidiAgent → BedrockAgentCoreApp → CfnRuntime → CfnRuntimeEndpoint pattern is domain-independent.
Replace the tools. Remove get_customer_info, submit_to_fnol_api, etc. Add tools for your domain — each should call one API endpoint and return a dictionary.
Rewrite the system prompt. Describe your agent's persona, conversation flow, and safety considerations. Keep it conversational (3-5 sentences for the core persona, then specific guidance).
Define inputSchemas. For every tool that accepts structured data, write an explicit JSON schema matching your API contract.
Update CDK environment variables. Replace FNOL_API_ENDPOINT and CUSTOMER_API_ENDPOINT with your API endpoints. Update IAM policies to grant execute-api:Invoke on your specific API Gateway resources.
Adjust frontend. Update the WebSocket connection URL and any custom headers. The audio capture/playback code is reusable as-is.

The voice infrastructure (AgentCore, Nova Sonic, SigV4 auth, audio streaming) stays identical. Only the tools, system prompt, and API endpoints change.
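For the IAM update mentioned in the steps above, execute-api:Invoke is granted on resource ARNs that name the API, stage, HTTP method, and path. The helper below is a small sketch of that ARN format; execute_api_arn and all identifiers in the usage line are hypothetical placeholders.

```python
def execute_api_arn(region, account_id, api_id, stage, method, resource_path):
    """Build the resource ARN used with the execute-api:Invoke action."""
    path = resource_path.lstrip("/")
    return f"arn:aws:execute-api:{region}:{account_id}:{api_id}/{stage}/{method}/{path}"


# Placeholder values for an appointment-scheduling API.
print(execute_api_arn("us-east-1", "123456789012", "abc123", "prod", "POST", "/appointments"))
# arn:aws:execute-api:us-east-1:123456789012:abc123/prod/POST/appointments
```

Scoping the grant to one method and path per tool keeps the agent's role aligned with the tool-based design: each tool can reach only the endpoint it wraps.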

Using This File With Your Coding Assistant

Claude Code

Reference this file directly in your prompt, or add to your project's .claude/ instructions:

# In your conversation
@agent-skills/VOICE_AGENT_SKILL.md build me a voice agent for appointment scheduling

Cursor

Use @agent-skills/VOICE_AGENT_SKILL.md in chat, or copy the file into .cursor/rules/ for automatic inclusion.

Kiro

Copy to .kiro/steering/voice-agent.md and add frontmatter:

---
inclusion: auto
description: Voice AI agent development guide
tags: [voice-ai, nova-sonic, strands, agentcore]
---

Cline

Use @file reference in chat: @agent-skills/VOICE_AGENT_SKILL.md. Or add to .clinerules for automatic context.

Windsurf

Open as a tab and @-reference in Cascade. Or add to Windsurf Rules for automatic inclusion.

Generic / Manual

Paste the contents into your assistant's context window, or point it at the file path if it supports file reading.
