Voice Agents at Scale: ElevenLabs + Twilio + Claude
The latency budget
Human conversation tolerates a pause of roughly 600‑800 ms before the listener assumes the speaker has finished. In a phone‑based AI agent that window shrinks: the caller expects the system to feel present, not robotic. Our target is ≤800 ms from the moment the caller’s voice hits the microphone to the first audible response; anything slower feels disjointed and drives abandonment.
Breaking the budget into stages makes the problem tractable. In our production runs the following targets hold:
- Streaming STT first partial: ~150 ms after the first phoneme. We use Deepgram or ElevenLabs speech‑to‑text in streaming mode; both emit a partial transcript as soon as the acoustic model stabilises.
- LLM first token: 250‑400 ms. Claude‑3.5 Sonnet, accessed with `stream=true`, begins streaming tokens once the request reaches the inference server. A warm HTTP/2 connection and keep‑alive sockets shave ~50 ms.
- TTS first audio chunk: 150‑250 ms. ElevenLabs’ `stream=true` endpoint returns the first 200 ms of audio within this window; Cartesia can be a drop‑in fallback with comparable latency.
- Network jitter buffer: 50‑100 ms. Twilio’s Media Streams add a small buffer to smooth packet loss; we tune the buffer size dynamically based on observed RTT.
The critical insight: the pipeline must be fully streaming. Any synchronous, blocking call—e.g., a “batch” STT request that waits for the utterance to end—adds hundreds of milliseconds and busts the budget.
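To keep ourselves honest, the budget lives in code as a back‑of‑the‑envelope check. A minimal sketch, using the midpoints of the ranges above (the stage names are illustrative, not a real config schema):

```python
# Back-of-the-envelope check of the stage budget quoted above.
# Values are midpoints of the quoted ranges; replace with your own measurements.
STAGE_BUDGET_MS = {
    "stt_first_partial": 150,
    "llm_first_token": 325,   # midpoint of 250-400 ms
    "tts_first_chunk": 200,   # midpoint of 150-250 ms
    "jitter_buffer": 75,      # midpoint of 50-100 ms
}

TARGET_MS = 800

total = sum(STAGE_BUDGET_MS.values())
print(f"worst-case first-audio latency: {total} ms (target {TARGET_MS} ms)")
# One blocking stage (e.g., batch STT waiting for end of utterance) adds
# hundreds of ms and trips this assertion.
assert total <= TARGET_MS, "pipeline busts the latency budget"
```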
The end‑to‑end pipeline
- Twilio Media Stream over WebSocket – base64‑encoded 8 kHz μ‑law frames flow from the caller to our edge service, where we decode them to 16‑bit PCM. We keep the WebSocket alive for the call duration to avoid reconnection overhead.
- Streaming speech‑to‑text – each frame is forwarded to the STT provider. Partial transcripts are emitted every ~100 ms, allowing downstream components to act before the user finishes speaking.
- Endpoint detection – a lightweight voice‑activity detector (WebRTC VAD) flags silence. We augment it with a tiny classifier (“are they still talking?”) that looks at prosody to decide whether to wait for a continuation or to cut to the LLM; a sketch of the silence‑counting half of this logic follows the list.
- LLM inference (streamed) – as soon as the VAD signals a pause, we send the buffered transcript to Claude‑3.5 Sonnet with `stream=true`. Tokens arrive in 20‑40 ms bursts; the first token triggers TTS.
- Streaming text‑to‑speech – the first token is handed to ElevenLabs TTS. The service immediately returns an audio chunk that we pipe back over the Twilio Media Stream WebSocket, overlapping with the rest of the LLM response.
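A minimal sketch of the endpointing in step 3, using the `webrtcvad` package; the prosody classifier is out of scope here, so a plain silence‑run counter stands in for it, and the frame size and 400 ms threshold are illustrative assumptions:

```python
import webrtcvad

SAMPLE_RATE = 16000          # webrtcvad accepts 8/16/32/48 kHz, 16-bit mono PCM
FRAME_MS = 20                # frames must be 10, 20, or 30 ms long
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2
END_OF_TURN_MS = 400         # assumed silence run before we cut to the LLM

class Endpointer:
    """Flags end-of-utterance after an unbroken run of silent frames."""

    def __init__(self, aggressiveness: int = 2):
        self.vad = webrtcvad.Vad(aggressiveness)  # 0 = lenient, 3 = aggressive
        self.silence_ms = 0

    def push_frame(self, frame: bytes) -> bool:
        """Feed one audio frame; returns True when the caller has plausibly finished."""
        assert len(frame) == FRAME_BYTES
        if self.vad.is_speech(frame, SAMPLE_RATE):
            self.silence_ms = 0
            return False
        self.silence_ms += FRAME_MS
        return self.silence_ms >= END_OF_TURN_MS
```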
Each stage runs in its own async task, coordinated by a small state machine that tracks “listening”, “thinking”, and “speaking”. The state machine guarantees that a later stage never blocks an earlier one.
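A sketch of that coordinator, assuming a hypothetical `run_llm_and_tts` pipeline coroutine; the essential property is that every transition is an event, so no stage ever awaits a later one:

```python
import asyncio
from enum import Enum, auto

class State(Enum):
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()

class CallSession:
    """Per-call state; later stages signal transitions instead of blocking earlier ones."""

    def __init__(self):
        self.state = State.LISTENING
        self.transcript: list[str] = []
        self.speaking_task: asyncio.Task | None = None

    async def on_partial_transcript(self, text: str):
        # STT partials accumulate while listening; nothing downstream blocks this.
        self.transcript.append(text)

    async def on_end_of_turn(self, run_llm_and_tts):
        # VAD fired: hand the buffered transcript to the LLM/TTS pipeline as one task.
        self.state = State.THINKING
        prompt = " ".join(self.transcript)
        self.transcript.clear()
        self.speaking_task = asyncio.create_task(run_llm_and_tts(prompt, self))

    def on_first_audio_chunk(self):
        # First TTS chunk is on the wire; the caller now hears us.
        self.state = State.SPEAKING

    def on_barge_in(self):
        # Caller spoke over us: cancel playback and fall back to listening.
        if self.state is State.SPEAKING and self.speaking_task:
            self.speaking_task.cancel()
        self.state = State.LISTENING
```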
Barge‑in handling
Interruptibility is non‑negotiable. If the caller interjects while the agent is speaking, the system must cease playback within 200 ms to avoid a jarring overlap.
- We run VAD on the inbound audio stream continuously, even while TTS is playing.
- When VAD flips to “speech” and the current state is “speaking”, we send a `clear` control frame to Twilio, which immediately flushes the buffered outbound audio.
- Simultaneously we abort the in‑flight TTS request (most providers expose a cancellation mechanism) and discard any LLM tokens that arrived after the interruption point.
- The buffered partial transcript up to the interruption is fed back into the LLM as context, allowing the model to acknowledge the interjection on the next turn.
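A sketch of the cancellation path, assuming the Media Stream WebSocket (`ws`) and the current speaking task are in scope; `clear` is the message Twilio’s bidirectional Media Streams accept for flushing queued audio:

```python
import asyncio
import json

async def handle_barge_in(ws, stream_sid: str, speaking_task: asyncio.Task | None):
    """Kill playback within the ~200 ms budget when VAD detects inbound speech."""
    # 1. Tell Twilio to drop any audio already queued for playback on this stream.
    await ws.send(json.dumps({"event": "clear", "streamSid": stream_sid}))

    # 2. Cancel the in-flight LLM + TTS task. Cancellation closes the provider
    #    streams; tokens produced after this point are simply dropped.
    if speaking_task is not None and not speaking_task.done():
        speaking_task.cancel()
        try:
            await speaking_task
        except asyncio.CancelledError:
            pass  # expected: the task was aborted mid-stream

    # 3. The partial transcript up to the interruption stays in the session
    #    buffer, so the next LLM turn can acknowledge the interjection.
```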
Fallback paths and resiliency
Production telephony cannot afford a single point of failure. We built three orthogonal fallback layers:
- LLM latency guard – if the first token does not arrive within 2 s, we inject a pre‑recorded “One moment while I look that up” audio snippet. The snippet buys us time while the model continues processing.
- LLM retry & graceful handoff – after three consecutive LLM errors (timeout, 5xx, or malformed JSON), we terminate the AI flow and route the call to a voicemail box with a friendly “Sorry, I’m having trouble. Please leave a message.”
- TTS redundancy – we wrap ElevenLabs and Cartesia behind a thin abstraction. If ElevenLabs returns a 429 or experiences a latency spike, we instantly switch to Cartesia without breaking the stream.
All fallbacks are triggered by metrics collected in real time, not by hard‑coded timers, which keeps the system adaptive to load spikes.
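A sketch of the TTS abstraction, assuming `primary` and `fallback` are streaming client functions wrapping ElevenLabs and Cartesia respectively; the 500 ms first‑chunk deadline is an illustrative stand‑in for the real‑time spike metric:

```python
import asyncio
from typing import AsyncIterator, Callable

AudioStream = AsyncIterator[bytes]
TtsFn = Callable[[str], AudioStream]

FIRST_CHUNK_DEADLINE_S = 0.5  # assumed threshold for a "latency spike"

async def tts_with_failover(text: str, primary: TtsFn, fallback: TtsFn) -> AudioStream:
    """Stream audio from the primary TTS, failing over before the first chunk."""
    stream = primary(text)
    try:
        # Only the first chunk is guarded: failing over mid-utterance would
        # replay text the caller already heard.
        first = await asyncio.wait_for(stream.__anext__(), FIRST_CHUNK_DEADLINE_S)
    except (StopAsyncIteration, asyncio.TimeoutError, ConnectionError):
        # 429s and spikes surface as exceptions from your client; widen this
        # tuple to match the SDK you actually use.
        stream = fallback(text)
        first = await stream.__anext__()
    yield first
    async for chunk in stream:
        yield chunk
```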
What the field taught us
- Token granularity matters. A single word like “yes” renders in <100 ms, but a bulleted list can exceed 1 s. Prompt engineering that front‑loads confirmation (“Sure, I can help with…”) reduces perceived latency.
- Background noise is the silent killer. Even with state‑of‑the‑art STT, 30 % of calls in a call‑center environment contain kitchen or traffic noise that drops word‑error‑rate (WER) above 20 %. We mitigate by adding a short “Did I hear that right?” loop after any low‑confidence segment (confidence <0.75).
- Pronouncing phone numbers. Users expect the agent to read digits naturally (“eight‑five‑zero, six‑eight‑seven”). Training the LLM on a synthetic dataset of spoken digit sequences improves fluency and reduces mis‑recognition.
- Cold‑start latency. The first request to any external API incurs TLS handshake and cold‑start cost. We keep a pool of warm connections to Twilio, Deepgram, and Claude, refreshed every 5 minutes, shaving ~120 ms off the budget.
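One way to keep those connections warm, sketched with `httpx` (HTTP/2 support needs the `httpx[http2]` extra; the ping targets are placeholders for whatever cheap endpoint each provider tolerates):

```python
import asyncio
import httpx

# Placeholder targets; any cheap request that exercises the TLS session works.
WARM_TARGETS = [
    "https://api.deepgram.com",
    "https://api.anthropic.com",
    "https://api.twilio.com",
]

client = httpx.AsyncClient(http2=True, timeout=5.0)  # keep-alive pooling is the default

async def keep_warm(interval_s: float = 300.0):
    """Touch each provider every 5 minutes so call-time requests skip the handshake."""
    while True:
        for url in WARM_TARGETS:
            try:
                await client.get(url)  # status code is irrelevant; the handshake is the point
            except httpx.HTTPError:
                pass  # a failed ping just means the next real request pays the handshake
        await asyncio.sleep(interval_s)
```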
Cost profile per call
Understanding the cost curve is essential for budgeting at scale. Below is the per‑minute cost breakdown observed in our March‑2026 production run for a 3‑minute support call using Claude‑3.5 Sonnet (≈ $0.002 per 1 k tokens) and ElevenLabs TTS (≈ $0.01 per 1 k characters).
- Twilio Voice + Media Stream: $0.004 per minute (charges for inbound/outbound minutes plus media stream bandwidth).
- Streaming STT: $0.001 per minute on Deepgram’s streaming tier.
- LLM inference: Roughly $0.015 per minute for a small‑model call; this dominates the bill.
- TTS: $0.006 per minute, assuming an average of 150 characters spoken per minute.
Summing the line items gives ≈ $0.026 per minute, or about $0.08 for a typical 3‑minute interaction; with retries, filler audio, and telephony overhead the realistic envelope is $0.10‑$0.30 per call. Scaling to 10 k calls per month puts the LLM line alone at roughly $450 ($0.015/min × 3 min × 10 k), which is still an order of magnitude cheaper than a human live agent.
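The arithmetic, spelled out:

```python
# Per-minute rates from the breakdown above (USD).
RATES = {
    "twilio": 0.004,
    "stt": 0.001,
    "llm": 0.015,
    "tts": 0.006,
}

CALL_MINUTES = 3
CALLS_PER_MONTH = 10_000

per_minute = sum(RATES.values())                              # $0.026
per_call = per_minute * CALL_MINUTES                          # ~$0.078 before overhead
llm_monthly = RATES["llm"] * CALL_MINUTES * CALLS_PER_MONTH   # $450

print(f"itemized cost per call: ${per_call:.3f}")
print(f"monthly LLM spend at 10k calls: ${llm_monthly:.0f}")
```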
Monitoring and observability
Real‑time metrics are the only way to keep the latency budget intact as traffic fluctuates.
- P50 / P95 first‑audio‑chunk latency: Alert if P95 exceeds 1 s; historically we saw a 3 % jump in hang‑ups when crossing that threshold.
- Barge‑in success rate: Must stay >98 %; a dip indicates VAD mis‑configuration or network jitter.
- Sampled transcript audit: Randomly select 5‑10 calls per day, run the transcript through a quality‑check script that flags confidence <0.7, repeated filler, or hallucinations.
- Hang‑up distribution by call length: Sudden spikes in sub‑30‑second hang‑ups usually trace back to a malformed prompt or a failed fallback path.
- Resource utilisation: Monitor GPU/CPU usage on the inference node; throttling leads to LLM latency spikes, which we catch via the “LLM latency guard” metric.
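A minimal sliding‑window check for the first metric; the 500‑call window and the plain print alert are placeholders for a real metrics pipeline:

```python
from collections import deque
from statistics import quantiles

WINDOW = 500           # assumed: evaluate over the last 500 calls
P95_ALERT_MS = 1000    # the 1 s threshold from the list above

first_audio_ms: deque[float] = deque(maxlen=WINDOW)

def record_first_audio(latency_ms: float) -> None:
    """Record one call's first-audio-chunk latency and alert on a P95 breach."""
    first_audio_ms.append(latency_ms)
    if len(first_audio_ms) >= 100:  # wait for a meaningful sample size
        p95 = quantiles(first_audio_ms, n=20)[-1]  # 19th of 20 cut points = P95
        if p95 > P95_ALERT_MS:
            alert(f"first-audio P95 {p95:.0f} ms exceeds {P95_ALERT_MS} ms")

def alert(message: str) -> None:
    print("ALERT:", message)  # stand-in for your pager/Slack hook
```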
Choosing providers for production
The demo‑centric ecosystem often highlights the flashiest model, but reliability, SLA, and streaming support win in production.
- STT: Deepgram and AssemblyAI both offer sub‑150 ms first‑partial latency with per‑minute pricing under a cent. Whisper‑based self‑hosted stacks are cheaper at scale but add operational overhead.
- LLM: Claude‑3.5 Sonnet provides a good balance of cost and latency for phone‑length interactions. When latency becomes critical, Anthropic’s `claude-instant` variant can shave ~80 ms at the cost of reduced context length.
- TTS: ElevenLabs excels at expressive voice and true streaming; Cartesia is a solid backup with comparable latency but a narrower voice catalog.
- Telephony: Twilio Media Streams are the de‑facto standard because they expose raw PCM over WebSocket, essential for low‑level VAD and custom jitter handling.
Scaling considerations
When moving from a handful of pilot calls to thousands per day, two architectural dimensions dominate:
- Horizontal scaling of the media gateway – Deploy the Twilio‑WebSocket bridge on a Kubernetes cluster behind a load balancer that preserves sticky sessions per call ID. Autoscale based on inbound WebSocket connections (target ≤ 200 ms connection‑setup latency).
- Inference throughput – Run LLM inference on GPU‑enabled nodes (e.g., NVIDIA A100) using a serverless framework like Modal or RunPod. Keep a warm pool of containers to avoid cold start; measure “tokens per second per GPU” and provision enough GPUs to keep the queue latency under 100 ms.
We also cache short‑term LLM responses for identical queries (e.g., “What are your business hours?”) using a 30‑second TTL. Cache hits cut the LLM latency to near‑zero and reduce cost by ~15 % in our call‑center test.
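The cache itself is small; a sketch with the 30‑second TTL described above (the whitespace‑normalising key is an assumption, not our production matcher):

```python
import time

TTL_S = 30.0
_cache: dict[str, tuple[float, str]] = {}

def _key(query: str) -> str:
    # Crude normalization so "what are your hours?" variants collide; an
    # assumption for illustration, not the production matcher.
    return " ".join(query.lower().split())

def cache_get(query: str) -> str | None:
    """Return a cached LLM response if the same query hit within the TTL."""
    hit = _cache.get(_key(query))
    if hit and time.monotonic() - hit[0] < TTL_S:
        return hit[1]
    return None

def cache_put(query: str, response: str) -> None:
    _cache[_key(query)] = (time.monotonic(), response)
```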
Future‑proofing the stack
Voice AI is evolving rapidly. To avoid lock‑in, we abstract each component behind a thin interface (`stt()`, `llm()`, `tts()`) that normalises streaming semantics. When a new provider advertises a 30 ms improvement in first‑partial latency, swapping it in requires only a one‑line change in the adapter; a sketch of the adapter layer follows.
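One way to express those interfaces, using `typing.Protocol`; the method signatures are illustrative, and the point is that every adapter speaks async iterators on both sides:

```python
from typing import AsyncIterator, Protocol

class SttAdapter(Protocol):
    def stt(self, audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
        """Audio frames in, partial transcripts out."""
        ...

class LlmAdapter(Protocol):
    def llm(self, prompt: str) -> AsyncIterator[str]:
        """Prompt in, streamed tokens out."""
        ...

class TtsAdapter(Protocol):
    def tts(self, tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
        """Streamed text in, audio chunks out."""
        ...

# A new provider is one new class satisfying one protocol; pipeline code
# types against the adapters and never imports a vendor SDK directly.
```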
We also monitor emerging standards such as WebRTC audio tracks for direct peer‑to‑peer media, which could eliminate the Twilio middle‑man and shave another 50‑100 ms. Until those standards mature, Twilio remains the pragmatic choice for carrier‑grade reliability.
Building a production‑grade voice agent is less about flashy demos and more about disciplined engineering: streaming every stage, treating barge‑in as a first‑class requirement, layering fallbacks, and instrumenting latency end to end. The result is a system that feels instantaneous to the caller and stays within a predictable cost envelope.