
Voice Agents at Scale: ElevenLabs + Twilio + Claude

By OrionAI Build Editorial · Published 2026-05-10 · // build

The latency budget

Human conversation tolerates a pause of roughly 600‑800 ms before the listener assumes the speaker has finished. In a phone‑based AI agent that window shrinks: the caller expects the system to feel present, not robotic. Our target is ≤800 ms from the moment the caller’s voice hits the microphone to the first audible response. Anything slower feels disjointed and drives abandonment.

Breaking the budget into stages makes the problem tractable: in our production runs, each stage of the pipeline below carries its own latency target, and the stage budgets have to sum to less than the 800 ms total.

The critical insight: the pipeline must be fully streaming. Any synchronous, blocking call—e.g., a “batch” STT request that waits for the utterance to end—adds hundreds of milliseconds and busts the budget.

The end‑to‑end pipeline

  1. Twilio Media Stream over WebSocket – base64‑encoded 8 kHz μ-law audio frames flow from the caller to our edge service. We keep the WebSocket alive for the call duration to avoid reconnection overhead.
  2. Streaming speech‑to‑text – each frame is forwarded to the STT provider. Partial transcripts are emitted every ~100 ms, allowing downstream components to act before the user finishes speaking.
  3. Endpoint detection – a lightweight voice‑activity detector (WebRTC VAD) flags silence. We augment it with a tiny classifier (“are they still talking?”) that looks at prosody to decide whether to wait for a continuation or to cut to the LLM.
  4. LLM inference (streamed) – as soon as the VAD signals a pause, we send the buffered transcript to Claude 3.5 Sonnet with stream=true. Tokens arrive in 20‑40 ms bursts; the first token triggers TTS (stages 4‑5 are sketched just after this list).
  5. Streaming text‑to‑speech – the first token is handed to ElevenLabs TTS. The service immediately returns an audio chunk that we pipe back to the caller over the bidirectional Media Stream, overlapping playback with the rest of the LLM response.
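
To make stages 4 and 5 concrete, here is a minimal sketch, assuming the Anthropic Python SDK’s streaming interface, ElevenLabs’ HTTP streaming endpoint with the ulaw_8000 output format, and a live bidirectional Media Streams WebSocket. The websocket, stream_sid, and voice_id names are placeholders supplied by the call‑handling layer, the model alias is illustrative, and error handling and retries are omitted.

```python
# Sketch of stages 4-5: stream Claude tokens, flush sentence-sized chunks to
# ElevenLabs, and pipe the returned mu-law audio back over the Twilio Media
# Stream. `websocket`, `stream_sid`, and `voice_id` are placeholders.
import base64
import json
import os

import anthropic
import httpx

ELEVEN_STREAM_URL = ("https://api.elevenlabs.io/v1/text-to-speech/"
                     "{voice_id}/stream?output_format=ulaw_8000")

async def speak_reply(transcript: str, websocket, stream_sid: str, voice_id: str) -> None:
    claude = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
    buffer = ""
    async with claude.messages.stream(
        model="claude-3-5-sonnet-latest",
        max_tokens=300,
        messages=[{"role": "user", "content": transcript}],
    ) as stream:
        async for token in stream.text_stream:
            buffer += token
            # Flush on sentence boundaries so TTS overlaps ongoing LLM generation.
            if buffer.endswith((".", "?", "!")) and len(buffer) > 40:
                await _synthesize_and_send(buffer, websocket, stream_sid, voice_id)
                buffer = ""
    if buffer:
        await _synthesize_and_send(buffer, websocket, stream_sid, voice_id)

async def _synthesize_and_send(text: str, websocket, stream_sid: str, voice_id: str) -> None:
    async with httpx.AsyncClient(timeout=30) as http:
        async with http.stream(
            "POST",
            ELEVEN_STREAM_URL.format(voice_id=voice_id),
            headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
            json={"text": text, "model_id": "eleven_turbo_v2"},
        ) as resp:
            async for chunk in resp.aiter_bytes():
                # Twilio expects base64-encoded mu-law frames in "media" messages.
                await websocket.send(json.dumps({
                    "event": "media",
                    "streamSid": stream_sid,
                    "media": {"payload": base64.b64encode(chunk).decode("ascii")},
                }))
```

Chunking on sentence boundaries keeps prosody natural at the cost of a slightly later first audio chunk; ElevenLabs also offers a WebSocket input‑streaming mode that can push tokens as they arrive if that trade‑off matters.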

Each stage runs in its own async task, coordinated by a small state machine that tracks “listening”, “thinking”, and “speaking”. The state machine guarantees that a later stage never blocks an earlier one.
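
A minimal sketch of that coordinator, built on the speak_reply helper above; a production version would also handle hang‑ups, timeouts, and per‑call cleanup.

```python
# Minimal call-level state machine: "listening" -> "thinking" -> "speaking".
# The endpoint callback hands the buffered transcript to an LLM task without
# ever blocking the audio-receiving loop.
import asyncio
from enum import Enum, auto

class State(Enum):
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()

class CallSession:
    def __init__(self, websocket, stream_sid: str, voice_id: str):
        self.websocket, self.stream_sid, self.voice_id = websocket, stream_sid, voice_id
        self.state = State.LISTENING
        self.transcript_parts: list[str] = []
        self.speaking_task: asyncio.Task | None = None

    def on_partial_transcript(self, text: str) -> None:
        # Partials arrive every ~100 ms while the caller is talking.
        self.transcript_parts.append(text)

    def on_endpoint(self) -> None:
        # VAD + prosody classifier decided the caller is done: hand off to the LLM.
        if self.state is State.LISTENING and self.transcript_parts:
            self.state = State.THINKING
            utterance = " ".join(self.transcript_parts)
            self.transcript_parts.clear()
            self.speaking_task = asyncio.create_task(self._respond(utterance))

    async def _respond(self, utterance: str) -> None:
        try:
            # Simplification: strictly, SPEAKING starts when the first audio chunk ships.
            self.state = State.SPEAKING
            await speak_reply(utterance, self.websocket, self.stream_sid, self.voice_id)
        finally:
            self.state = State.LISTENING
```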

Barge‑in handling

Interruptibility is non‑negotiable. If the caller interjects while the agent is speaking, the system must cease playback within 200 ms to avoid a jarring overlap.
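
A sketch of the cut‑off path, assuming the CallSession above; the clear event is the bidirectional Media Streams message that tells Twilio to discard audio it has already buffered for playback.

```python
# Barge-in: if the caller starts speaking while the agent is mid-utterance,
# cancel the LLM/TTS task and tell Twilio to drop any queued outbound audio.
import json

async def on_caller_speech_detected(session: CallSession) -> None:
    if session.state is State.SPEAKING:
        if session.speaking_task and not session.speaking_task.done():
            session.speaking_task.cancel()              # stops speak_reply mid-stream
        await session.websocket.send(json.dumps({
            "event": "clear",                           # flush Twilio's playback buffer
            "streamSid": session.stream_sid,
        }))
        session.state = State.LISTENING                 # go back to capturing the caller
```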

Fallback paths and resiliency

Production telephony cannot afford a single point of failure. We built three orthogonal fallback layers.

All fallbacks are triggered by metrics collected in real time, not by hard‑coded timers, which keeps the system adaptive to load spikes.
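
As an illustration of metric‑driven triggering (the providers, window size, and thresholds here are placeholders, not our production values), a rolling‑p95 router can look roughly like this:

```python
# Illustrative metric-driven failover: route to the secondary provider when the
# primary's rolling p95 latency blows its budget.
import time
from collections import deque
from statistics import quantiles

class FailoverRouter:
    def __init__(self, primary, secondary, budget_ms: float, window: int = 50):
        self.primary, self.secondary = primary, secondary
        self.budget_ms = budget_ms
        self.samples: deque[float] = deque(maxlen=window)

    def _p95(self) -> float:
        if len(self.samples) < 10:
            return 0.0                                  # not enough data to judge
        return quantiles(self.samples, n=20)[-1]        # 95th percentile of the window

    async def call(self, *args, **kwargs):
        provider = self.secondary if self._p95() > self.budget_ms else self.primary
        start = time.monotonic()
        try:
            return await provider(*args, **kwargs)
        finally:
            self.samples.append((time.monotonic() - start) * 1000.0)
```

The same wrapper can sit in front of the STT, LLM, or TTS client without changing the calling code.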

What the field taught us

Cost profile per call

Understanding the cost curve is essential for budgeting at scale. The figures below come from our March‑2026 production run, for a 3‑minute support call using Claude 3.5 Sonnet (≈ $0.002 per 1 k tokens) and ElevenLabs TTS (≈ $0.01 per 1 k characters).

For a typical 3‑minute interaction the total lands between $0.10 and $0.30. Scaling to 10 k calls per month pushes the LLM portion to $150‑$300, still an order of magnitude cheaper than staffing live human agents for the same volume.
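
To make the arithmetic explicit, here is a back‑of‑envelope calculator using the two unit prices quoted above; the token, character, and combined telephony/STT figures are placeholders chosen for illustration, not measured values.

```python
# Back-of-envelope per-call cost using the unit prices quoted above. The usage
# figures and the combined telephony/STT rate are placeholders for illustration.
LLM_PER_1K_TOKENS = 0.002   # Claude 3.5 Sonnet blended rate quoted above
TTS_PER_1K_CHARS = 0.01     # ElevenLabs rate quoted above

def call_cost(minutes: float = 3.0, llm_tokens: int = 5_000,
              tts_chars: int = 2_500, telephony_and_stt_per_min: float = 0.03) -> float:
    llm = llm_tokens / 1_000 * LLM_PER_1K_TOKENS
    tts = tts_chars / 1_000 * TTS_PER_1K_CHARS
    other = minutes * telephony_and_stt_per_min
    return llm + tts + other

print(f"${call_cost():.3f}")   # $0.125 with these placeholder inputs, inside the $0.10-$0.30 band
```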

Monitoring and observability

Real‑time metrics are the only way to keep the latency budget intact as traffic fluctuates.
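
A lightweight way to get those metrics is a per‑stage latency histogram; this sketch assumes prometheus_client, and the stage labels and scrape port are illustrative.

```python
# Per-stage latency histograms exposed for Prometheus scraping. Bucket edges
# bracket the end-to-end 800 ms budget.
import time
from contextlib import contextmanager

from prometheus_client import Histogram, start_http_server

STAGE_LATENCY = Histogram(
    "voice_agent_stage_latency_seconds",
    "Latency of each pipeline stage",
    ["stage"],
    buckets=(0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1.2, 2.0),
)

@contextmanager
def timed(stage: str):
    start = time.monotonic()
    try:
        yield
    finally:
        STAGE_LATENCY.labels(stage=stage).observe(time.monotonic() - start)

start_http_server(9102)   # expose /metrics on an internal port for the scraper

# Usage inside the pipeline:
# with timed("stt_first_partial"): ...
# with timed("llm_first_token"): ...
# with timed("tts_first_chunk"): ...
```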

Choosing providers for production

The demo‑centric ecosystem often highlights the flashiest model, but reliability, SLAs, and streaming support win in production.

Scaling considerations

When moving from a handful of pilot calls to thousands per day, two architectural dimensions dominate:

  1. Horizontal scaling of the media gateway – Deploy the Twilio‑WebSocket bridge on a Kubernetes cluster behind a load balancer that preserves sticky sessions per call ID. Autoscale based on inbound WebSocket connections (target ≤ 200 ms connection‑setup latency).
  2. Inference throughput – hosted models such as Claude are consumed over the provider’s API, so throughput there is a matter of rate limits; any self‑hosted models in the stack run on GPU‑enabled nodes (e.g., NVIDIA A100) using a serverless framework like Modal or RunPod. Keep a warm pool of containers to avoid cold starts; measure “tokens per second per GPU” and provision enough GPUs to keep the queue latency under 100 ms.

We also cache short‑term LLM responses for identical queries (e.g., “What are your business hours?”) using a 30‑second TTL. Cache hits cut the LLM latency to near‑zero and reduce cost by ~15 % in our call‑center test.
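
A sketch of that cache, assuming cachetools and a normalised‑transcript key; the maxsize and helper names are illustrative.

```python
# Short-lived response cache for repeated verbatim questions. Keys are the
# normalised transcript; entries expire after 30 seconds, matching the TTL above.
from cachetools import TTLCache

response_cache: TTLCache = TTLCache(maxsize=1_024, ttl=30)

async def cached_llm(transcript: str, llm_call) -> str:
    key = " ".join(transcript.lower().split())       # collapse case and whitespace
    if key in response_cache:
        return response_cache[key]                   # near-zero latency on a hit
    reply = await llm_call(transcript)
    response_cache[key] = reply
    return reply
```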

Future‑proofing the stack

Voice AI is evolving rapidly. To avoid lock‑in, we abstract each component behind a thin interface (stt(), llm(), tts()) that normalises streaming semantics. When a new provider advertises a 30 ms improvement in first‑partial latency, swapping it in requires only a one‑line change in the adapter.
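
A sketch of those interfaces using typing.Protocol and async iterators; the method names follow the text, while the adapter class names in the closing comment are illustrative.

```python
# Thin provider-agnostic interfaces that normalise streaming semantics. Each
# adapter wraps one vendor SDK behind the same async-iterator shape.
from typing import AsyncIterator, Protocol

class STT(Protocol):
    def stt(self, audio_frames: AsyncIterator[bytes]) -> AsyncIterator[str]:
        """Yield partial transcripts as they become available."""
        ...

class LLM(Protocol):
    def llm(self, transcript: str) -> AsyncIterator[str]:
        """Yield response tokens as they are generated."""
        ...

class TTS(Protocol):
    def tts(self, tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
        """Yield audio chunks ready to send to the telephony layer."""
        ...

# Swapping providers means writing one new adapter class and changing the single
# line where the pipeline is wired together, e.g. (class names illustrative):
# pipeline = Pipeline(stt=PrimarySTT(), llm=ClaudeLLM(), tts=ElevenLabsTTS())
```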

We also monitor emerging standards such as WebRTC audio tracks for direct peer‑to‑peer media, which could eliminate the Twilio middle‑man and shave another 50‑100 ms. Until those standards mature, Twilio remains the pragmatic choice for carrier‑grade reliability.

Building a production‑grade voice agent is less about flashy demos and more about disciplined engineering: streaming every stage, treating barge‑in as a hard requirement, preparing layered fallbacks, and instrumenting end‑to‑end latency. The result is a system that feels instantaneous to the caller and stays within a predictable cost envelope.