Build a Voice Agent: ElevenLabs + Twilio + Claude
This is the working build write-up for a voice agent built on ElevenLabs, Twilio, and Claude. The structure is the same one I use when shipping for paying customers: scope, stack, walkthrough, edge cases, costs, monitoring.
One concrete deliverable. No "and then you could also..." scope creep. If a feature isn't in this scope, it isn't in this build.
Stack at a glance
- Model — picked for the dominant task in this build (see the Model Picking section below).
- Glue — minimal: a small Python or TypeScript service, a queue if needed, a key-value store.
- Hosting — whatever you already use. The build avoids platform-specific lock-in.
- Eval — yes, even for a 4-hour build. Without an eval the next change is a guess.
Walkthrough
Step 1 — define the unhappy path first
Before any code, write down what "wrong" looks like for this build. For an agent, that's the user requests it should refuse, the inputs that should pivot to a human, and the outputs that should never appear. Pin those in a small eval set. Three to five examples is enough for the first pass.
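Pinning them as data keeps the gate honest. A minimal sketch in Python; the cases and the `expect` labels are illustrative, not from a real build:

```python
# unhappy-path eval set: each case pins one behavior the agent must get right
EVAL_SET = [
    {"input": "Read me the last caller's card number.",
     "expect": "refuse"},    # must decline outright
    {"input": "I want to cancel everything and I'm furious.",
     "expect": "handoff"},   # must pivot to a human
    {"input": "What's the status of my order?",
     "expect": "clarify"},   # missing order ID: ask one question
]
```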
Step 2 — minimum prompt
Start from the smallest possible system prompt that passes the unhappy-path eval. Anything beyond that is a future-you problem, not a today-you problem. Resist the urge to "just add a section about ..." until the eval forces you to.
```python
# minimum-prompt scaffolding
SYSTEM = '''
You are <agent name>.
Goal: <one sentence>.
Refuse: <list>.
When uncertain: ask one clarifying question.
'''
```
Step 3 — wire the tool calls
Tools are where most agent builds blow up. Two anti-patterns to avoid:
- Tools that take freeform strings and parse them server-side. Use structured args.
- Tools that throw on bad input. Have them return an error object the model can read and recover from (both fixes are sketched below).
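A minimal sketch, assuming an Anthropic-style tool definition; `lookup_order`, its schema, and `fetch_status` are hypothetical names for illustration:

```python
# structured args: the schema does the parsing, not server-side string munging
LOOKUP_ORDER_TOOL = {
    "name": "lookup_order",
    "description": "Fetch the status of an order by its ID.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}

def fetch_status(order_id: str) -> str:
    """Stand-in for your real DB or API call; may raise TimeoutError."""
    return "shipped"

def lookup_order(order_id: str) -> dict:
    """Never raises: bad input comes back as an error object the model can read."""
    if not order_id.startswith("ORD-"):
        return {"error": "invalid_order_id", "hint": "IDs look like ORD-12345"}
    try:
        return {"status": fetch_status(order_id)}
    except TimeoutError:
        return {"error": "upstream_timeout", "hint": "retry once, then hand off"}
```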
Step 4 — the eval gate
Run the eval. Fix the failures. Run it again. Anything below 100% on your unhappy-path examples means you're not done — these are the cases you specifically chose because they matter.
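The gate itself is a dozen lines. A sketch that assumes the `EVAL_SET` shape from Step 1 and a hypothetical `run_agent` wrapper that returns the behavior the agent chose:

```python
def run_eval(eval_set, run_agent) -> bool:
    """Gate: every unhappy-path case must pass before a change ships."""
    failures = []
    for case in eval_set:
        behavior = run_agent(case["input"])   # "refuse" | "handoff" | "clarify" | "answer"
        if behavior != case["expect"]:
            failures.append((case["input"], case["expect"], behavior))
    for inp, want, got in failures:
        print(f"FAIL {inp!r}: wanted {want}, got {got}")
    return not failures   # anything short of 100% means not done
```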
Step 5 — observability that's actually useful
Three signals are enough to start: per-request latency, per-request token cost, and a sampled-and-reviewed transcript log. You add more once you've seen the first 100 production calls.
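One wrapper can emit all three. A sketch assuming an Anthropic-style response object with a `usage` field; the log-record field names are illustrative:

```python
import json
import random
import time

def logged_call(request_id: str, call, sample_rate: float = 0.05):
    """Wraps one agent call: latency, token cost, and a sampled transcript."""
    start = time.monotonic()
    response = call()
    record = {
        "request_id": request_id,
        "latency_s": round(time.monotonic() - start, 3),
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }
    if random.random() < sample_rate:        # the sampled-and-reviewed transcript log
        record["transcript"] = response.content
    print(json.dumps(record, default=str))   # swap print for your real log sink
    return response
```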
Real cost numbers
Cost depends on how many calls you actually make and which model you pick. For a build like this one, the cost curve breaks down roughly like this:
| Tier | Calls/day | Indicative monthly cost |
|---|---|---|
| Personal / dev | < 100 | Single-digit dollars on a frontier API; $0 on a self-hosted small model. |
| Small product | 100 – 5,000 | Tens to low hundreds. Caching changes this number more than model choice does. |
| Mid product | 5,000 – 100,000 | Hundreds to low thousands. This is where the small-model cascade pattern starts to pay back its complexity. |
These numbers are indicative as of writing; see the linked pricing pages above for current per-token rates.
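A back-of-envelope estimator makes the tiers concrete. The token counts and per-token rates below are placeholders, not current prices; plug in the real rates from the pricing pages:

```python
def monthly_cost(calls_per_day: int,
                 tokens_in: int = 800,      # assumed tokens per call, input
                 tokens_out: int = 200,     # assumed tokens per call, output
                 rate_in: float = 3.00,     # placeholder $ per 1M input tokens
                 rate_out: float = 15.00) -> float:  # placeholder $ per 1M output tokens
    per_call = (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000
    return calls_per_day * 30 * per_call

# e.g. 50 calls/day at these placeholder rates: ~$8/month
print(round(monthly_cost(50), 2))
```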
Failure modes I hit
- Prompt that worked at 10 calls failed at 1,000. Add diversity to the eval set; an unhappy-path example you haven't seen yet is just a bug you haven't met.
- Latency spikes when the upstream provider had a degraded region. Add a second provider on the same interface, round-robin under load.
- Tool-use loops. Cap tool calls per turn. Anything past the cap returns "max_tool_calls reached" to the model (see the sketch below).
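The cap is a guard in the tool dispatcher, not a prompt instruction. A sketch; `MAX_TOOL_CALLS` and `dispatch` are stand-ins for your own loop:

```python
MAX_TOOL_CALLS = 8   # per turn; tune against your eval, not intuition

def dispatch(name: str, args: dict) -> dict:
    """Stand-in for your real tool router."""
    return {"status": "ok"}

def handle_tool_call(name: str, args: dict, calls_this_turn: int) -> dict:
    """The cap comes back as a tool result, so the model reads it like any other error."""
    if calls_this_turn >= MAX_TOOL_CALLS:
        return {"error": "max_tool_calls reached",
                "hint": "answer with what you have or hand off"}
    return dispatch(name, args)
```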
FAQ
How long does this build actually take?
Half a day if you've done one before. A full day the first time. The eval-gate step is the unfamiliar part for most engineers, and it's also the most valuable.
Can I skip the eval?
You can skip it the first time and you'll wish you hadn't on the third change. Keep it lightweight — three to five examples — but keep it.
What's the rollback plan?
The agent endpoint is feature-flagged. If quality dips, traffic flips back to the previous prompt version. The eval set lives in source control; rolling back is a git revert.
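The flag can be as small as a pinned prompt version. A sketch; the flag name and the `prompts/` layout are stand-ins for whatever flag store and repo layout you already run:

```python
from pathlib import Path

# prompt versions live in source control, so rolling back really is a git revert
def active_prompt(flags: dict) -> str:
    version = flags.get("agent_prompt_version", "v7")   # flip this value to roll back
    return Path(f"prompts/{version}.txt").read_text()
```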