
Cost-Engineering an LLM App: From $400/day to $60/day

By OrionAI Build Editorial · Published 2026-05-10

The starting point

Our client runs a multilingual customer‑support assistant that fields roughly 12 000 user messages per day. The initial implementation was a straight‑through call to Anthropic’s claude‑3‑sonnet‑20240229 model, with a static 1 800‑token system prompt and the last 20 turns of conversation attached to every request. No caching, no routing, no token‑budgeting. At the time the cost calculator in the provider console showed about $0.032 per 1 000 input tokens and $0.044 per 1 000 output tokens. With an average of 250 input tokens and 150 output tokens per turn, the daily bill hovered around $400. The product met the SLA for latency and accuracy, but the margin was razor‑thin.

Change 1 — small‑model‑first cascade

We profiled the traffic and found that 60 % of inbound messages were simple intent classifications: “I want a refund”, “where’s my order?”, “change my address”. Those are binary or ternary decisions that any 1‑B‑parameter model can handle. We fine‑tuned a mistral‑7b‑instruct checkpoint on a 10 k‑example intent dataset and deployed it on a managed inference endpoint (together.ai, $0.015 per 1 000 tokens). The routing layer inspected the user utterance, ran the small model, and only escalated to the frontier model when the confidence fell below 0.92 or when the request matched a “requires reasoning” tag.
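The routing layer described above can be sketched as a small pure function. This is a minimal illustration, not the production code: `classify_intent`, `call_small_model`, and `call_frontier_model` are hypothetical stand-ins for the fine-tuned classifier and the two inference endpoints, and the "requires reasoning" tags are invented examples.

```python
# Sketch of a small-model-first cascade. The three callables are
# hypothetical stand-ins for the real inference endpoints.

CONFIDENCE_FLOOR = 0.92                 # below this, escalate to the frontier model
REQUIRES_REASONING = {"policy_question", "multi_step_complaint"}  # illustrative tags


def route(utterance, classify_intent, call_frontier_model, call_small_model):
    """Run the cheap classifier first; escalate only when it is unsure
    or the predicted intent is tagged as needing real reasoning."""
    intent, confidence = classify_intent(utterance)
    if confidence < CONFIDENCE_FLOOR or intent in REQUIRES_REASONING:
        return call_frontier_model(utterance)
    return call_small_model(utterance, intent)
```

Keeping the router a stateless function of the utterance makes it trivial to hot-reload thresholds and tags without draining in-flight traffic.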

Impact: frontier‑model calls dropped from 12 000 to ≈5 200 per day. Because the small model’s per‑token price is about half the frontier model’s, overall spend fell by 45 %, to roughly $220/day.

Change 2 — prompt caching

Anthropic’s API supports prompt caching: a static prefix such as the system prompt is stored server‑side for the lifetime of the cache, and subsequent calls pay only a small fraction of the normal input price for it. Our system prompt was 1 800 tokens and unchanged across the entire product. By marking it as cacheable in each request, every cache hit saved roughly 1 800 × $0.032 / 1 000 ≈ $0.058. With 5 200 frontier‑model calls per day and an ≈85 % cache‑hit rate, the daily saving was 5 200 × $0.058 × 0.85 ≈ $255, bringing the frontier‑model spend down to $140.

We also memoized the embedding of the system prompt locally to avoid the round‑trip for the cache‑key generation, shaving an additional 5 ms off latency.

Change 3 — context trimming by relevance

The original implementation shipped the last 20 turns (≈ 3 000 tokens) with every request. We built a lightweight relevance filter using sentence‑transformers/all‑mpnet‑base‑v2. For each new user message we computed the cosine similarity of its embedding against the embeddings of prior turns and kept only the turns scoring above a 0.75 relevance threshold, up to a top‑N cap. In practice N settled at 5–6 turns (≈ 800 tokens) for 92 % of conversations.
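The selection step can be sketched as follows. This is a simplified illustration: it assumes unit-length embedding vectors (so the dot product is the cosine similarity) and takes an `embed`-free interface, with pre-computed vectors passed in, rather than loading the sentence-transformers model.

```python
import numpy as np

# Sketch of relevance-based context trimming. Assumes embeddings are
# already computed (e.g. by sentence-transformers/all-mpnet-base-v2)
# and normalised to unit length.

SIM_THRESHOLD = 0.75   # relevance threshold from the article
MAX_TURNS = 6          # top-N cap; N settled at 5-6 in practice


def select_turns(query_vec, turn_vecs, max_turns=MAX_TURNS, threshold=SIM_THRESHOLD):
    """Return indices of prior turns to keep, ordered oldest-first.
    Keeps at most `max_turns` turns whose cosine similarity to the
    new user message is at least `threshold`."""
    sims = np.array([float(np.dot(query_vec, v)) for v in turn_vecs])
    ranked = [i for i in np.argsort(-sims) if sims[i] >= threshold]
    return sorted(int(i) for i in ranked[:max_turns])
```

Sorting the surviving indices restores chronological order, so the trimmed history still reads as a coherent conversation to the model.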

Result: average input token count fell from 2 500 to 1 125, a 55 % reduction, driving spend down to a further $80/day without any measurable degradation on the held‑out evaluation set.

Change 4 — batching non‑real‑time workloads

Two background jobs were also consuming a non‑trivial slice of the budget. Both were originally executed via the same per‑token endpoint used for live traffic. We switched them to the provider’s batch endpoint, which charges 30 % less per token and accepts up to 10 000 requests in a single call. The batch jobs now cost roughly $15 per day, a modest but steady saving.
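Packaging a background job for a batch endpoint mostly means attaching an ID to each request so results can be matched back after asynchronous completion. The sketch below uses the `custom_id` + `params` shape of Anthropic’s Message Batches API as we understand it; treat the field names as an assumption to verify against the current docs, and note that the helper only builds the payload.

```python
# Sketch of assembling a batch of background-job requests.
# Assumption: entries use the `custom_id` + `params` shape of
# Anthropic's Message Batches API; check field names against the docs.


def build_batch(prompts, model="claude-3-sonnet-20240229"):
    """One entry per prompt. `custom_id` lets us join results back to
    inputs when the batch completes asynchronously."""
    return [
        {
            "custom_id": f"job-{i}",
            "params": {
                "model": model,
                "max_tokens": 256,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts)
    ]
```

The resulting list would be submitted in one call, e.g. `client.messages.batches.create(requests=build_batch(prompts))` with the official SDK.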

Change 5 — continuous evaluation gates

Every optimisation was gated by an automated regression suite. The suite consists of:

  1. A 2 000‑example held‑out set covering refunds, status checks, policy queries, and edge‑case escalations.
  2. Metrics: exact‑match intent accuracy, BLEU for free‑form responses, and a latency ceiling of 800 ms.
  3. A “quality delta” threshold of –2 % relative to the baseline.
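The gate itself reduces to a small comparison function. This sketch uses illustrative metric dictionaries; only the thresholds (the −2 % quality‑delta floor and the 800 ms latency ceiling) come from the suite described above.

```python
# Sketch of the regression gate. Metric keys are illustrative; the
# thresholds match the suite described in the article.

QUALITY_DELTA_FLOOR = -0.02   # max allowed relative drop per quality metric
LATENCY_CEILING_MS = 800      # hard latency ceiling


def passes_gate(baseline: dict, candidate: dict) -> bool:
    """A candidate change ships only if every quality metric stays within
    the allowed relative drop and latency stays under the ceiling."""
    for metric in ("intent_accuracy", "bleu"):
        rel_delta = (candidate[metric] - baseline[metric]) / baseline[metric]
        if rel_delta < QUALITY_DELTA_FLOOR:
            return False
    return candidate["p95_latency_ms"] <= LATENCY_CEILING_MS
```

Because the gate is a pure function of two metric snapshots, it drops straight into CI: compute the candidate’s metrics on the held-out set, compare, and fail the build on `False`.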

Two candidate changes failed this gate: (a) aggressive summarisation of the last 20 turns using a 200‑token abstractive model, which cut latency but introduced a 4 % intent‑accuracy dip; (b) swapping the frontier model for a cheaper 70‑B competitor, which saved $10/day but caused a 3 % rise in hallucinations on policy questions. Both were rolled back and re‑engineered. The gate ensured that cost cuts never compromised the SLA.

Where the money landed

Aggregating the line‑item savings yields the following daily spend trajectory:

  1. Baseline: ≈ $400/day
  2. After the small‑model cascade: ≈ $220/day
  3. After prompt caching: ≈ $140/day
  4. After context trimming: ≈ $80/day
  5. After batching the background jobs: ≈ $60/day

The exact numbers will shift with traffic spikes, model price changes, or prompt rewrites, but the ordering of impact—cascade, caching, trimming, batching—has proved robust across three separate customer‑support deployments we have audited.

What didn’t help

We tried three ideas that either broke the product or delivered negligible savings.

Scaling the pattern to other LLM products

The five‑step framework is portable:

  1. Traffic profiling. Quantify intent complexity, token distribution, and latency sensitivity. Tools like prometheus + grafana or provider‑specific usage dashboards give you the raw numbers.
  2. Model cascade design. Pair a cheap, fine‑tuned classifier with a high‑capability reasoning model. Keep the routing logic stateless so it can be hot‑reloaded without downtime.
  3. Prompt engineering for cacheability. Separate static system prompts from dynamic user context. Use the provider’s cache key header or embed the static prompt in a separate API call.
  4. Context management. Replace “last N turns” heuristics with relevance‑based retrieval. Embedding stores such as pinecone or weaviate add negligible latency when indexed properly.
  5. Evaluation‑driven iteration. Automate regression testing and enforce a hard quality floor before any cost‑saving change reaches production.

When we applied the same pipeline to a real‑time code‑assistant (≈ 3 000 requests/day, average 500‑token code snippets) we saw a 70 % cost reduction while maintaining a 95 % pass rate on the HumanEval benchmark.

Practical checklist for immediate savings

If you need quick wins, run through this list in a single sprint:

  1. Profile a week of traffic to quantify the share of simple, classifiable intents.
  2. Route those intents to a cheap fine‑tuned model and escalate only on low confidence.
  3. Mark the static system prompt as cacheable and keep dynamic context out of the cached prefix.
  4. Replace “last N turns” context with relevance‑filtered turns.
  5. Move non‑real‑time jobs to the provider’s batch endpoint.

Even if you adopt only two of these steps, you can expect a 30–40 % reduction in daily spend without sacrificing user experience.

This is part of the Build cornerstone series.