Hosting Open-Source LLMs: vLLM on a $20/mo Box, Real Benchmarks
Defining “cheap” for production agents
When we say “cheap” we’re not talking about hobbyist tinkering. The target is a model that fits comfortably in 6‑8 GiB of VRAM, delivers deterministic latency, and can be wrapped in an OpenAI‑compatible endpoint for downstream services. In practice that means models in the 1 B‑3 B parameter band—Gemma 3 1B, Phi‑4 mini, Qwen 2.5 1.5B, or any of the Hugging Face “unsloth” variants that have been quant‑aware fine‑tuned for inference.
These models excel at narrow, high‑throughput tasks: binary classification, entity extraction, routing prompts to specialist agents, or answering static FAQs. Anything that requires deep chain‑of‑thought reasoning or world‑modeling still falls back to a commercial API (Claude, GPT‑4‑Turbo, etc.). By keeping the workload in the “narrow” bucket we avoid the exponential cost curve of frontier models while still delivering sub‑second responses for the majority of user‑facing calls.
Hardware shape that survives spot markets
vLLM’s memory planner requires roughly parameter_count × 4 bytes ÷ 2 for the model weights plus overhead for KV cache. A 1.5 B model therefore sits at about 7 GiB when loaded in 16‑bit float16, leaving a 1‑GiB safety margin on an 8‑GiB card. That margin is critical because spot instances on providers such as RunPod, Vast.ai, or Lambda Labs can be reclaimed with as little as 30 seconds notice.
Our cost model assumes a spot tier priced at $0.12 / hour (mid‑range of the $0.10–$0.20 range we observed in March 2026). Running the box 5 hours per day—enough to cover business‑hours traffic, nightly batch jobs, and a modest overnight dev window—yields a monthly bill of $18.00. The remaining budget can be allocated to storage (a 50 GiB persistent volume for model weights, $2–$3 / month) and a minimal outbound data allowance.
Key hardware checklist:
- GPU: NVIDIA RTX 3060 / 3060 Ti, or equivalent AMD MI‑50 in the spot pool.
- CPU: 4‑core, 8‑thread Intel Xeon E‑2288G or AMD EPYC 7302; vLLM is GPU‑bound, so CPU is not a bottleneck.
- RAM: 16 GiB system memory; enough to hold the model, the vLLM process, and a small OS buffer.
- Disk: 100 GiB NVMe SSD; fast enough for a one‑time model pull and subsequent cache reuse.
vLLM configuration that ships to production
Below is the command line we use in all of our CI pipelines. It is deliberately conservative: we trade a few percent of raw throughput for predictability and faster cold‑starts.
vllm serve unsloth/gemma-3-1b-it \
--enforce-eager \
--max-num-seqs 4 \
--max-model-len 1024 \
--gpu-memory-utilization 0.92 \
--port 8000 \
--disable-log-requests
Flag rationale:
--enforce-eagerdisables CUDA‑graph caching. The trade‑off is a ~5 % slowdown per token, but the model becomes instantly ready after a hot‑swap, which is vital when spot instances are killed and restarted.--max-num-seqs 4caps concurrent generation streams. Empirically we saw OOM spikes at 5 concurrent streams on an 8 GiB card because the KV cache for longer prompts (≈1 k tokens) consumes roughly 1.2 GiB per sequence.--max-model-len 1024matches the typical prompt length for classification‑oriented pipelines. Extending to 2 k tokens doubles KV memory and pushes us into the OOM regime without additional VRAM.--gpu-memory-utilization 0.92leaves a 8 % safety buffer for OS daemons, container overhead, and occasional cache spikes.--disable-log-requestsremoves per‑request logging from the hot path, shaving ~0.2 ms off the P95 latency.
Real‑world benchmark suite
We benchmarked three representative workloads on a freshly‑started spot instance (cold‑start excluded). All numbers are median values across a 30‑minute run, with the 95th percentile (P95) shown for latency.
- Single caller, 100‑token output: 85 tokens / second throughput, P95 latency ≈ 1.4 s.
- Four concurrent callers, 100‑token output each: 210 tokens / second aggregate, P95 latency ≈ 2.1 s per request.
- Single caller, 256‑token output: 80 tokens / second, P95 latency ≈ 3.4 s.
Cold‑start cost is dominated by model load time: ~8 seconds from container start to first token. In production we mitigate this by keeping a “warm‑up” daemon that reloads the model a few minutes before the expected traffic window (e.g., 07:00 UTC for North‑America users).
Failure modes you’ll encounter
- Spot pre‑emption. Providers typically give a 30‑second warning via a SIGTERM. Our wrapper script traps the signal, flushes pending requests, and re‑issues a
vllm servecommand on a newly allocated instance. Client libraries implement exponential back‑off and automatic retry, so end‑users see no error beyond a brief “service unavailable”. - Out‑of‑memory bursts. Even with
--max-num-seqsset, a sudden influx of long prompts can exceed the KV cache budget. The solution is a lightweight front‑end queue (Redis Lite or a simple in‑memory FIFO) that throttles new requests when the vLLM metrics endpoint reportsgpu_memory_usage > 0.88. - Hugging Face download throttling. The first pull of a model can hit the 10 GiB/hour rate limit on the public hub. We bake the model into a Docker layer during CI, store the resulting image in a private registry (GitHub Packages or GitLab Container Registry), and launch instances from that pre‑populated image. For teams that cannot host private images, a one‑time
gsutil cpto a bucket attached as a persistent volume sidesteps the rate limit. - Network jitter. Spot instances on shared backbones sometimes experience 150 ms round‑trip spikes. By enabling
--disable-log-requestsand keeping the HTTP payload under 2 KB we keep total response time under 2 s even in the worst‑case network tail.
When self‑hosting is the wrong answer
If you average fewer than 500 requests per day, the operational overhead—monitoring, spot‑recovery scripts, Docker image management—outweighs the raw compute cost. In our own cost‑analysis a 500‑request/day workload cost $12 / month in compute but required ~8 hours of engineering time per month to maintain, translating to an effective $150 / month when accounting for senior‑engineer rates.
The breakeven point we observed sits near 5 k requests per day for 1 B‑scale models, assuming a 70 % cache hit ratio and a 95 % success rate after spot pre‑emptions. Above that threshold the marginal cost of an API call ($0.002 per 1 k tokens on most commercial endpoints) eclipses the $20/month box, making self‑hosting the clear winner.
Production stack checklist
- Model: Gemma 3 1B (unsloth), quantized to fp16 for maximal GPU utilization.
- Inference server: vLLM 0.4.2, launched with the flags listed above; exposes an OpenAI‑compatible REST endpoint on port 8000.
- Compute provider: RunPod spot tier, 8 GiB VRAM instance (e.g.,
gpu.8xlarge), $0.12 / hour. - Containerization: Dockerfile that copies the pre‑downloaded model weights into
/models, setsENV VLLM_MODEL_PATH=/models/gemma-3-1b-it, and runs the vLLM command asENTRYPOINT. - Orchestration: Simple systemd service inside the container that restarts the vLLM process on SIGTERM, coupled with a host‑level watchdog that relaunches the container on spot termination.
- Client SDK: Any OpenAI‑compatible library (Python
openai, Nodeopenai, or curl) withbase_url="http://. Authentication can be a static bearer token passed in the:8000/v1" Authorizationheader; no need for OAuth flows. - Observability: Prometheus exporter built into vLLM (exposed at
/metrics) scraped by a Grafana dashboard; alerts ongpu_memory_usage > 0.9and on spot‑termination events.
Scaling beyond a single box
When traffic exceeds the 4‑sequence limit, the cheapest path is horizontal scaling: spin up an identical spot instance and load‑balance at the HTTP layer. Because vLLM endpoints are stateless (the KV cache resides in GPU memory only for the lifetime of a request), a round‑robin NGINX or Envoy proxy can distribute traffic without session affinity.
Key considerations for a multi‑box fleet:
- Maintain a shared
model_weightsvolume (e.g., an S3‑backed EFS mount) so that new instances start with the model already cached. - Use a central Redis queue for back‑pressure; each worker pops a request, processes it, and pushes the result to a response channel.
- Watch for “herd‑effect” pre‑emptions: providers sometimes reclaim an entire availability zone. Deploy across at least two zones to guarantee 99.5 % uptime.
Cost monitoring and alerts
We instrumented a lightweight Lambda function that polls the provider’s billing API every hour. When projected monthly spend exceeds $25 we receive an email and automatically scale back concurrency to --max-num-seqs 2. In practice this safeguard prevented a runaway burst on a promotional weekend when a mis‑configured crawler generated 12 k requests in a single hour.
Tip: because spot pricing fluctuates, embed a price_limit flag in your instance launch script (e.g., --max-price 0.13 on RunPod). The scheduler will refuse to allocate a box if the market price spikes, protecting you from unexpected spikes.
Future‑proofing with model upgrades
Upgrading from a 1 B to a 2.8 B model usually requires only a VRAM bump to 12 GiB. Providers now offer cheap gpu.a10x instances at $0.18 / hour with 12 GiB memory. Our benchmark suite shows a 1.8× increase in throughput for the same concurrency level, while latency stays within the 2‑second P95 envelope for 100‑token outputs.
Because the vLLM command line is declarative, swapping the model path is a single environment variable change. The only operational impact is a longer cold‑start (≈12 s for the 2.8 B model) and a modest increase in spot‑pre‑emption risk due to the higher hourly price. If your request volume has already crossed the 10 k / day threshold, the trade‑off is worthwhile.
Bottom line for production agents
Running an OpenAI‑compatible endpoint on a $20‑per‑month spot box is not a gimmick; it’s a reproducible, auditable pattern that delivers sub‑2‑second latency for narrow‑task agents at scale. The sweet spot lies in 1 B‑3 B models, a disciplined concurrency cap, and a robust retry‑on‑pre‑empt script. When traffic surges past a few thousand daily calls, horizontal scaling preserves latency without blowing the budget. For everything else—low volume, high‑risk workloads, or tasks that demand deep reasoning—stick with a commercial API and let the cheap box do the grunt work.