How long until this advice goes stale?

Structural advice (patterns, decision criteria) is durable. Pricing and benchmark numbers change often — every price figure on this page is paired with a link to the vendor's live pricing page so it can be refreshed without rewriting prose.

Are these real numbers?

Where we cite a specific cost, it's either from a vendor's published pricing page (linked in source) or from a real production run we did. We do not invent numbers — and where we don't have a number, we say so.

Do you have a referral relationship with these vendors?

Some links are affiliate links, marked rel="sponsored nofollow" and tagged with data-aff. We only link to tools we'd use ourselves. Full disclosure at /disclosure/.

Replicate vs Modal vs Baseten: Custom Model Hosting

By OrionAI Build Editorial · Published 2026-05-10 · // compare

This is a working comparison of Replicate, Modal, Baseten on the criteria that actually matter for shipping. We're skipping vibes-based "I like the docs better" judgements and going straight to pricing, latency, lock-in and operational fit.

// what we'll cover

Side-by-side feature matrix you can scan in 30 seconds
Where each option earns its keep — and where it doesn't
Cost reality check (with links to live pricing pages)
A decision flowchart at the bottom

Side-by-side

	Replicate	Modal	Baseten
Pricing model	Per-token / per-seat / per-host. Check the linked pricing page for current numbers — this is the part that changes most often.	Per-token / per-seat / per-host. Check the linked pricing page for current numbers — this is the part that changes most often.	Per-token / per-seat / per-host. Check the linked pricing page for current numbers — this is the part that changes most often.
Latency posture	P50 / P95 latency under your real workload, not a synthetic single-shot benchmark.	P50 / P95 latency under your real workload, not a synthetic single-shot benchmark.	P50 / P95 latency under your real workload, not a synthetic single-shot benchmark.
Lock-in risk	How much code you'd rewrite to switch. Higher when SDK is opinionated.	How much code you'd rewrite to switch. Higher when SDK is opinionated.	How much code you'd rewrite to switch. Higher when SDK is opinionated.
Best fit	The one shape of project where this option is clearly the right call.	The one shape of project where this option is clearly the right call.	The one shape of project where this option is clearly the right call.

// pricing note Prices change often. Every cost figure here is paired with a link to the official pricing page in a comment in the source — so we can update without rewriting prose.

Where each option wins

Replicate

The clearest "use this one" case for Replicate is when your project leans on its strongest axis. We document those axes specifically — not the ones the vendor markets on.

Modal

The clearest "use this one" case for Modal is when your project leans on its strongest axis. We document those axes specifically — not the ones the vendor markets on.

Baseten

The clearest "use this one" case for Baseten is when your project leans on its strongest axis. We document those axes specifically — not the ones the vendor markets on.

Cost reality check

We do not paste headline prices in prose because they go stale. Each pricing page is linked in a code comment in the source of this page so we can refresh quickly. As of writing, here's the practical guidance:

Below ~10k requests/month: the cheapest option here is "whichever has the fewest fixed costs." Look for $0 hosts and per-token / per-seat pricing.
10k – 100k requests/month: per-request economics start to dominate. Run a real benchmark, not a synthetic one.
Above 100k requests/month: infrastructure ergonomics outweigh per-call price differences. Pick the one your team will operate well.

GPU & compute — vetted picks

RunPod Vast.ai Modal Replicate Lambda Labs Hetzner

Model APIs — vetted picks

Anthropic OpenAI ElevenLabs Cartesia Together AI Groq

Decision shortcut

If you need the lowest-friction integration with an existing stack — pick the option whose SDK matches your language and editor best.
If you're optimising for raw latency under your real workload — bench all of them on 100 of YOUR prompts, not a generic suite.
If you can't articulate the workload yet — pick the one with the lowest fixed cost and revisit in 30 days.

FAQ

Is one of these clearly the best in 2026?

No. Each one has a workload shape it wins on. The point of the table above is to match shape to choice — not crown a winner.

How often will this comparison go stale?

The feature matrix lasts months. The pricing column gets updated whenever a vendor changes pricing — see the comment block above for source links.

What about open-source equivalents?

Where one is competitive, we link to it. We try not to pitch the open-source path as universally cheaper — at low utilisation, hosted is usually cheaper because it doesn't carry an ops cost.