Picking a Vector DB in 2026: A Decision Framework
The four questions that prune the field
Before you even glance at a vendor’s brochure, answer these four questions. They collapse the combinatorial space of vector stores into a handful of viable candidates.
- Scale. Current vector count and 12‑month growth projection. Distinguish sub‑million, 1‑10 M, and 10 M+ regimes.
- Latency budget. Measure P95 under realistic concurrency (your typical request rate, not an isolated benchmark). Note the ceiling you can tolerate before the downstream LLM call dominates.
- Filtering complexity. Pure vector similarity, heavy metadata predicates, or geospatial constraints? Identify the dominant pattern.
- Operations appetite. How many full‑time equivalents can you devote to provisioning, monitoring, upgrades, backups, and disaster recovery?
Decision matrix distilled
Sub‑million vectors, minimal filters
Use pgvector inside an existing PostgreSQL cluster. The extension adds a vector column type, indexes with IVFFlat or HNSW, and leverages the same authentication, backups, and tooling you already run. Adding a separate service for fewer than 1 M vectors rarely yields a net benefit.
1 M–10 M vectors, light filtering
Self‑hosted Qdrant or Weaviate on a modest VM (2 vCPU, 8 GiB RAM, NVMe) typically costs $20–$60 / month on providers such as Hetzner or Vast.ai. Both expose a REST‑ful /search endpoint, support HNSW out‑of‑the‑box, and provide optional ONNX runtime acceleration. In our experience, pgvector begins to show >30 % slowdown around the 5 M‑vector mark unless you manually tune lists (for IVFFlat) and ef_construction (for HNSW).
1 M–10 M vectors, heavy metadata filtering
Qdrant’s “filterable HNSW” index lets you combine arbitrary boolean predicates with vector similarity in a single pass. This is essential when you need to enforce tenant isolation, product version gating, or time‑range cuts. Weaviate offers a similar GraphQL filter layer, but its performance degrades noticeably when the filter cardinality exceeds a few hundred thousand rows.
10 M+ vectors
At this scale the operational overhead of self‑hosting outweighs raw cost. Managed services—Qdrant Cloud, Pinecone, Weaviate Cloud—provide auto‑scaling, multi‑zone replication, and built‑in monitoring. Their pricing tiers start around $500 / month for 10 M vectors with guaranteed <10 ms P95 latency. Choose the provider whose SLA aligns with your availability requirements; the underlying algorithmic differences are marginal for most production workloads.
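The four scale-based rows above can be read as a small lookup. The sketch below is one hypothetical way to encode them in Python; the `Workload` type and `recommend` function are illustrative names, not part of any vendor SDK. The matrix does not cover sub-million corpora with heavy filters, so the sketch assumes pgvector still applies there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    vector_count: int             # current or 12-month projected count
    heavy_metadata_filters: bool  # True if metadata predicates dominate queries


def recommend(workload: Workload) -> str:
    """Return the matrix row matching the workload (illustrative only)."""
    n = workload.vector_count
    if n < 1_000_000:
        # Assumption: heavy filters at this scale are left to ordinary SQL WHERE clauses.
        return "pgvector in existing PostgreSQL"
    if n <= 10_000_000:
        if workload.heavy_metadata_filters:
            return "self-hosted Qdrant (filterable HNSW)"
        return "self-hosted Qdrant or Weaviate"
    return "managed service (Qdrant Cloud, Pinecone, Weaviate Cloud)"


print(recommend(Workload(vector_count=5_000_000, heavy_metadata_filters=True)))
```

The multi-tenant and geospatial rows that follow are deliberately left out; they are better treated as overrides on top of this lookup than as extra branches.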
Multi‑tenant SaaS pattern
Namespaces (Pinecone) or collection isolation (Qdrant Cloud) let you partition data per customer without spinning up separate clusters. Self‑hosted multi‑tenant architectures demand per‑tenant resource quotas, quota‑aware routing, and a robust RBAC model, which is effectively the work of a dedicated devops team. In practice, we have seen teams spend 0.5–1 FTE on an ongoing basis just to keep tenant isolation safe.
Geospatial‑aware search
If you need to combine distance‑based filters with vector similarity, pair pgvector with PostGIS. The ST_DWithin predicate runs before the ANN lookup, keeping the candidate set small. Qdrant now ships a geo_filter primitive that indexes latitude/longitude in a separate HNSW layer; it performs well up to a few hundred thousand geo points. Pinecone’s geo support is experimental and currently lacks radius‑based queries.
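To make the "filter first, then rank" idea concrete, here is a minimal pure-Python sketch. It uses an exact brute-force scan rather than an ANN index, and the item layout (`id`, `lat`, `lon`, `vec`), the sample data, and the function names are all assumptions made for illustration. The distance predicate plays the role that ST_DWithin plays in PostGIS, and vectors are assumed to be non-zero.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def geo_then_vector(items, query_vec, lat, lon, radius_km, k):
    """Drop items outside the radius, then rank the survivors by similarity."""
    nearby = [it for it in items
              if haversine_km(lat, lon, it["lat"], it["lon"]) <= radius_km]
    nearby.sort(key=lambda it: cosine(query_vec, it["vec"]), reverse=True)
    return [it["id"] for it in nearby[:k]]


# Hypothetical sample data: Paris matches the query best but is out of range.
ITEMS = [
    {"id": "berlin", "lat": 52.52, "lon": 13.405, "vec": [0.6, 0.8]},
    {"id": "potsdam", "lat": 52.39, "lon": 13.06, "vec": [0.9, 0.1]},
    {"id": "paris", "lat": 48.8566, "lon": 2.3522, "vec": [1.0, 0.0]},
]
print(geo_then_vector(ITEMS, [1.0, 0.0], 52.52, 13.405, radius_km=50.0, k=2))
```

The point of the ordering is cost: the similarity computation only touches rows that survived the cheap geographic cut.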
Common optimisation rabbit holes
What teams obsess over rarely moves the needle in production.
- Recall@1 on synthetic benchmarks. The 0.1 % gain you see on MS MARCO or FAISS evals disappears once you factor in domain‑specific tokenisation and query reformulation.
- Micro‑second raw lookup latency. When the downstream LLM call costs 30–150 ms, a 5 ms vector fetch is negligible. Focus on the 95th‑percentile tail instead.
- “State‑of‑the‑art” algorithms. HNSW and IVFPQ dominate the performance‑quality frontier. Newer methods (e.g., DiskANN, ScaNN) offer marginal gains at the cost of added complexity.
Under‑measured factors that decide winners
These operational dimensions surface only after weeks of real usage.
- Re‑index time. Switching embedding models (e.g., from OpenAI text-embedding-3-large to e5-base) forces a full rebuild. In our deployments, a 10 M‑vector corpus reindexes in 2–4 hours on a single c5.4xlarge with parallel ingestion.
- Backup & restore workflow. Verify point‑in‑time recovery. Qdrant Cloud offers automated snapshots; self‑hosted Qdrant requires you to script rsync of the storage directory and test restores on a staging node.
- Local developer experience. A Docker‑compose setup for pgvector runs in under 30 seconds on a laptop, whereas Qdrant needs a separate qdrant/qdrant image with persisted volumes. Faster iteration reduces friction in early‑stage experimentation.
- Schema agility. Adding a new metadata column should not entail a migration lock. PostgreSQL’s ALTER TABLE … ADD COLUMN is instantaneous on modern SSDs; Qdrant’s schema is immutable per collection, requiring a new collection and data copy.
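The re-index figure is easier to reuse as a throughput number. The helper names below are hypothetical, and the arithmetic is only a back-of-the-envelope sketch that assumes ingestion speed stays roughly constant across the rebuild.

```python
def implied_throughput(vector_count: int, hours: float) -> float:
    """Vectors per second implied by rebuilding `vector_count` vectors in `hours`."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return vector_count / (hours * 3600)


def reindex_hours(vector_count: int, vectors_per_second: float) -> float:
    """Estimated rebuild time in hours at a sustained ingestion rate."""
    if vectors_per_second <= 0:
        raise ValueError("vectors_per_second must be positive")
    return vector_count / vectors_per_second / 3600


# The 2-4 hour range quoted for a 10 M-vector corpus, expressed as throughput.
fast = implied_throughput(10_000_000, 2)  # roughly 1,390 vectors/s
slow = implied_throughput(10_000_000, 4)  # roughly 690 vectors/s
print(f"{slow:.0f} to {fast:.0f} vectors/s")
```

Measuring your own sustained rate on a small sample and plugging it into `reindex_hours` gives a first estimate of how long a model switch would block you.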
The 15‑minute sanity check
Turn theory into data with a quick, reproducible experiment.
- Export 1,000 representative vectors from your production pipeline (include a mix of recent and legacy embeddings).
- Spin up three candidates: pgvector on a local Postgres, Qdrant Docker, and a managed Pinecone sandbox.
- Index the sample set, then fire 100 real‑world queries (the exact payloads you expect in production).
- Record P95 latency, recall on a hand‑labelled relevance set, and ops overhead (setup time, required config files, monitoring hooks).
- If two solutions converge on quality and latency, pick the one with the smallest ops surface. If one fails dramatically, eliminate it before you invest weeks of engineering.
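The two numeric metrics from the steps above can be computed with a few lines of standard-library Python. This is a sketch under stated assumptions: P95 uses the nearest-rank definition, recall is measured against a hand-labelled set of relevant IDs, and the function names are illustrative.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the labelled relevant IDs that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_recall_at_k(results, labels, k):
    """Average recall@k over queries; `results` and `labels` are keyed by query ID."""
    scores = [recall_at_k(results[q], labels[q], k) for q in labels]
    return sum(scores) / len(scores)
```

With only 100 queries, P95 rests on the five slowest samples, so rerun the batch a few times before comparing candidates.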
Scaling beyond the matrix
When you outgrow the sweet spot of the matrix—say you hit 50 M vectors or need sub‑5 ms P95 at 10 k QPS—you must revisit two levers.
- Hybrid indexing. Combine a coarse‑grained IVF layer (e.g., nlist=8192) with a fine‑grained HNSW per‑cell. Both Qdrant and Weaviate expose this pattern via their quantization API.
- Sharding strategy. Partition by tenant, geographic region, or hash of the vector ID. Managed clouds handle sharding transparently; self‑hosted deployments need a routing layer (e.g., Envoy + consistent hashing) to keep latency predictable.
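For the self-hosted routing layer, the core of consistent hashing fits in a short class. The sketch below is a generic hash ring with virtual nodes, not Envoy configuration; `HashRing`, the shard names, and the choice of 64 virtual nodes per shard are assumptions for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        if not nodes:
            raise ValueError("at least one node is required")
        ring = []
        for node in nodes:
            # Each physical node appears `vnodes` times to even out the load.
            for i in range(vnodes):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._hashes = [h for h, _ in ring]
        self._nodes = [n for _, n in ring]

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 as a stable 64-bit integer.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        """Return the first node clockwise from the key's position on the ring."""
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[idx]


ring = HashRing(["shard-a", "shard-b", "shard-c"])
print(ring.node_for("tenant-42"))
```

The property that matters operationally is that adding a shard only moves keys onto the new shard; existing keys never shuffle between the old ones, which keeps re-balancing traffic bounded.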
Cost‑vs‑performance sanity check
Translate the matrix into a dollar figure for your budget holder.
- Estimate monthly vector count and growth (e.g., 8 M now, +20 %/mo).
- Pick the tier that satisfies your latency budget (e.g., Qdrant Cloud “Standard” at $0.30 per GB‑month).
- Add ops overhead: 0.2 FTE for monitoring, 0.1 FTE for backups, 0.05 FTE for CI/CD integration.
- Sum hardware, SaaS, and labour. For one recent client, the total landed at $1,200 / month, which was 30 % less than the initial “Pinecone‑only” estimate, with equal latency.
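The steps above reduce to a little arithmetic. The sketch below uses the example figures from the list (8 M vectors, +20 %/mo growth, $0.30 per GB-month, 0.35 FTE in total); the 1536-dimension float32 layout is an assumption, the function names are hypothetical, and the storage estimate covers raw vectors only, not index or payload overhead. The monthly cost of an FTE is left as a parameter because it varies widely.

```python
def projected_vectors(current: int, monthly_growth: float, months: int) -> int:
    """Compound the current vector count forward by `months`."""
    return round(current * (1 + monthly_growth) ** months)


def storage_gb(vector_count: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes), assuming float32 by default."""
    return vector_count * dims * bytes_per_value / 1e9


def monthly_cost(saas_usd: float, hardware_usd: float,
                 fte_fraction: float, fte_monthly_usd: float) -> float:
    """Sum SaaS, hardware, and labour for one month."""
    return saas_usd + hardware_usd + fte_fraction * fte_monthly_usd


OPS_FTE = 0.2 + 0.1 + 0.05               # monitoring + backups + CI/CD
gb_now = storage_gb(8_000_000, 1536)     # 49.152 GB of raw float32 vectors
saas_now = gb_now * 0.30                 # storage charge at $0.30 per GB-month
in_a_year = projected_vectors(8_000_000, 0.20, 12)
print(f"{gb_now:.1f} GB now, ${saas_now:.2f}/mo storage, {in_a_year:,} vectors in 12 months")
```

Running the projection before picking a tier is worthwhile: at +20 %/mo the corpus grows roughly ninefold in a year, which moves the example from the 1 M–10 M row into managed-service territory.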
Final checklist before you commit
Run through this list after the 15‑minute test and before signing any contract.
- Does the service expose health endpoints compatible with your existing observability stack (Prometheus, Grafana)?
- Is TLS terminated at the edge, and does the API support IAM‑style token rotation?
- Can you snapshot the index without downtime?
- Is the client library mature for your language (Python qdrant-client>=0.10, Go pinecone-go>=0.3)?
- Does the vendor publish a transparent SLA for availability and query latency?
Answering these questions gives you a defensible recommendation in under a quarter hour, sparing weeks of trial‑and‑error and keeping your vector search stack aligned with real production constraints.