Building RAG That Doesn't Hallucinate: 5 Tactics That Move the Needle
Chunk deduplication before indexing
In practice, an enterprise knowledge base typically contains 10-15 % near-duplicate text: legal footers, navigation menus, syndicated press releases, and versioned FAQs. When you index each duplicate as a separate vector, the similarity search returns a noisy mix of identical or almost-identical chunks that crowd distinct sources out of the top-k. The downstream language model then sees several passages that differ only in phrasing or version and tends to blend them during generation, which frequently yields a hallucinated statement that mixes facts from different sources.
Our production pipeline now runs a two‑stage deduplication pass:
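- Compute a 1536-dimensional text-embedding-ada-002 vector for every raw paragraph.
- Cluster the vectors with an LSH index (threshold 0.85 cosine similarity). A cosine threshold on embeddings calls for a random-projection LSH, since MinHash LSH estimates Jaccard similarity over token sets. Within each cluster, retain the longest chunk and discard the rest.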
This process reduces the index size by roughly 12 % and, more importantly, lifts retrieval precision@5 from 0.71 to 0.84 on our held-out set of 200 queries. The end-to-end hallucination rate (measured as the proportion of answers that contain any uncited factual error) drops from 18 % to 14 % after this single step. In our experience, the improvement is roughly linear: each additional 1 % reduction in duplicate density yields about a 0.5 percentage-point reduction in hallucinations.
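As a rough illustration of the cluster-and-keep-longest step, the sketch below runs a brute-force greedy pass over precomputed embedding vectors instead of an LSH index. The function name `dedupe_chunks` and the toy vectors are hypothetical; the pairwise comparison is quadratic in the number of chunks, so at production scale an approximate index would have to replace it.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedupe_chunks(
    chunks: List[str],
    vectors: List[Sequence[float]],
    threshold: float = 0.85,
) -> List[str]:
    """Keep the longest chunk of every near-duplicate group.

    Chunks are visited longest first, so a chunk is dropped only when a
    longer (or equally long) chunk within `threshold` was already kept.
    """
    order = sorted(range(len(chunks)), key=lambda i: len(chunks[i]), reverse=True)
    kept: List[int] = []
    for i in order:
        if all(cosine(vectors[i], vectors[j]) < threshold for j in kept):
            kept.append(i)
    # Return survivors in their original document order.
    return [chunks[i] for i in sorted(kept)]


if __name__ == "__main__":
    chunks = [
        "Contact us at support",
        "Contact us at support today please",
        "Rate limit is 1000",
    ]
    vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]  # toy 2-d embeddings
    print(dedupe_chunks(chunks, vectors))
```

The first two toy chunks are nearly parallel, so only the longer one survives; the third is orthogonal to both and is kept.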
Rerank only after confirming high recall
A cross-encoder reranker is a powerful but expensive tool. It re-scores the top-k retrieved chunks (typically k = 100) by passing each query-chunk pair jointly through a model such as cross-encoder/ms-marco-T5-large, unlike the bi-encoder used for first-stage retrieval, which embeds the query and the chunk separately. The key insight is that reranking cannot create a relevant chunk that the retriever never surfaced. If the retriever's recall@100 is below 0.80, the reranker often has nothing relevant to promote and merely reorders irrelevant material, so the final answer drifts further from the source.
We therefore enforce a recall gate:
# Pseudo-code
retrieved = vector_store.search(query, k=100)
# Recall needs labeled relevant chunks for the query (ground_truth_chunks).
recall = eval_recall(retrieved, ground_truth_chunks)
if recall >= 0.80:
    reranked = cross_encoder.rerank(query, retrieved)
    top_chunks = reranked[:5]
else:
    top_chunks = retrieved[:5]  # fallback to raw retrieval
On a corpus of 1.2 M support tickets, moving from 0.78 to 0.82 recall (by adding a supplemental BM25 pre‑filter) reduced the hallucination rate from 9 % to 6 % after reranking. When recall was already >0.90, the same reranker shaved another 1 % off hallucinations, confirming diminishing returns.
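Recall@k can only be measured against labeled data, so a gate like this is most naturally computed on an evaluation set. A minimal sketch of the metric follows; `recall_at_k`, `mean_recall_at_k`, and the shape of the inputs (chunk IDs keyed by query) are assumptions made for illustration, not the pipeline's actual interface.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int = 100) -> float:
    """Fraction of the labeled relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    hits = relevant_ids.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant_ids)


def mean_recall_at_k(
    runs: Dict[str, List[str]],
    labels: Dict[str, Set[str]],
    k: int = 100,
) -> float:
    """Average recall@k over every labeled query."""
    if not labels:
        raise ValueError("labels must not be empty")
    scores = [recall_at_k(runs[query], labels[query], k) for query in labels]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    runs = {"q1": ["a", "b"], "q2": ["x"]}      # retrieved chunk IDs per query
    labels = {"q1": {"a", "d"}, "q2": {"x"}}    # labeled relevant chunk IDs
    print(mean_recall_at_k(runs, labels, k=100))
```

Comparing the averaged value against 0.80 then decides whether the reranker is switched on for a given retriever configuration.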
Explicit refusal scaffolding in the prompt
The language model’s default conditioning is “answer the question”. Without an explicit guard, it will fabricate a plausible‑sounding sentence even when the retrieved context is silent. We embed a refusal clause as the first line of the system prompt and enforce it with a downstream validator:
# System prompt fragment
You are a helpful assistant. If the retrieved context does not contain a direct answer, respond exactly with:
"I don't have that information in my knowledge base. Please contact a human operator."
After deploying this scaffold, we observed a 5-percentage-point drop in hallucinations (from 14 % to 9 %). The trade-off is a modest rise in "I don't know" responses, from 2 % to 7 %, but overall precision improves because questions the context cannot answer now yield a refusal instead of a fabricated answer.
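The downstream validator is not spelled out here, so the following is only one plausible shape for it: force the canonical refusal when no context was retrieved, and collapse any partial or paraphrased refusal into the exact string. The function name `enforce_refusal` and the substring heuristic are assumptions.

```python
from typing import List

REFUSAL = (
    "I don't have that information in my knowledge base. "
    "Please contact a human operator."
)


def enforce_refusal(response: str, context_chunks: List[str]) -> str:
    """Return the canonical refusal when the model should have refused.

    Hypothetical rules:
    1. No non-empty context chunk was retrieved -> always refuse.
    2. The response contains the refusal opening (possibly mixed with other
       text or typographic apostrophes) -> normalize to the exact string.
    """
    if not any(chunk.strip() for chunk in context_chunks):
        return REFUSAL
    normalized = response.replace("\u2019", "'").lower()
    if "i don't have that information" in normalized:
        return REFUSAL
    return response.strip()


if __name__ == "__main__":
    print(enforce_refusal("The limit is 1,000.", []))
    print(enforce_refusal(" The limit is 1,000. ", ["Rate limit: 1,000 req/min"]))
```

Normalizing to one exact string also makes refusals easy to count downstream.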
Cite‑or‑die output schema
Requiring the model to emit a structured citation forces a tight coupling between text generation and source material. Our JSON schema looks like this:
{
  "answer": "The API rate limit is 1,000 requests per minute.",
  "citation": {
    "chunk_id": "doc_2023_04_12_001",
    "quoted_span": "Rate limit: 1,000 req/min"
  }
}
We post-process each response with a verifier that loads the chunk referenced by chunk_id from the vector store and checks that quoted_span appears in it verbatim (ignoring whitespace). If the span is missing or altered, we treat the answer as a hallucination, replace it with the refusal clause, and log the incident for retraining.
In our rollout, this schema cut hallucinations from 9 % to 3 % while preserving answer length. The verifier adds ~15 ms latency per request, which is acceptable for most internal SLAs (≤200 ms total latency).
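A minimal sketch of such a verifier is shown below, assuming the chunks can be looked up in a plain dictionary keyed by chunk_id (a stand-in for the vector store) and that malformed JSON is treated the same way as a failed citation. The names `verify_response` and `chunk_store` are hypothetical.

```python
import json
import re
from typing import Dict, Tuple

REFUSAL = (
    "I don't have that information in my knowledge base. "
    "Please contact a human operator."
)


def _squash(text: str) -> str:
    """Remove all whitespace so the comparison ignores spacing and line breaks."""
    return re.sub(r"\s+", "", text)


def verify_response(raw_json: str, chunk_store: Dict[str, str]) -> Tuple[str, bool]:
    """Return (text_to_show, citation_ok) for one model response."""
    try:
        payload = json.loads(raw_json)
        answer = payload["answer"]
        chunk_id = payload["citation"]["chunk_id"]
        span = payload["citation"]["quoted_span"]
    except (ValueError, KeyError, TypeError):
        return REFUSAL, False  # malformed output counts as a failed citation

    chunk = chunk_store.get(chunk_id)
    if chunk is None or not isinstance(span, str) or not _squash(span):
        return REFUSAL, False  # unknown chunk or empty quote

    if _squash(span) in _squash(chunk):
        return answer, True
    return REFUSAL, False  # quote was altered or does not exist in the chunk


if __name__ == "__main__":
    store = {"doc_2023_04_12_001": "Limits\nRate limit:   1,000 req/min for all tiers."}
    response = json.dumps({
        "answer": "The API rate limit is 1,000 requests per minute.",
        "citation": {
            "chunk_id": "doc_2023_04_12_001",
            "quoted_span": "Rate limit: 1,000 req/min",
        },
    })
    print(verify_response(response, store))
```

The boolean in the return value is what a caller would log and count as a citation mismatch.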
Eval‑gated deployments
Every code change—whether it is a new embedding model, a different tokenizer, or a tweak to the reranker threshold—must pass a regression gate. The gate consists of a curated evaluation set of 80 real‑world questions, each paired with a verified ground‑truth answer drawn from the source corpus.
Our CI pipeline runs the full RAG stack on this set after each commit:
- Generate answers with the candidate build.
- Run the citation verifier.
- Compute hallucination rate = (hallucinated answers) / 80.
- If the rate increases by more than 1 % relative to the baseline, abort the deployment.
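The last two steps can be sketched as a small gate function. This reads the 1 % threshold as one percentage point of absolute increase, which is an assumption; the function names and the boolean-per-question input format are likewise hypothetical.

```python
from typing import List


def hallucination_rate(verdicts: List[bool]) -> float:
    """Share of answers flagged as hallucinated (True = hallucinated)."""
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    return sum(verdicts) / len(verdicts)


def passes_gate(
    candidate: List[bool],
    baseline: List[bool],
    max_increase: float = 0.01,  # assumption: 1 percentage point, absolute
) -> bool:
    """True when the candidate build does not regress beyond the allowed margin."""
    delta = hallucination_rate(candidate) - hallucination_rate(baseline)
    return delta <= max_increase


if __name__ == "__main__":
    baseline = [True] * 3 + [False] * 77    # 3 of 80 hallucinated
    candidate = [True] * 4 + [False] * 76   # 4 of 80 hallucinated
    print(passes_gate(candidate, baseline))
```

With 80 questions, one additional hallucinated answer moves the rate by 1.25 percentage points, so under this reading any single new failure blocks the deployment. A CI step could exit non-zero whenever `passes_gate` returns False.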
This disciplined approach prevented two regressions that would have otherwise shipped: a switch from text‑embedding‑3‑large to a newer 768‑dim model (which introduced subtle vector drift) and a hyper‑parameter change in the cross‑encoder that improved NDCG but doubled hallucinations.
What didn’t work
- “Be careful not to hallucinate” in the user prompt. The phrase is ignored once the model sees any context; it does not translate into a measurable safety gain.
- Scaling up to a larger model for safety. Moving from gpt-3.5-turbo to gpt-4-turbo reduced raw hallucination frequency by ~2 % but increased the average confidence of false answers, making them harder for humans to spot.
- Multi-prompting the same question. Feeding ten paraphrases of the user query inflated token usage by 300 % with no observable lift in recall or precision.
Empirical progression on a production RAG
Below is a condensed timeline from the first naive deployment to the final gated release. All numbers are derived from the same 80‑question evaluation set, scored by two senior engineers.
Phase                            Hallucination rate
---------------------------------------------------
Naive RAG, top-5 retrieval       18 %
+ Chunk deduplication            14 %
+ Refusal scaffold                9 %
+ Cite-or-die schema              3 %
+ Rerank (recall ≥ 0.80)          2 %
Note that the absolute percentages are less important than the monotonic decline. The phases are cumulative, so each figure shows what a tactic added on top of the previous ones, not its effect in isolation.
Monitoring hallucinations in production
Even after a gated release, drift can occur as the underlying corpus evolves. We therefore instrument three real‑time signals:
- Citation mismatch rate. Percentage of responses where the verifier cannot locate the quoted span. Alert threshold: 0.5 % over a 5‑minute window.
- Refusal frequency. Spike in “I don’t have that information” responses can indicate a regression in retrieval recall. Alert threshold: 10 % increase week‑over‑week.
- Human‑in‑the‑loop audit. Randomly sample 1 % of live queries and have a subject‑matter expert label hallucinations. Feed the results back into the CI gate.
Our alerting dashboard (Grafana + Loki) has caught three incidents in the past six months: a misconfigured index refresh that re-introduced duplicates, a version bump of the cross-encoder that subtly altered scoring, and a downstream API latency spike that caused the reranker to time out, forcing the fallback path.
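To make the first signal concrete, here is a small sliding-window sketch of the citation mismatch alert. The class name, the explicit timestamps, and the `min_samples` guard (which avoids alerting on a handful of requests) are assumptions; in practice the same rule would more likely live in the metrics backend than in application code.

```python
from collections import deque
from typing import Deque, Tuple


class MismatchRateMonitor:
    """Tracks the citation mismatch rate over a sliding time window."""

    def __init__(
        self,
        window_seconds: float = 300.0,   # 5-minute window
        threshold: float = 0.005,        # alert above 0.5 %
        min_samples: int = 200,          # hypothetical guard against tiny samples
    ) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_samples = min_samples
        self._events: Deque[Tuple[float, bool]] = deque()
        self._mismatches = 0

    def record(self, timestamp: float, mismatch: bool) -> bool:
        """Record one verified response; return True if the alert should fire."""
        self._events.append((timestamp, mismatch))
        self._mismatches += int(mismatch)

        # Evict events that fell out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            _, old_mismatch = self._events.popleft()
            self._mismatches -= int(old_mismatch)

        total = len(self._events)
        if total < self.min_samples:
            return False
        return self._mismatches / total > self.threshold


if __name__ == "__main__":
    monitor = MismatchRateMonitor(min_samples=100)
    for second in range(199):
        monitor.record(float(second), False)
    print(monitor.record(199.0, True))   # 1 of 200 = 0.5 %, not above threshold
    print(monitor.record(200.0, True))   # 2 of 201, above threshold
```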
Scaling the pipeline
When the knowledge base grew from 200 k to 1.5 M documents, we observed two bottlenecks:
- Embedding latency. Switching to text-embedding-3-large on a RunPod GPU cluster reduced per-document embedding time from 45 ms to 12 ms, enabling nightly re-indexing.
- Reranker throughput. The cross-encoder was the single point of contention at ~150 QPS. We mitigated this by batching queries (max batch size 8) and deploying a Modal serverless endpoint that auto-scales to 4 × baseline.
Even at peak load (≈2 k QPS), overall latency stayed under 250 ms, well within our SLA of 300 ms.
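The batching idea can be illustrated with a simple static splitter that groups pending reranker requests into batches of at most eight. This is only a sketch: `make_batches` is a hypothetical helper, and a real serving endpoint would typically batch dynamically, with a timeout so that a partially filled batch is not held back.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(items: Sequence[T], max_batch_size: int = 8) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `max_batch_size` entries."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        yield list(items[start:start + max_batch_size])


if __name__ == "__main__":
    pending_queries = [f"query-{i}" for i in range(19)]
    for batch in make_batches(pending_queries):
        # Each batch would be sent to the cross-encoder in one forward pass.
        print(len(batch), batch[0])
```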
Future‑proofing the safety stack
Two avenues are under active investigation:
- Self‑verification. Prompt the model to generate a second “self‑check” statement that repeats the citation verification logic. Early trials show a 0.7 % further reduction in hallucinations without extra compute.
- Retrieval-augmented fine-tuning. Fine-tune a small decoder (e.g., Llama-2-7B-Chat) on {question, retrieved_chunks, answer, citation} quadruples. The model learns to trust the citation field, reducing the need for an external verifier.
Both approaches keep the core philosophy intact: rely on concrete evidence from the index, never on the model’s imagination.
This is part of the Production AI cornerstone series.