Building RAG That Doesn't Hallucinate: 5 Tactics That Move the Needle
Five tactics that actually moved hallucination rate down on a real production RAG system. Numbers, code, and the tactics that didn't work.
Tactic 1 — chunk dedup before indexing
Most corpora have near-duplicates. Boilerplate footers. Re-published articles. Multiple versions of the same FAQ. Each duplicate chunk dilutes retrieval: near-identical text crowds the top-k and pushes out the chunk that actually answers the question. Run a near-dup pass at index time (MinHash or simple cosine on embeddings) and drop the duplicates. On the corpus where I first ran this, hallucination rate dropped 18% from this step alone.
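A minimal sketch of the embedding-cosine variant, assuming you already have an `embed` function that returns one vector per chunk; the 0.95 threshold is a placeholder you would tune on your own corpus:

```python
import numpy as np

def dedup_chunks(chunks, embed, threshold=0.95):
    """Drop near-duplicate chunks before indexing.

    `embed` is whatever embedding function the pipeline already uses
    (assumed here to return one vector per chunk); `threshold` is a
    cosine-similarity cutoff to tune on your own corpus.
    """
    vectors = np.asarray(embed(chunks), dtype=float)
    # Normalise rows so a dot product is a cosine similarity.
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

    kept_chunks, kept_vectors = [], []
    for chunk, vec in zip(chunks, vectors):
        if kept_vectors and np.max(np.stack(kept_vectors) @ vec) >= threshold:
            continue  # near-duplicate of something already kept
        kept_chunks.append(chunk)
        kept_vectors.append(vec)
    return kept_chunks
```

The pairwise pass is quadratic in corpus size; for large corpora, MinHash over shingles is the cheaper route.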
Tactic 2 — rerank, but only when retrieval recall is high
Reranking only helps if your top-100 retrieval contains the right answer. If it doesn't, reranking just changes which wrong chunk wins. Measure retrieval recall@100 on a held-out set first. If it's below 0.8, fix retrieval before adding a reranker. Above 0.8, a small cross-encoder reranker can lift end-to-end precision noticeably.
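A sketch of the recall measurement, assuming a held-out set of (question, gold chunk id) pairs and a `retrieve` function that returns ranked chunk ids; both names are stand-ins for whatever your pipeline exposes:

```python
def recall_at_k(eval_set, retrieve, k=100):
    """Fraction of eval questions whose gold chunk appears in the top-k.

    `eval_set` is a list of (question, gold_chunk_id) pairs; `retrieve`
    returns ranked chunk ids. Both are stand-ins for your own pipeline.
    """
    hits = 0
    for question, gold_chunk_id in eval_set:
        top_k_ids = retrieve(question, k=k)
        if gold_chunk_id in top_k_ids:
            hits += 1
    return hits / len(eval_set)

# Gate the reranker decision on the measurement, not on hope:
# if recall_at_k(eval_set, retrieve) < 0.8, fix retrieval first.
```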
Tactic 3 — explicit refusal scaffold in the prompt
The scaffold goes straight into the system prompt:

If the retrieved context does not directly answer the question:
- Say "I don't have that in my knowledge base."
- Do not infer an answer from general knowledge.
- Suggest the user contact <HUMAN_HANDOFF>.
This sounds basic. It moves the needle anyway. The default model behaviour is to answer; you have to explicitly authorise refusal.
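One way this might be wired in, as a sketch: the scaffold text is the list above verbatim, while `build_system_prompt` and the surrounding instruction wording are hypothetical glue, not the exact production prompt.

```python
REFUSAL_SCAFFOLD = """\
If the retrieved context does not directly answer the question:
- Say "I don't have that in my knowledge base."
- Do not infer an answer from general knowledge.
- Suggest the user contact <HUMAN_HANDOFF>.
"""

def build_system_prompt(context_chunks):
    # Hypothetical assembly; the point is that the refusal rules sit
    # right next to the retrieved context, not buried in a generic preamble.
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"{REFUSAL_SCAFFOLD}\n"
        f"Context:\n{context}"
    )
```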
Tactic 4 — cite-or-die
Force the model to quote the chunk it's answering from. Output schema:
{ "answer": str, "citation": { "chunk_id": str, "quoted_span": str } }
Then verify the quote actually appears in the cited chunk. If it doesn't, treat the response as a hallucination: suppress the answer and return the refusal instead. This catches the "model invents an answer that sounds like the docs" failure mode reliably.
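A sketch of that verification step, assuming the model output is JSON in the schema above and that `chunks_by_id` maps chunk ids to the retrieved chunk text:

```python
import json

def verify_citation(model_output: str, chunks_by_id: dict) -> dict:
    """Accept the answer only if the quoted span really appears in the cited chunk.

    `model_output` is assumed to be JSON in the schema above;
    `chunks_by_id` maps chunk_id to chunk text from the retrieval step.
    """
    refusal = {"answer": "I don't have that in my knowledge base.", "citation": None}
    try:
        parsed = json.loads(model_output)
        chunk_id = parsed["citation"]["chunk_id"]
        quoted_span = parsed["citation"]["quoted_span"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return refusal  # malformed output gets the same treatment as a bad quote

    chunk_text = chunks_by_id.get(chunk_id, "")
    if quoted_span and quoted_span in chunk_text:
        return parsed
    return refusal  # quote not found in the cited chunk: treat as hallucination
```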
Tactic 5 — eval-gated deploys
Build a held-out set of 30-100 question/correct-answer pairs. Run it on every prompt change, every embedding change, every reranker change. If hallucination rate goes up by >1% on a change, the change doesn't ship.
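A sketch of the gate, reading ">1%" as one percentage point; `answer_fn`, `judge_fn`, and the stored `baseline_rate` are stand-ins for your own pipeline, scoring setup, and the rate recorded for the currently shipped config:

```python
def eval_gate(questions, answer_fn, judge_fn, baseline_rate, max_regression=0.01):
    """Fail the deploy if hallucination rate regresses past the baseline.

    `answer_fn` runs the candidate pipeline on a question; `judge_fn`
    returns True when an answer is judged a hallucination (human review
    or an automated judge); `baseline_rate` is the shipped config's rate.
    """
    hallucinations = sum(judge_fn(q, answer_fn(q)) for q in questions)
    rate = hallucinations / len(questions)
    if rate > baseline_rate + max_regression:
        raise SystemExit(
            f"Hallucination rate {rate:.1%} vs baseline {baseline_rate:.1%}: blocked."
        )
    return rate
```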
What didn't work
- "Be careful not to hallucinate" in the prompt. Pure incantation. Doesn't survive contact with edge cases.
- Using a bigger model "to be safer". Bigger models hallucinate less in absolute terms but produce more convincing hallucinations. Higher-quality wrongness is worse, not better.
- Showing the model the user's question phrased ten ways. Adds tokens without adding signal.
Numbers from a real deploy
| Phase | Hallucination rate (eval set) |
|---|---|
| Naive RAG, top-5 | ~18% |
| + chunk dedup | ~14% |
| + refusal scaffold | ~9% |
| + cite-or-die | ~3% |
| + rerank (recall@100 > 0.8 met) | ~2% |
Eval set was 80 questions, scored by a human reviewer against the source corpus. Numbers are illustrative of the shape of improvement, not a vendor benchmark.