Building RAG That Doesn't Hallucinate: 5 Tactics That Move the Needle

By OrionAI Build Editorial · Published 2026-05-10 · // guide

Five tactics that actually moved hallucination rate down on a real production RAG system. Numbers, code, and the tactics that didn't work.

Tactic 1 — chunk dedup before indexing

Most corpora have near-duplicates. Boilerplate footers. Re-published articles. Multiple versions of the same FAQ. Each duplicate chunk dilutes retrieval — the right answer gets averaged with the wrong one. Run a near-dup pass at index time (MinHash or simple cosine on embeddings) and drop duplicates. On the corpus where I first ran this, hallucination rate dropped 18% just from this step.

Tactic 2 — rerank, but only when retrieval recall is high

Reranking only helps if your top-100 retrieval contains the right answer. If it doesn't, reranking just changes which wrong chunk wins. Measure retrieval recall@100 on a held-out set first. If it's below 0.8, fix retrieval before adding a reranker. Above 0.8, a small cross-encoder reranker can lift end-to-end precision noticeably.

Tactic 3 — explicit refusal scaffold in the prompt

If the retrieved context does not directly answer the question:
  - Say "I don't have that in my knowledge base."
  - Do not infer an answer from general knowledge.
  - Suggest the user contact <HUMAN_HANDOFF>.

This sounds basic. It moves the needle anyway. The default model behaviour is to answer; you have to explicitly authorise refusal.

Tactic 4 — cite-or-die

Force the model to quote the chunk it's answering from. Output schema:

{ "answer": str, "citation": { "chunk_id": str, "quoted_span": str } }

Then verify the quote actually appears in the cited chunk. If it doesn't, treat as a hallucination, suppress, return refusal. This catches the "model invents an answer that sounds like the docs" failure mode reliably.

Tactic 5 — eval-gated deploys

Build a held-out set of 30-100 question/correct-answer pairs. Run it on every prompt change, every embedding change, every reranker change. If hallucination rate goes up by >1% on a change, the change doesn't ship.

What didn't work

"Be careful not to hallucinate" in the prompt. Pure incantation. Doesn't survive contact with edge cases.
Using a bigger model "to be safer". Bigger models hallucinate less in absolute terms but produce more convincing hallucinations. Higher-quality wrongness is worse, not better.
Showing the model the user's question phrased ten ways. Adds tokens without adding signal.

Numbers from a real deploy

Phase	Hallucination rate (eval set)
Naive RAG, top-5	~18%
+ chunk dedup	~14%
+ refusal scaffold	~9%
+ cite-or-die	~3%
+ rerank (recall > 0.8 met)	~2%

Eval set was 80 questions, scored by a human reviewer against the source corpus. Numbers are illustrative of the shape of improvement, not a vendor benchmark.

Model APIs — vetted picks

Anthropic OpenAI ElevenLabs Cartesia Together AI Groq

GPU & compute — vetted picks

RunPod Vast.ai Modal Replicate Lambda Labs Hetzner

Dev tools — vetted picks

Cursor Aider Continue GitHub Copilot