Building RAG That Doesn't Hallucinate: 5 Tactics That Move the Needle

By OrionAI Build Editorial · Published 2026-05-10 · // guide

Chunk deduplication before indexing

In practice, a typical enterprise knowledge base contains a 10‑15 % layer of near‑duplicate text: legal footers, navigation menus, syndicated press releases, and versioned FAQs. When you index each duplicate as a separate vector, the similarity search returns a noisy mix of identical or almost‑identical chunks that crowd distinct sources out of the top‑k. The downstream language model then sees multiple “answers” that differ only in phrasing or version, and it tends to blend them during generation, which can yield a hallucinated statement that mixes facts from conflicting versions of the same source.

Our production pipeline now runs a two‑stage deduplication pass:

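  1. Compute a 1536‑dimensional text‑embedding‑ada‑002 vector for every raw paragraph.
  2. Cluster the vectors with locality‑sensitive hashing at a cosine‑similarity threshold of 0.85. (MinHash LSH estimates Jaccard similarity over token shingles, so dense embeddings call for a random‑projection variant.) Within each cluster, retain the longest chunk and discard the rest.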
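The two‑stage pass above can be sketched as follows. This is a brute‑force stand‑in rather than an LSH index: it compares each vector against one seed per cluster, which is fine for illustration but would be too slow for a large corpus. The function names (cosine, deduplicate) are hypothetical, and the embedding vectors are assumed to have been computed already.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(
    chunks: List[str],
    vectors: List[Sequence[float]],
    threshold: float = 0.85,
) -> List[str]:
    """Greedy clustering: a chunk joins the first cluster whose seed vector is
    at least `threshold` similar; the longest chunk of each cluster is kept."""
    seeds: List[Sequence[float]] = []  # one representative vector per cluster
    keep: List[str] = []               # longest chunk seen so far per cluster
    for text, vec in zip(chunks, vectors):
        for i, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                if len(text) > len(keep[i]):
                    keep[i] = text
                break
        else:
            # No existing cluster was close enough: start a new one.
            seeds.append(vec)
            keep.append(text)
    return keep
```

Because the comparison is against a single seed per cluster, the result depends on input order; a production version would likely replace the inner loop with an approximate nearest‑neighbour lookup.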

This process reduces the index size by roughly 12 % and, more importantly, lifts retrieval precision@5 from 0.71 to 0.84 on our held‑out set of 200 queries. The end‑to‑end hallucination rate, measured as the proportion of answers that contain any uncited factual error, drops from 18 % to 14 % after this single step. In our experience the improvement scales roughly with the amount of duplication removed: here, a 12 % smaller index coincided with a 4‑point drop, or about 0.3 percentage points per 1 % of duplicates removed.

Rerank only after confirming high recall

A cross‑encoder reranker is a powerful but expensive tool. It re‑scores the top‑k retrieved chunks (typically k = 100) by feeding each query‑chunk pair jointly through a model such as cross‑encoder/ms‑marco‑T5‑large, unlike a bi‑encoder, which embeds query and chunk separately. The key insight is that reranking cannot create a relevant chunk that the retriever never surfaced. If the retriever’s recall@100 is below 0.80, the reranker is often left shuffling irrelevant material, and the final answer can drift further from the source.

We therefore enforce a recall gate:

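# Pseudo‑code (Python‑style). vector_store, cross_encoder, and eval_recall
# stand in for your own components.
RECALL_GATE = 0.80

# Measured offline on a labeled eval set: ground-truth chunks do not exist
# for live queries, so the gate is computed ahead of time, not per request.
recall = eval_recall(eval_queries, ground_truth_chunks, k=100)

def select_chunks(query):
    retrieved = vector_store.search(query, k=100)
    if recall >= RECALL_GATE:
        reranked = cross_encoder.rerank(query, retrieved)
        return reranked[:5]
    return retrieved[:5]  # fallback to raw retrieval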

On a corpus of 1.2 M support tickets, moving from 0.78 to 0.82 recall (by adding a supplemental BM25 pre‑filter) reduced the post‑rerank hallucination rate from 9 % to 6 %. When recall was already above 0.90, the same reranker shaved only another 1 percentage point off hallucinations, consistent with diminishing returns.
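The article does not define how recall is computed, so here is one plausible sketch of a recall@k metric over a labeled evaluation set. The function names and the representation of chunks as string IDs are assumptions made for illustration.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int = 100) -> float:
    """Fraction of the relevant chunks that appear among the top-k retrieved IDs."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    hits = relevant_ids.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant_ids)


def mean_recall_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int = 100) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one per eval query."""
    if not runs:
        raise ValueError("runs must not be empty")
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(scores) / len(scores)
```

Averaging per query (rather than pooling all chunks) keeps queries with many relevant chunks from dominating the figure that is compared against the 0.80 gate.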

Explicit refusal scaffolding in the prompt

The language model’s default conditioning is “answer the question”. Without an explicit guard, it will often fabricate a plausible‑sounding sentence even when the retrieved context is silent. We place a refusal clause at the top of the system prompt and enforce it with a downstream validator:

# System prompt fragment
You are a helpful assistant. If the retrieved context does not contain a direct answer, respond exactly with:
"I don't have that information in my knowledge base. Please contact a human operator."

After deploying this scaffold, we observed a 5‑point absolute drop in hallucinations (from 14 % to 9 %). The trade‑off is a modest rise in “I don’t know” responses, from 2 % to 7 %, but overall precision improves because answers that would have been fabricated from silent context are replaced by refusals.
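The downstream validator is not described in detail, so the following is only a guess at its simplest form: force the exact refusal string when nothing was retrieved, and track how often refusals occur. The names validate_response and refusal_rate are hypothetical.

```python
from typing import List

# Exact refusal text from the system prompt fragment.
REFUSAL = (
    "I don't have that information in my knowledge base. "
    "Please contact a human operator."
)


def validate_response(response: str, context_chunks: List[str]) -> str:
    """Return the refusal verbatim when no non-empty context was retrieved;
    otherwise pass the model's response through with whitespace trimmed."""
    if not any(chunk.strip() for chunk in context_chunks):
        return REFUSAL
    return response.strip()


def refusal_rate(responses: List[str]) -> float:
    """Share of responses that are exactly the refusal string."""
    if not responses:
        return 0.0
    return sum(r.strip() == REFUSAL for r in responses) / len(responses)
```

Matching on the exact string is what makes the refusal rate cheap to measure; a paraphrased refusal would slip past this check.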

Cite‑or‑die output schema

Requiring the model to emit a structured citation forces a tight coupling between text generation and source material. Our JSON schema looks like this:

{
  "answer": "The API rate limit is 1,000 requests per minute.",
  "citation": {
    "chunk_id": "doc_2023_04_12_001",
    "quoted_span": "Rate limit: 1,000 req/min"
  }
}

We post‑process each response with a verifier that loads the chunk referenced by chunk_id from the vector store and checks that quoted_span appears in it verbatim (ignoring whitespace). If the span is missing or altered, we treat the answer as a hallucination, replace it with the refusal clause, and log the incident for retraining.
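A minimal sketch of such a verifier is shown here. It assumes the chunk store can be represented as a plain dict from chunk_id to text, and it interprets “ignoring whitespace” as stripping all whitespace from both sides before comparing; both are assumptions, and the function names are hypothetical.

```python
import json
from typing import Dict

REFUSAL = (
    "I don't have that information in my knowledge base. "
    "Please contact a human operator."
)


def _squash(text: str) -> str:
    """Remove all whitespace so the comparison ignores spacing and line breaks."""
    return "".join(text.split())


def verify(raw_response: str, chunk_store: Dict[str, str]) -> str:
    """Return the answer if its quoted span occurs in the cited chunk,
    otherwise return the refusal clause."""
    try:
        payload = json.loads(raw_response)
        citation = payload["citation"]
        chunk_text = chunk_store[citation["chunk_id"]]
        span = _squash(citation["quoted_span"])
        answer = payload["answer"]
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # Malformed JSON, missing fields, or an unknown chunk_id all fail closed.
        return REFUSAL
    if span and span in _squash(chunk_text):
        return answer
    return REFUSAL
```

The check fails closed: anything that cannot be parsed or located is treated the same way as an altered span. Logging the incident is left out to keep the sketch short.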

In our rollout, this schema cut hallucinations from 9 % to 3 % while preserving answer length. The verifier adds ~15 ms latency per request, which is acceptable for most internal SLAs (≤200 ms total latency).

Eval‑gated deployments

Every code change—whether it is a new embedding model, a different tokenizer, or a tweak to the reranker threshold—must pass a regression gate. The gate consists of a curated evaluation set of 80 real‑world questions, each paired with a verified ground‑truth answer drawn from the source corpus.

Our CI pipeline runs the full RAG stack on this set after each commit:

  1. Generate answers with the candidate build.
  2. Run the citation verifier.
  3. Compute hallucination rate = (hallucinated answers) / 80.
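  4. If the rate rises by more than 1 percentage point over the baseline, abort the deployment. With 80 questions, a single additional hallucinated answer is 1.25 points, so any new hallucination fails the gate.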
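The gate logic in the steps above might look like the sketch below. It reads the 1 % threshold as one percentage point of absolute increase, which is an assumption, and it takes the per‑question verdicts as a list of booleans; the function names are hypothetical.

```python
from typing import List


def hallucination_rate(flags: List[bool]) -> float:
    """Share of answers flagged as hallucinated (True means hallucinated)."""
    if not flags:
        raise ValueError("flags must not be empty")
    return sum(flags) / len(flags)


def gate_passes(
    candidate_flags: List[bool],
    baseline_rate: float,
    max_increase: float = 0.01,
) -> bool:
    """True if the candidate's rate exceeds the baseline by at most `max_increase`."""
    return hallucination_rate(candidate_flags) - baseline_rate <= max_increase
```

In a CI job, the caller would exit with a non‑zero status when gate_passes returns False so that the deployment step never runs.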

This disciplined approach prevented two regressions that would have otherwise shipped: a switch from text‑embedding‑3‑large to a newer 768‑dim model (which introduced subtle vector drift) and a hyper‑parameter change in the cross‑encoder that improved NDCG but doubled hallucinations.

What didn’t work

Empirical progression on a production RAG

Below is a condensed timeline from the first naive deployment to the final gated release. All numbers are derived from the same 80‑question evaluation set, scored by two senior engineers.

Phase                         Hallucination Rate
-------------------------------------------------
Naive RAG, top‑5 retrieval        18 %
+ Chunk deduplication             14 %
+ Refusal scaffold                9 %
+ Cite‑or‑die schema              3 %
+ Rerank (recall ≥ 0.80)          2 %

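Note that the absolute percentages are less important than the monotonic decline; each tactic delivered a further reduction on top of the ones before it. Because the tactics were added cumulatively, the table does not show how each would perform in isolation.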

Monitoring hallucinations in production

Even after a gated release, drift can occur as the underlying corpus evolves. We therefore instrument three real‑time signals:

  1. Citation mismatch rate. Percentage of responses where the verifier cannot locate the quoted span. Alert threshold: 0.5 % over a 5‑minute window.
  2. Refusal frequency. A spike in “I don’t have that information” responses can indicate a regression in retrieval recall. Alert threshold: 10 % increase week‑over‑week.
  3. Human‑in‑the‑loop audit. Randomly sample 1 % of live queries and have a subject‑matter expert label hallucinations. Feed the results back into the CI gate.
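As an illustration of the first signal, the sketch below tracks verifier outcomes over a sliding 5‑minute window and reports when the mismatch rate exceeds 0.5 %. The class name and its methods are hypothetical; in the setup described here the equivalent logic would presumably live in the alerting stack rather than in application code.

```python
from collections import deque
from typing import Deque, Tuple


class MismatchMonitor:
    """Tracks citation-verifier outcomes over a sliding time window."""

    def __init__(self, window_seconds: float = 300.0, threshold: float = 0.005):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events: Deque[Tuple[float, bool]] = deque()  # (timestamp, mismatch)

    def record(self, timestamp: float, mismatch: bool) -> None:
        """Add one verifier outcome and drop events that fell out of the window."""
        self.events.append((timestamp, mismatch))
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()

    def rate(self) -> float:
        """Mismatch rate over the events currently in the window."""
        if not self.events:
            return 0.0
        return sum(mismatch for _, mismatch in self.events) / len(self.events)

    def should_alert(self) -> bool:
        return self.rate() > self.threshold
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the class easy to test and assumes events arrive in time order.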

Our alerting dashboard (Grafana + Loki) has caught three incidents in the past six months: a misconfigured index refresh that re‑introduced duplicates, a version bump of the cross‑encoder that subtly altered scoring, and a downstream API latency spike that caused the reranker to time out, forcing the fallback path.

Scaling the pipeline

When the knowledge base grew from 200 k to 1.5 M documents, we observed two bottlenecks:

Even at peak load (≈2 k QPS), overall latency stayed under 250 ms, well within our SLA of 300 ms.

Future‑proofing the safety stack

Two avenues are under active investigation:

  1. Self‑verification. Prompt the model to generate a second “self‑check” statement that repeats the citation verification logic. Early trials show a 0.7 % further reduction in hallucinations with no extra model call, only the additional generated tokens.
  2. Retrieval‑augmented fine‑tuning. Fine‑tune a small decoder (e.g., Llama‑2‑7B‑Chat) on {question, retrieved_chunks, answer, citation} quadruples. The model learns to ground its answer in the cited chunk, reducing reliance on an external verifier.
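For the fine‑tuning avenue, one possible way to serialize the quadruple as a JSON Lines record is sketched below. The field layout and the build_record helper are assumptions for illustration, not a format the article specifies; the helper rejects any example whose citation does not hold, so that only verifiable examples reach the training set.

```python
import json
from typing import Dict


def build_record(
    question: str,
    retrieved_chunks: Dict[str, str],
    answer: str,
    chunk_id: str,
    quoted_span: str,
) -> str:
    """Serialize one training example as a single JSON line.

    Raises ValueError if the cited chunk is not among the retrieved chunks
    or the quoted span does not occur in it."""
    if chunk_id not in retrieved_chunks:
        raise ValueError(f"unknown chunk_id: {chunk_id}")
    if quoted_span not in retrieved_chunks[chunk_id]:
        raise ValueError("quoted_span not found in cited chunk")
    record = {
        "question": question,
        "retrieved_chunks": [
            {"chunk_id": cid, "text": text} for cid, text in retrieved_chunks.items()
        ],
        "answer": answer,
        "citation": {"chunk_id": chunk_id, "quoted_span": quoted_span},
    }
    return json.dumps(record, ensure_ascii=False)
```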

Both approaches keep the core philosophy intact: rely on concrete evidence from the index, never on the model’s imagination.

This is part of the Production AI cornerstone series.