Why Your RAG System Still Hallucinates (And What to Do About It)
Giving an LLM better documents reduces hallucination - but does not eliminate it. Peer-reviewed research documents hallucination rates of 17-33% in commercial legal RAG tools. The fix is architectural.
Most engineering teams adopt Retrieval-Augmented Generation with a simple belief: give the LLM the right documents and it will stop making things up. Peer-reviewed research across law, medicine, and general knowledge domains confirms otherwise: RAG reduces hallucination rates but does not eliminate them. The gap between "reduced" and "eliminated" is where real damage happens.
The architectural fix - making hallucination at the retrieval layer structurally impossible - is what Ragionex ships through POST https://api.ragionex.com/v1/knowledge/search.
The Evidence: RAG Systems Hallucinate at Alarming Rates
Legal AI: 17-33% Hallucination Despite RAG
The most rigorous evaluation of RAG-based AI tools to date comes from Stanford's RegLab and the Institute for Human-Centered Artificial Intelligence (HAI). In their preregistered study "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools", researchers Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, and Daniel E. Ho tested the leading commercial legal AI products - all of which use retrieval-augmented generation as their core architecture.
The findings were stark:
- Lexis+ AI (LexisNexis) hallucinated on more than 17% of queries and delivered accurate responses on only 65% of them.
- Westlaw AI-Assisted Research (Thomson Reuters) hallucinated on nearly 33% of queries and delivered accurate responses on just 42%.
- General-purpose LLMs without RAG (GPT-4, GPT-3.5, Llama 2) hallucinated between 58% and 88% of the time (Dahl et al., 2024).
Yes, RAG cut hallucination rates roughly in half compared to vanilla LLMs. But "halving" a catastrophic failure rate still leaves a catastrophic failure rate. In legal work, a single fabricated case citation can result in sanctions, malpractice liability, and harm to clients. The Stanford team concluded that the providers' "hallucination-free" marketing claims were "overstated."
Published in the Journal of Empirical Legal Studies (2025), this remains the first preregistered empirical evaluation of commercial RAG-based legal research tools.
Medical AI: 0-6% Hallucination with Curated Sources
In healthcare, the stakes are arguably even higher. A 2025 study published in JMIR Cancer - "Reducing Hallucinations and Trade-Offs in Responses in Generative AI Chatbots for Cancer Information" by Nishisako, Higashi, and Wakao - tested RAG chatbots built on GPT-4 and GPT-3.5 using 62 cancer-related questions.
Their results by chatbot configuration (overall hallucination rates across all 62 questions):
| Configuration | GPT-4 | GPT-3.5 |
|---|---|---|
| RAG with curated cancer info source | 0% hallucination | 3% hallucination |
| RAG with Google search results | 13% hallucination | 23% hallucination |
| No RAG (conventional chatbot) | ~37% hallucination | ~40% hallucination |
The best-case result - 0% hallucination with GPT-4 and a curated, domain-specific source - looks promising at first glance. But note two things. First, when the questions went beyond the curated source's coverage, hallucination rates jumped to 19% for GPT-4 and 35% for GPT-3.5, even with RAG active. Second, the test set contained only 62 questions. In production, you face thousands of unpredictable queries. A 0% result on 62 controlled questions does not mean 0% at scale.
The pattern across both studies is clear: RAG helps, but it does not solve the problem. The hallucination rate depends heavily on the quality of the retrieval corpus, the complexity of the query, and the specific LLM used. Change any of those variables and the rate shifts - sometimes dramatically.
Why RAG Still Hallucinates: The Generation Problem
To understand why RAG fails to eliminate hallucination, you need to understand what RAG actually does at query time.
A standard RAG pipeline (which Ragionex's Context Engine deliberately does not follow) has two phases, sketched in code after the list:
- Retrieval: The user's question is matched to relevant documents from a knowledge base. These documents are injected into the LLM's context window.
- Generation: The LLM reads the retrieved documents alongside the user's question, then generates a new natural-language answer.
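In code, the split looks something like the sketch below. The keyword-overlap retriever and the commented-out `llm.generate` call are illustrative stand-ins, not any particular vendor's implementation; the point is only where the deterministic lookup ends and the probabilistic generation begins.

```python
from collections import Counter
import re

CORPUS = [
    "Refunds are available within 30 days of purchase.",
    "Enterprise plans include single sign-on and audit logs.",
    "API keys can be rotated from the account settings page.",
]

def tokens(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, corpus: list[str], top_k: int = 1) -> list[str]:
    """Phase 1: retrieval. Deterministic -- the same question against the
    same corpus returns the same documents every time."""
    q = tokens(question)
    return sorted(corpus, key=lambda doc: sum((q & tokens(doc)).values()),
                  reverse=True)[:top_k]

def build_prompt(question: str, context: list[str]) -> str:
    """Phase 2 input: retrieved documents are injected into the prompt.
    The LLM call that consumes this prompt is the probabilistic step where
    hallucination can enter, even when the context is perfect."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"

question = "How many days after purchase can I get a refund?"
prompt = build_prompt(question, retrieve(question, CORPUS))
# answer = llm.generate(prompt)   # <- the generative step; hypothetical client
```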
The retrieval phase is deterministic. Given the same question and the same knowledge base, you get the same documents every time. There is no hallucination risk in retrieval itself.
The generation phase is where hallucination enters. The LLM is a probabilistic text generator. It does not "look up" answers - it predicts the next token based on learned statistical patterns. Even with perfect context documents sitting in its prompt, the LLM can:
- Ignore retrieved context and rely on its parametric memory instead
- Misinterpret the context and generate a plausible-sounding but incorrect synthesis
- Merge information from multiple retrieved passages in ways the source material does not support
- Fabricate details that seem consistent with the retrieved context but are not actually present in it
- Over-generalize specific statements into broader claims the source never made
The Mechanistic Explanation
Recent research in mechanistic interpretability confirms this at the neural architecture level. In "ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability" (Sun et al., published at ICLR 2025), researchers traced the internal computations of LLMs during RAG and discovered a specific failure mode:
Knowledge Feed-Forward Networks (FFNs) in the transformer overemphasize the model's parametric knowledge - what it learned during pretraining - in the residual stream. Simultaneously, Copying Heads - the attention components responsible for transferring information from the retrieved context into the output - fail to effectively retain or integrate the external knowledge.
In plain language: the model's internal "memory" overpowers the external documents you fed it. The retrieved context is present in the prompt but gets progressively drowned out by the model's own learned biases as computation flows through the network layers. The model defaults to what it "knows" rather than what you showed it.
The ReDeEP paper does not frame this as a matter of misconfigured hyperparameters. It characterizes the behavior as emerging from how Knowledge FFNs and Copying Heads interact during generation - the model's internal memory systematically overweights parametric knowledge relative to in-context information, and those dynamics are not controllable from the outside.
The Fundamental Issue: Generation Equals Risk
The conclusion that follows from the research is uncomfortable but mathematically unavoidable:
As long as there is a generative step at query time, hallucination risk is non-zero.
It does not matter how good your retrieval is. It does not matter how well-curated your knowledge base is. It does not matter which LLM you use or how carefully you craft your system prompt. The moment you ask a language model to generate text based on retrieved context, you have introduced a probabilistic process that can produce outputs unfaithful to the source material.
This is not a pessimistic claim - it is a mathematical certainty. Language models are stochastic. Their outputs are sampled from probability distributions. Any non-degenerate sampling strategy (temperature > 0) introduces variance, and that variance can manifest as hallucination. Even with temperature set to 0 (greedy decoding), the model's learned biases can still override retrieved context, as the ReDeEP research demonstrates.
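A toy example makes the sampling point concrete. The next-token probabilities below are invented for illustration; the temperature mechanics are the standard ones.

```python
import random

# Hypothetical next-token distribution for "The case was decided in ___".
next_token_probs = {"2019": 0.55, "2021": 0.30, "2017": 0.15}

def sample(probs: dict[str, float], temperature: float) -> str:
    if temperature == 0:
        # Greedy decoding: always the single most probable token.
        # Deterministic, but still dictated entirely by the learned distribution.
        return max(probs, key=probs.get)
    # Temperature rescaling: p_i ** (1/T), then sample (normalization is
    # handled by the weighted choice). T > 0 means variance across runs.
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(list(probs), weights=weights, k=1)[0]

print([sample(next_token_probs, temperature=1.0) for _ in range(10)])  # varies run to run
print([sample(next_token_probs, temperature=0.0) for _ in range(10)])  # always the same token
```

Greedy decoding removes the sampling variance, but the chosen token is still dictated entirely by the model's learned distribution - which is exactly the channel through which parametric knowledge can override retrieved context.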
The industry response has been to layer mitigations on top of the generative step: better prompting, citation verification, confidence scoring, output filtering, fine-tuning for faithfulness. These are valuable engineering efforts. They reduce hallucination rates. But they cannot eliminate hallucination for the same reason you cannot eliminate rounding errors by adding more decimal places - the fundamental architecture produces them.
The Alternative: Remove the Generative Step Entirely
If the generative step is the source of hallucination risk, the logical solution is to remove it.
Instead of retrieving documents and then asking an LLM to generate an answer, what if you retrieved the answer directly? No generation. No synthesis. No probabilistic token prediction. Just: find the most relevant pre-existing documentation and return it verbatim.
This is not a theoretical proposal. It is a working architecture. The approach works as follows, with a minimal code sketch after the list:
- Documentation is pre-processed and indexed before any query arrives.
- When a question comes in, the system performs semantic search to find the most relevant pre-existing content.
- The matched documentation is returned directly to the caller - no LLM generates or rewrites anything at query time.
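Here is that pattern as a minimal, self-contained sketch, using TF-IDF similarity as a stand-in for production semantic search - illustrative, not Ragionex's actual implementation. Note that there is no model call anywhere in the query path.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

DOCS = [
    "Refunds are available within 30 days of purchase with proof of payment.",
    "Enterprise plans include single sign-on, audit logs, and a 99.9% SLA.",
    "API keys can be rotated at any time from the account settings page.",
]

# Pre-processing: the index is built before any query arrives.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(DOCS)

def search(question: str, top_k: int = 1) -> list[str]:
    """Query time: a pure lookup. The best-matching documentation is
    returned verbatim -- nothing is generated, synthesized, or reworded."""
    scores = cosine_similarity(vectorizer.transform([question]), doc_matrix)[0]
    best = scores.argsort()[::-1][:top_k]
    return [DOCS[i] for i in best]

print(search("How do I rotate my API keys?"))
# -> ['API keys can be rotated at any time from the account settings page.']
```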
Hallucination at the retrieval layer is zero by construction - not zero in testing, not zero on a benchmark of 62 questions, but zero as an architectural property. There is no generative component to produce hallucinations. Every response is drawn from pre-existing documentation. Nothing is written, synthesized, or reworded at query time.
The tradeoff is clear: you lose the ability to synthesize novel answers that combine information from multiple sources, or to rephrase documentation in conversational language. What you gain is absolute fidelity to your source material. For use cases where accuracy is non-negotiable - legal, medical, financial, compliance, technical documentation - this tradeoff is not just acceptable, it is the correct engineering decision.
How This Works in Practice
Ragionex is a context engine built on this principle. It sits in the context layer of AI applications: the customer's AI (chatbot, assistant, agent) sends a question, Ragionex retrieves the most relevant pre-processed documentation, and returns it. The customer's AI can then use that context however it needs to.
The key architectural decisions:
- No LLM at query time. Ragionex does not run inference on a language model when responding to search requests. Retrieval is a fast lookup against pre-indexed content.
- Pre-existing documentation only. Every response is sourced from documentation that was processed and stored before the query arrived. Nothing is generated, synthesized, or rephrased on the fly.
- Sub-200ms response times. Without an LLM inference step, responses typically return in under 200ms in our testing. Compare this to the 2-10 second latency typical of RAG systems that run LLM generation.
- Zero runtime AI cost. No tokens are consumed at query time. The economics scale with storage, not with per-query LLM API calls.
This does not replace the customer's AI - it feeds it. The customer's chatbot or assistant still uses an LLM to generate conversational responses. But the context it reasons over comes from verified, pre-existing documentation rather than from another round of generation. If the customer's AI hallucinates, the source material is still there for verification. The context layer itself introduces no hallucination risk.
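A hedged sketch of what that division of labor looks like from the application side is below. The endpoint is the one named above; the request and response field names (`query`, `results`, `content`) and the bearer-token auth are illustrative assumptions, not the documented schema - check the actual API reference before relying on them.

```python
import requests

RAGIONEX_SEARCH_URL = "https://api.ragionex.com/v1/knowledge/search"

def fetch_context(question: str, api_key: str) -> list[str]:
    """Retrieval layer: returns pre-existing documentation verbatim.
    No generation happens on this hop, so it cannot hallucinate."""
    resp = requests.post(
        RAGIONEX_SEARCH_URL,
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        json={"query": question},                        # assumed request shape
        timeout=10,
    )
    resp.raise_for_status()
    return [r["content"] for r in resp.json()["results"]]  # assumed response shape

def answer(question: str, api_key: str) -> str:
    context = fetch_context(question, api_key)
    prompt = "Context:\n" + "\n\n".join(context) + f"\n\nQuestion: {question}"
    # The application's own LLM call would go here; that generative step,
    # not the context layer, is where any remaining hallucination risk lives.
    # return my_llm_client.generate(prompt)   # hypothetical client
    return prompt
```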
When to Use Which Architecture
RAG with LLM generation is appropriate when:
- Conversational, synthesized answers are more important than strict accuracy
- The query domain is broad and unpredictable
- Approximate answers are acceptable
- Users expect natural-language dialogue
Retrieval-only (context engine) architecture is appropriate when:
- Accuracy is non-negotiable (legal, medical, financial, compliance)
- Source fidelity matters more than conversational fluency
- You need auditable, traceable responses
- Response latency and cost must be predictable
- You are building an AI application and need a reliable context source
The right choice depends on your tolerance for error. If a 3% hallucination rate is acceptable for your use case, RAG with generation is a reasonable approach. If your domain penalizes inaccuracy - and most professional domains do - the generative step is a liability you can remove.
Conclusion
RAG was a meaningful advance over vanilla LLM generation. It brought hallucination rates down from the 40-88% range to single digits in the best cases.
But the framing of RAG as a hallucination "solution" has always been misleading. The peer-reviewed evidence is unambiguous: RAG systems hallucinate at rates between 0% and 33% depending on the domain, the corpus quality, the LLM, and the query complexity. The generative step - the LLM producing new text at query time - is the architectural root cause, and no amount of prompt engineering or retrieval optimization can eliminate it.
For applications where accuracy is a requirement rather than a goal, the path forward is to remove the generative step from the retrieval layer entirely. Return pre-existing documentation. Let the customer's AI handle the generation if it wants to. Keep the context layer hallucination-free by design, not by hope.
Note the asymmetry here: zero hallucination at the retrieval layer is an architectural property - it holds by construction under any conditions. A 3% hallucination rate from a well-tuned RAG system is an empirical result measured under controlled test conditions. In production, with unpredictable queries and evolving documentation, empirical rates shift. Architectural properties do not.
Related reading: Why We Don't Call an LLM at Query Time walks through the cost and latency consequences of the same architectural choice, and Why Your RAG Has a 70% Gap Between Best and Worst Answers shows the consistency side. For agent memory built on the same retrieval-only premise, see Persistent Memory Without the Vector Database.
Sources:
- Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C. D., & Ho, D. E. (2025). Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. Journal of Empirical Legal Studies.
- Nishisako, S., Higashi, T., & Wakao, F. (2025). Reducing Hallucinations and Trade-Offs in Responses in Generative AI Chatbots for Cancer Information: Development and Evaluation Study. JMIR Cancer.
- Sun, Z., et al. (2025). ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability. ICLR 2025.
- Dahl, M., Magesh, V., Suzgun, M., & Ho, D. E. (2024). Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models. Journal of Legal Analysis, 16(1), 64-93.