Why We Don't Call an LLM at Query Time (And Why You Shouldn't Either)
Most RAG pipelines call an LLM on every single user query. We stopped doing that at Ragionex - POST https://api.ragionex.com/v1/knowledge/search returns pre-processed passages with no generation in the hot path - and the system got faster, cheaper, more reliable, and more accurate simultaneously.
For an entire class of problems - documentation retrieval, compliance search, customer support knowledge bases - calling an LLM at query time is not just wasteful. It is the wrong architectural choice for the problem.
Here is the case for front-loading your intelligence.
The Runtime LLM Tax Nobody Talks About
Every time your RAG system calls an LLM to process a user query, you pay four costs simultaneously. Most teams only notice the first one.
Latency. Frontier LLM time-to-first-token ranges from 500ms to 2 seconds depending on the model and provider. Full response generation for a typical RAG synthesis adds 2-8 seconds on top. Real-world RAG systems typically respond in a few seconds under normal conditions, with complex queries or heavy load extending that to a few minutes. Your users are waiting for an LLM to rephrase information that already exists in a perfectly good document.
Cost. Frontier LLM APIs charge $2.50-$15.00 per million input tokens and $10.00-$25.00 per million output tokens depending on the model tier. A typical RAG query sends 1,500-3,000 tokens of retrieved context plus the question, and gets back 200-500 tokens. At scale - say 100,000 queries per month - that works out to roughly $1,000 a month at the low end of those prices, and several thousand at the high end, just for the generation step. For what? To rephrase a paragraph the user could have read directly.
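To make that concrete, here is the back-of-the-envelope arithmetic using the price and token ranges above; the per-query token counts are illustrative midpoints, not measurements from any particular system:

```python
# Rough monthly generation cost for a RAG pipeline, using the ranges quoted above.
QUERIES_PER_MONTH = 100_000
INPUT_TOKENS_PER_QUERY = 2_000   # retrieved context + question (midpoint of 1,500-3,000)
OUTPUT_TOKENS_PER_QUERY = 350    # generated answer (midpoint of 200-500)

def monthly_generation_cost(input_price_per_m: float, output_price_per_m: float) -> float:
    input_cost = QUERIES_PER_MONTH * INPUT_TOKENS_PER_QUERY / 1_000_000 * input_price_per_m
    output_cost = QUERIES_PER_MONTH * OUTPUT_TOKENS_PER_QUERY / 1_000_000 * output_price_per_m
    return input_cost + output_cost

print(f"Cheapest tier: ${monthly_generation_cost(2.50, 10.00):,.0f}/month")   # ~$850
print(f"Frontier tier: ${monthly_generation_cost(15.00, 25.00):,.0f}/month")  # ~$3,875
```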
Inconsistency. Ask the same LLM the same question with the same context three times. You will get three different answers. Temperature, sampling, and the inherent stochasticity of autoregressive generation mean your system is non-deterministic by design. For documentation retrieval - where the answer is a known, fixed piece of text - this is a defect, not a feature.
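A toy illustration of why this is structural rather than a tuning problem - temperature sampling over near-tied candidates picks differently on every run, while a pre-compiled lookup cannot vary (toy numbers, not a real model):

```python
import math
import random

def sample_token(logits: list[float], temperature: float = 0.8) -> int:
    """Temperature sampling: the same logits can yield a different token on every call."""
    weights = [math.exp(l / temperature) for l in logits]
    probs = [w / sum(weights) for w in weights]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.9, 0.4]                         # three near-tied next-token candidates
print([sample_token(logits) for _ in range(6)])  # varies from run to run, by design

knowledge_base = {"refund window": "Refunds are accepted within 30 days of purchase."}
print(knowledge_base["refund window"])           # identical on every call, by construction
```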
Hallucination. Even with retrieved context sitting right there in the prompt, LLMs hallucinate. They interpolate. They "helpfully" add information that was not in the source material. In compliance, legal, and technical documentation use cases, a single hallucinated detail can be catastrophic. You are paying money to introduce a failure mode that did not need to exist.
The Mental Model Shift: Compile, Don't Interpret
Here is an analogy that makes the architecture choice obvious.
Traditional RAG works like an interpreted language. Every time a user asks a question, the system reads the query, retrieves some passages, hands everything to an LLM, and the LLM "interprets" the context into a response in real time. Every query is processed from scratch. Every query pays full computational cost.
A Context Engine works like a compiled language. All the AI work happens once, at build time. The output is a compiled knowledge base, optimized for instant retrieval. At query time, there is no interpretation step. The system returns pre-existing answers at machine speed.
The compiled/interpreted distinction is fundamental in software engineering for a reason. Compiled programs run faster because the translation work is done ahead of time. Interpreted programs offer flexibility at the cost of runtime performance. The same tradeoff applies to knowledge retrieval systems.
The question is: does your documentation change its answers based on who is asking? No. The content is static. The knowledge is fixed. So why are you re-interpreting it on every single query?
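A deliberately toy sketch of the compiled shape, assuming nothing beyond the standard library - in a real pipeline the build step is where the models run (chunking, enrichment, question generation), and the matching is far more capable than fuzzy string comparison:

```python
from difflib import get_close_matches

# Build time ("compile"): all the expensive AI work happens here, once, offline.
# A plain dict of normalized questions -> passages stands in for the compiled knowledge base.
compiled_kb = {
    "how do i rotate an api key": "Rotate keys from Settings > API Keys. Old keys stay valid for 24 hours.",
    "what is the refund window": "Refunds are accepted within 30 days of purchase.",
}

# Query time ("interpretation" removed): no model in the loop, just a fast lookup.
# An interpreted pipeline would instead embed the query, search, and call an LLM right here.
def answer(query: str) -> str | None:
    normalized = query.lower().strip(" ?")
    match = get_close_matches(normalized, compiled_kb.keys(), n=1, cutoff=0.4)
    return compiled_kb[match[0]] if match else None

print(answer("How do I rotate an API key?"))   # exact phrasing
print(answer("what's the refund window?"))     # paraphrased phrasing, same answer
```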
What We Built Instead
At Ragionex, we took this philosophy to its logical conclusion. Running AI on every query is what makes RAG slow and unpredictable. We moved that work to build time. Documents go in. A queryable knowledge base comes out - and queries skip the AI entirely.
The numbers speak for themselves:
- Response time: under 200 milliseconds. Not seconds. Milliseconds. While traditional RAG systems spend 80%+ of their latency budget on LLM generation, we skip that step entirely. Our latency is dominated by a single fast lookup.
- Per-query AI cost: zero. There is no API call, no token metering, no per-query charge. One small flat monthly subscription - your bill stays the same whether you run 100 queries or 1 million.
- Deterministic results. Same question, same answer, every time. No temperature variance. No sampling randomness. No "creative" reinterpretation of your compliance documentation. If the answer is in the knowledge base, you get that answer. Verbatim. Reliably.
- Understands what you mean, not just what you type. Real users phrase things badly. They mistype. They paraphrase. Standard RAG often fails when the question doesn't match the documentation wording. Ours doesn't. Same answer comes back, regardless of how the question is asked.
The Latency Breakdown That Should Embarrass Traditional RAG
Let's look at where time actually goes in a typical RAG pipeline, based on published benchmarks and typical observed ranges:
| Component | Traditional RAG | Context Engine |
|---|---|---|
| Embedding, retrieval, reranking | varies by stack | < 200ms total |
| LLM generation | 1,500ms - 8,000ms+ | 0ms |
| Total | 1,500ms - 8,145ms+ | < 200ms |
The generation step is not a minor contributor. It is the dominant cost. Profiling a typical pipeline shows the same thing as the table above: the LLM call - prefilling thousands of tokens of retrieved context, then decoding the answer token by token - accounts for the large majority of end-to-end latency. Remove the generation step, and you are left with a system that responds faster than most websites load.
For voice-based AI assistants - an increasingly critical use case - this difference is the gap between a natural conversation and an awkward pause. For high-throughput API services, it is the difference between a $200/month server and a $20,000/month GPU cluster.
What You Lose (And Why It Usually Doesn't Matter)
I am not going to pretend this is free. Removing the LLM from query time means you lose three things:
1. Generative summarization. A Context Engine returns the relevant pre-processed source content. It does not synthesize a novel summary from multiple sources. If your use case requires combining information from five different documents into one coherent paragraph, you need a generation step.
2. Conversational flexibility. An LLM can rephrase, adjust tone, translate on the fly, and handle follow-up questions with context. A retrieval-only system returns what it has. It does not adapt its communication style to the user.
3. Creative reasoning. If you need the system to make inferences, draw analogies, or reason about information that is not explicitly stated in the source material, you need an LLM.
For the vast majority of enterprise knowledge retrieval use cases, none of these losses matter. Documentation search does not need creative reasoning. Compliance lookups do not need conversational flexibility. Customer support knowledge bases do not need to synthesize novel summaries - they need to find the right answer, fast, and return it verbatim.
The industry has been so intoxicated by the capabilities of LLMs that it has forgotten a basic engineering principle: use the simplest tool that solves the problem. If your problem is "find the right document section for this question," an LLM is not the simplest tool. It is the most expensive, slowest, least reliable tool you could possibly choose.
When the Trade-Off Is Worth It
Remove the LLM from query time when:
- Documentation retrieval - Technical docs, product manuals, API references. The answer exists. Find it.
- Compliance and legal - Regulations, policies, contracts. You need the exact text, not a paraphrase. Hallucination is not an acceptable risk.
- Customer support knowledge bases - FAQ answers, troubleshooting steps, how-to guides. The answers are written by humans for humans. An LLM adding its own spin is a liability, not a feature.
- Internal knowledge management - Company wikis, process documentation, onboarding materials. Employees need the canonical answer, not a generated approximation.
- Any domain where accuracy matters more than creativity - Medical information retrieval, financial data lookup, educational content delivery.
Keep the LLM at query time when:
- You genuinely need to synthesize information across multiple unrelated sources
- The user interaction is conversational and requires multi-turn context
- Creative generation (writing, brainstorming, coding) is the primary use case
- The answer does not exist in any document and must be reasoned about
The Hybrid Architecture: Best of Both Worlds
Here is what sophisticated teams are starting to figure out: you do not have to choose one or the other. You can use a Context Engine for retrieval and feed its results to an LLM only when you genuinely need generation.
User question → Context Engine (< 200ms, $0) → results returned to the calling AI system → the calling AI decides: sufficient context? Return it directly. Needs synthesis? Hand it to an LLM for generation.
This is exactly how Ragionex is designed to work. We sit in the context layer of your AI application. Your AI sends us a question. We return the most relevant, pre-processed knowledge passages in under 200 milliseconds. Your AI uses that context however it needs to - directly, through an LLM, or through any other processing step. We never generate. We never hallucinate. We provide ground truth.
The LLM call becomes optional and targeted, not mandatory and universal. You pay for generation only when generation adds genuine value. For a typical support bot, that might be 20% of queries instead of 100% - an immediate 80% reduction in LLM costs and latency.
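A minimal sketch of that routing decision under illustrative assumptions - the `Passage` shape, the confidence threshold, and the injected search/generate callables are placeholders for your retrieval layer and LLM client, not Ragionex's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Passage:
    text: str
    score: float  # retrieval confidence reported by the context layer

def route(query: str,
          search: Callable[[str], list[Passage]],
          generate: Callable[[str], str],
          confidence_floor: float = 0.85) -> str:
    """Retrieval-first routing: only call the LLM when synthesis genuinely adds value."""
    passages = search(query)
    if not passages:
        return "No answer found in the knowledge base."
    # One confident hit: return the pre-processed passage verbatim - no LLM, no added latency or cost.
    if len(passages) == 1 and passages[0].score >= confidence_floor:
        return passages[0].text
    # Several relevant passages: synthesis is genuinely useful, so pay for generation this one time.
    context = "\n\n".join(p.text for p in passages)
    return generate(f"Answer using only this context:\n\n{context}\n\nQuestion: {query}")

# Toy wiring so the sketch runs end to end; swap in your real retrieval call and LLM client.
if __name__ == "__main__":
    kb = [Passage("Refunds are accepted within 30 days of purchase.", 0.92)]
    print(route("What is the refund window?", lambda q: kb, lambda prompt: "(LLM answer)"))
```

The exact condition for "sufficient context" is yours to define - a confidence threshold, a passage count, a cheap classifier - but the shape stays the same: retrieval always, generation only when it earns its cost.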
The Industry Is Starting to Get It
We are not alone in this thinking. The emerging discipline of "Context Engineering" - endorsed by Anthropic and Andrej Karpathy, and tracked by Gartner - is fundamentally about this same insight: the intelligence should be in how you prepare and structure context, not in how you process every individual query.
CompactRAG, a recent research project, demonstrated the same principle: decomposing the reasoning process into an offline preprocessing stage that constructs a structured knowledge base, paired with a lightweight online stage that avoids redundant LLM calls. The results showed reduced token consumption, fewer API calls, and competitive answer quality.
This is not a fringe idea. It is where the industry is heading. The question is whether you get there proactively - by design - or reactively, after your LLM bills and latency complaints force the issue.
The Bottom Line
Calling an LLM on every query is the default because it is easy, not because it is right. For knowledge retrieval use cases, it is an expensive, slow, unreliable choice that introduces failure modes (hallucination, inconsistency) into a problem that does not inherently have them.
The right architecture front-loads the intelligence: do the hard AI work once at build time, then serve results instantly, deterministically, and at zero marginal cost per query.
That is what a Context Engine does. That is what we built at Ragionex. And if your use case is documentation retrieval, compliance, customer support, or any domain where the answers already exist and just need to be found - you should build this way too.
Try it yourself. Ragionex offers a free API key for the Developer Preview. Sub-200ms responses. Zero per-query AI cost. Deterministic results. See what retrieval feels like when you stop waiting for an LLM to rephrase your own documentation back to you. Get started at ragionex.com.
Related reading: Why Your RAG Has a 70% Gap Between Best and Worst Answers goes deeper on the consistency consequences of putting an LLM in the hot path. For the agent-side argument - why agent memory should be retrieval, not LLM summarization - see Your Agent Doesn't Have a Memory Problem. It Has a Retrieval Problem.