7 RAG API Patterns Most Developers Skip
Three lines of code gets you a working RAG integration. The next seven patterns are what separate a working integration from a reliable one - and most developers skip every single one of them.
You can integrate the Ragionex API - POST https://api.ragionex.com/v1/knowledge/search - in three lines of code. But three lines of code is the beginning, not the end. Most developers get the API working in minutes and then leave significant accuracy on the table by making a handful of avoidable mistakes.
This post covers the seven practices that consistently make the difference between a mediocre knowledge base integration and one that handles real user questions reliably.
1. Ask in English
This is the single most impactful thing you can do: only send English-language queries.
The Ragionex retrieval engine is optimized for English, so non-English queries return less relevant results.
If your users write in other languages, add a translation step before querying Ragionex:
from openai import OpenAI
import requests
openai_client = OpenAI()
def translate_to_english(text):
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "Translate the following to English. Output only the translation, nothing else."},
{"role": "user", "content": text}
]
)
return response.choices[0].message.content.strip()
def query_knowledge_base(user_question_any_language):
english_question = translate_to_english(user_question_any_language)
response = requests.post(
"https://api.ragionex.com/v1/knowledge/search",
headers={"X-API-Key": "YOUR_API_KEY"},
json={"question": english_question, "results": 10, "collection": "vscode-docs"},
timeout=10
)
return response.json()
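A quick usage sketch - the Spanish question below is only an example, and the success check mirrors the error handling covered later in this post:
# Spanish input - translated to English before it reaches the API
data = query_knowledge_base("¿Cómo configuro el depurador en VS Code?")
if data["success"]:
    for result in data["results"]:
        print(result["source"])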
Your LLM can then respond to the user in their original language while using English-retrieved context. The translation overhead is minimal compared to the accuracy gain.
2. Ask One Topic Per Question
The most common mistake developers make: sending compound questions to the API.
The problem: "How do I format code, configure the debugger, and set up Git integration in VS Code?" is three separate questions. The retrieval engine finds the best match for each topic, but it has to choose - it cannot simultaneously serve all three topics with equal precision in a single result set.
The wrong pattern:
# Bad: one compound question - retrieval accuracy suffers
response = requests.post(url, headers=headers, json={
"question": "How do I format code, configure the debugger, and set up Git integration?",
"results": 10,
"collection": "vscode-docs"
})
The right pattern:
# Good: three focused questions - each retrieval is precise
topics = [
"How to format code?",
"How to configure the debugger?",
"How to set up Git integration?"
]
results = []
for topic in topics:
r = requests.post(
url,
headers=headers,
json={"question": topic, "results": 5, "collection": "vscode-docs"},
timeout=10
)
r.raise_for_status()
results.extend(r.json()["results"])
Three focused queries with results: 5 each give you 15 highly relevant results - more useful context than 10 mixed results from a compound query.
This principle also applies to your UI layer. If a user asks a compound question, decompose it before querying. A simple LLM call to extract sub-questions adds a few hundred milliseconds but dramatically improves result quality for complex requests.
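If you want to automate that decomposition, a minimal sketch along these lines works - the model choice and prompt wording here are assumptions, not requirements:
import json
from openai import OpenAI
client = OpenAI()
def split_into_subquestions(user_question):
    # Ask a small model to return the sub-questions as a bare JSON array of strings
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Split the user's message into standalone single-topic questions. Respond with a JSON array of strings and nothing else."},
            {"role": "user", "content": user_question}
        ]
    )
    return json.loads(response.choices[0].message.content)
# Each sub-question then becomes its own focused query, as in the loop above
topics = split_into_subquestions("How do I format code, configure the debugger, and set up Git integration?")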
3. Calibrate the results Parameter
The results parameter defaults to 10. This default exists because 10 is the right number for most LLM integrations. Do not lower it without understanding the tradeoff.
Guidelines by use case:
| Use case | Recommended results | Why |
|---|---|---|
| LLM integration (accuracy-first) | 10 | More context = LLM has more to reason over |
| Real-time chat (latency-sensitive) | 5 | Faster, still good coverage |
| Simple factual lookup | 3-5 | If you only need the top answer |
| Complex multi-part topics | 15-20 | Broader context for deep questions |
What you should never do: Use results: 1 hoping to get "the answer." Retrieval is probabilistic - the top result is the most semantically similar, but the second and third results frequently contain the supporting context that makes an answer complete. Give your LLM enough material to reason over.
Token budget consideration: Each result contains the full documentation content for that topic. At results: 10, you are typically passing 2,000-8,000 tokens of context to your LLM. If you are hitting context limits, reduce to 5-7 rather than going below 5.
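If you are close to a context limit, you can also trim the retrieved results to a budget before building the prompt. This sketch uses a rough four-characters-per-token estimate rather than a real tokenizer, and drops from the tail because results come back ordered by semantic similarity:
def trim_to_token_budget(results, max_tokens=6000):
    # Rough heuristic: ~4 characters per token for English documentation text
    kept, used = [], 0
    for result in results:
        estimated_tokens = len(result["answer"]) // 4
        if used + estimated_tokens > max_tokens:
            break
        kept.append(result)
        used += estimated_tokens
    return kept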
4. Filter Media Description Blocks Before Display
Some answers include extra prose inside <media-description> blocks. These describe what's visible in screenshots and videos - useful context when feeding the answer to your LLM, but not what you want appearing in your user-facing UI.
The rule: keep them in the LLM context. Strip them before rendering for users.
# For LLM context - keep as-is
llm_context = result['answer']
# For user display - filter out the description blocks
clean_answer = filter_media_blocks(result['answer'])
When the LLM gets the full answer content with these blocks, it can answer questions about what's shown in screenshots and videos. When the user sees the answer, the image renders normally without descriptive prose appearing alongside it.
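A minimal filter_media_blocks can be a single regex substitution. This sketch assumes the blocks are delimited exactly by <media-description> and </media-description> tags in the raw answer text:
import re
MEDIA_BLOCK_RE = re.compile(r"<media-description>.*?</media-description>", re.DOTALL)
def filter_media_blocks(answer):
    # Drop the description blocks, then collapse the blank lines they leave behind
    cleaned = MEDIA_BLOCK_RE.sub("", answer)
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()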
5. Build Your LLM Integration Correctly
The integration pattern that consistently produces the best results in production:
import requests
from openai import OpenAI
client = OpenAI()
SYSTEM_TEMPLATE = """You are a helpful assistant. Answer questions based ONLY on the context provided below.
If the context does not contain enough information to answer the question, say "I don't have information about that in the documentation."
When referencing specific information, cite the source URL from the context.
Context:
{context}"""
def ask(question, collection="vscode-docs"):
# Step 1: Retrieve context
resp = requests.post(
"https://api.ragionex.com/v1/knowledge/search",
headers={"X-API-Key": "YOUR_API_KEY"},
json={"question": question, "results": 10, "collection": collection},
timeout=10
)
resp.raise_for_status()
data = resp.json()
if not data["success"]:
return f"Retrieval error: {data.get('error', 'unknown')}"
# Step 2: Build context with sources (keep media descriptions for LLM)
context = "\n\n---\n\n".join(
f"[Source: {r['source']}]\n{r['answer']}"
for r in data["results"]
)
# Step 3: Generate answer
completion = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": SYSTEM_TEMPLATE.format(context=context)},
{"role": "user", "content": question}
]
)
return completion.choices[0].message.content
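Using it is then a single call (the question is illustrative):
print(ask("How do I enable format on save?"))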
What makes this pattern work:
"Answer based ONLY on this context"- constrains the LLM to the verified source material. Without this instruction, the LLM blends retrieved context with its parametric memory, which reintroduces hallucination risk.- Source URLs included in context - the LLM can cite them naturally in its response without extra logic on your side.
- Explicit
successcheck - prevents silently passing empty results to the LLM. results: 10- gives the LLM enough material to reason over, especially for multi-part answers.
6. Handle Errors Explicitly
The API returns a success boolean in every response. Always check it before accessing the results array - do not assume success.
data = response.json()
if not data["success"]:
error = data.get("error", "unknown error")
# Handle error - do not proceed to LLM
return f"Could not retrieve context: {error}"
results = data["results"]
Common error scenarios and responses:
| HTTP Status | Cause | Action |
|---|---|---|
| 401 | Invalid API key | Check the X-API-Key header value |
| 422 | Invalid parameters | Question empty or over 128 chars, results outside 1-50, or collection over 64 chars |
| 429 | Rate limit exceeded (30/min per IP) | Implement exponential backoff |
| 503 | Service warming up | Retry with exponential backoff |
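For the 429 and 503 cases, a small retry wrapper is usually enough. The delay values here are illustrative, not prescribed by the API:
import time
import requests
def search_with_retry(payload, api_key, max_attempts=4):
    delay = 1.0  # initial backoff in seconds, doubled on each retry
    for _ in range(max_attempts):
        response = requests.post(
            "https://api.ragionex.com/v1/knowledge/search",
            headers={"X-API-Key": api_key},
            json=payload,
            timeout=10
        )
        if response.status_code not in (429, 503):
            response.raise_for_status()
            return response.json()
        # Rate limited or still warming up - back off and try again
        time.sleep(delay)
        delay *= 2
    response.raise_for_status()  # all attempts hit 429/503 - surface the last error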
Rate limiting pattern for batch use:
The API allows 30 requests per minute per IP. For real-time applications this is rarely a constraint. For batch processing, implement a simple token bucket:
import time
import requests
from collections import deque
class RateLimitedClient:
def __init__(self, requests_per_minute=25): # Leave headroom below the 30/min limit
self.rpm = requests_per_minute
self.timestamps = deque()
def search(self, question, collection, results=10):
now = time.time()
# Remove timestamps older than 60 seconds
while self.timestamps and now - self.timestamps[0] > 60:
self.timestamps.popleft()
# Wait if at limit
if len(self.timestamps) >= self.rpm:
sleep_time = 60 - (now - self.timestamps[0])
if sleep_time > 0:
time.sleep(sleep_time)
self.timestamps.append(time.time())
response = requests.post(
"https://api.ragionex.com/v1/knowledge/search",
headers={"X-API-Key": "YOUR_API_KEY"},
json={"question": question, "results": results, "collection": collection},
timeout=10
)
response.raise_for_status()
return response.json()
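A batch job then just loops over the client - the questions here are placeholders:
client = RateLimitedClient()
questions = ["How to format code?", "How to configure the debugger?"]
for q in questions:
    data = client.search(q, collection="vscode-docs")
    if data["success"]:
        print(f"{q} -> {len(data['results'])} results")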
7. Use the source and id Fields
Every result includes two fields that most integrations ignore.
source - the URL of the original source document. Surface this to your users. An answer with a clickable source link is trusted more than one that appears from nowhere. It also gives users the full document context when they need more detail than the retrieved result provides. Note: the URL points to the raw source file - if you want to show a rendered documentation page, transform it accordingly before displaying.
for result in data["results"]:
    answer = filter_media_blocks(result['answer'])  # strip <media-description> blocks before display, as in pattern 4
source = result['source']
print(f"Answer: {answer[:300]}...")
print(f"Read more: {source}")
print("---")
id - a stable identifier for each result. Log it alongside every query. When a user reports an inaccurate answer, the id lets you look up exactly which result was returned and why - without having to reproduce the query and guess which result caused the problem.
import logging
logger = logging.getLogger(__name__)
def search_and_log(question, collection):
data = requests.post(...).json()
if data["success"]:
result_ids = [r["id"] for r in data["results"]]
logger.info("query=%r collection=%s retrieved=%s", question, collection, result_ids)
return data
This makes debugging bad answers tractable at scale. "Result ID X7KQ2P came back for this query" is a solvable problem. "The answer was wrong" without any identifiers is not.
Summary
| Practice | Impact |
|---|---|
| Query in English | High - non-English queries reduce semantic similarity scores |
| One topic per question | High - compound questions dilute result relevance |
| results: 10 for LLM integration | Medium - more context enables better LLM reasoning |
| Keep <media-description> for LLM | High - enables visual Q&A from images and videos |
| Strip <media-description> for display | Required - prevents raw prose appearing in your UI |
| Check success field | Required - silent failures lead to confusing LLM responses |
| Log id field | Medium - essential for debugging incorrect answers |
These practices take ten minutes to implement. The ones that matter most - handling media descriptions correctly and asking focused single-topic questions - are invisible in simple demos but show up immediately when real users ask real questions.
The Developer Preview API is live with a VS Code documentation knowledge base. Try it free at ragionex.com - the API key is on the homepage.
Related reading: for the architectural rationale behind these patterns, see Why We Don't Call an LLM at Query Time. Building agents that also need persistent memory across sessions? Add Persistent Memory to Cursor, Claude Code, and Windsurf in 3 Lines covers the same integration shape for the Memory product.
Have questions? Join the Ragionex Discord or email [email protected].