7 RAG API Patterns Most Developers Skip
Three lines of code gets you a working RAG integration. The next seven patterns are what separate a working integration from a reliable one - and most developers skip every single one of them.
You can integrate the Ragionex API - POST https://api.ragionex.com/v1/knowledge/search - in three lines of code. But three lines of code is the beginning, not the end. Most developers get the API working in minutes and then leave significant accuracy on the table by making a handful of avoidable mistakes.
This post covers the seven practices that consistently make the difference between a mediocre knowledge base integration and one that handles real user questions reliably.
1. Ask in English
This is the single most impactful thing you can do: only send English-language queries.
The Ragionex retrieval engine is optimized for English, so non-English queries return less relevant results.
If your users write in other languages, add a translation step before querying Ragionex:
from openai import OpenAI
import requests
openai_client = OpenAI()
def translate_to_english(text):
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "Translate the following to English. Output only the translation, nothing else."},
{"role": "user", "content": text}
]
)
return response.choices[0].message.content.strip()
def query_knowledge_base(user_question_any_language):
english_question = translate_to_english(user_question_any_language)
response = requests.post(
"https://api.ragionex.com/v1/knowledge/search",
headers={"X-API-Key": "YOUR_API_KEY"},
json={"question": english_question, "results": 10, "collection": "vscode-docs"},
timeout=10
)
return response.json()
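A quick usage sketch - the Spanish question below is only an example, and the success check mirrors the error handling covered later in this post:
# Spanish input - translated to English before it reaches the API
data = query_knowledge_base("¿Cómo configuro el depurador en VS Code?")
if data["success"]:
    for result in data["results"]:
        print(result["source"])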
Your LLM can then respond to the user in their original language while using English-retrieved context. The translation overhead is minimal compared to the accuracy gain.
2. Ask One Topic Per Question
The most common mistake developers make: sending compound questions to the API.
The problem: "How do I format code, configure the debugger, and set up Git integration in VS Code?" is three separate questions. The retrieval engine finds the best match for each topic, but it has to choose - it cannot simultaneously serve all three topics with equal precision in a single result set.
The wrong pattern:
# Bad: one compound question - retrieval accuracy suffers
response = requests.post(url, headers=headers, json={
"question": "How do I format code, configure the debugger, and set up Git integration?",
"results": 10,
"collection": "vscode-docs"
})
The right pattern:
# Good: three focused questions - each retrieval is precise
topics = [
"How to format code?",
"How to configure the debugger?",
"How to set up Git integration?"
]
results = []
for topic in topics:
r = requests.post(
url,
headers=headers,
json={"question": topic, "results": 5, "collection": "vscode-docs"},
timeout=10
)
r.raise_for_status()
results.extend(r.json()["results"])
Three focused queries with results: 5 each give you 15 highly relevant results - more useful context than 10 mixed results from a compound query.
This principle also applies to your UI layer. If a user asks a compound question, decompose it before querying. A simple LLM call to extract sub-questions adds a few hundred milliseconds but dramatically improves result quality for complex requests.
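If you want to automate that decomposition, a minimal sketch along these lines works - the model choice and prompt wording here are assumptions, not requirements:
import json
from openai import OpenAI
client = OpenAI()
def split_into_subquestions(user_question):
    # Ask a small model to return the sub-questions as a bare JSON array of strings
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Split the user's message into standalone single-topic questions. Respond with a JSON array of strings and nothing else."},
            {"role": "user", "content": user_question}
        ]
    )
    return json.loads(response.choices[0].message.content)
# Each sub-question then becomes its own focused query, as in the loop above
topics = split_into_subquestions("How do I format code, configure the debugger, and set up Git integration?")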
3. Calibrate the results Parameter
The results parameter defaults to 10. This default exists because 10 is the right number for most LLM integrations. Do not lower it without understanding the tradeoff.
Guidelines by use case:
| Use case | Recommended results | Why |
|---|---|---|
| LLM integration (accuracy-first) | 10 | More context = LLM has more to reason over |
| Real-time chat (latency-sensitive) | 5 | Faster, still good coverage |
| Simple factual lookup | 3-5 | If you only need the top answer |
| Complex multi-part topics | 15-20 | Broader context for deep questions |
What you should never do: Use results: 1 hoping to get "the answer." Retrieval is probabilistic - the top result is the most semantically similar, but the second and third results frequently contain the supporting context that makes an answer complete. Give your LLM enough material to reason over.
Token budget consideration: Each result contains the full documentation content for that topic. At results: 10, you are typically passing 2,000-8,000 tokens of context to your LLM. If you are hitting context limits, reduce to 5-7 rather than going below 5.
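If you are close to a context limit, you can also trim the retrieved results to a budget before building the prompt. This sketch uses a rough four-characters-per-token estimate rather than a real tokenizer, and drops from the tail because results come back ordered by semantic similarity:
def trim_to_token_budget(results, max_tokens=6000):
    # Rough heuristic: ~4 characters per token for English documentation text
    kept, used = [], 0
    for result in results:
        estimated_tokens = len(result["answer"]) // 4
        if used + estimated_tokens > max_tokens:
            break
        kept.append(result)
        used += estimated_tokens
    return kept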
4. Filter Media Description Blocks Before Display
Some answers include extra prose inside <media-description> blocks. These describe what's visible in screenshots and videos - useful context when feeding the answer to your LLM, but not what you want appearing in your user-facing UI.
The rule: keep them in the LLM context. Strip them before rendering for users.
# For LLM context - keep as-is
llm_context = result['answer']
# For user display - filter out the description blocks
clean_answer = filter_media_blocks(result['answer'])
When the LLM gets the full answer content with these blocks, it can answer questions about what's shown in screenshots and videos. When the user sees the answer, the image renders normally without descriptive prose appearing alongside it.
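A minimal filter_media_blocks can be a single regex substitution. This sketch assumes the blocks are delimited exactly by <media-description> and </media-description> tags in the raw answer text:
import re
MEDIA_BLOCK_RE = re.compile(r"<media-description>.*?</media-description>", re.DOTALL)
def filter_media_blocks(answer):
    # Drop the description blocks, then collapse the blank lines they leave behind
    cleaned = MEDIA_BLOCK_RE.sub("", answer)
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()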
5. Build Your LLM Integration Correctly
The integration pattern that consistently produces the best results in production:
import requests
from openai import OpenAI
client = OpenAI()
SYSTEM_TEMPLATE = """You are a helpful assistant. Answer questions based ONLY on the context provided below.
If the context does not contain enough information to answer the question, say "I don't have information about that in the documentation."
When referencing specific information, cite the source URL from the context.
Context:
{context}"""
def ask(question, collection="vscode-docs"):
# Step 1: Retrieve context
resp = requests.post(
"https://api.ragionex.com/v1/knowledge/search",
headers={"X-API-Key": "YOUR_API_KEY"},
json={"question": question, "results": 10, "collection": collection},
timeout=10
)
resp.raise_for_status()
data = resp.json()
if not data["success"]:
return f"Retrieval error: {data.get('error', 'unknown')}"
# Step 2: Build context with sources (keep media descriptions for LLM)
context = "\n\n---\n\n".join(
f"[Source: {r['source']}]\n{r['answer']}"
for r in data["results"]
)
# Step 3: Generate answer
completion = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": SYSTEM_TEMPLATE.format(context=context)},
{"role": "user", "content": question}
]
)
return completion.choices[0].message.content
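Using it is then a single call (the question is illustrative):
print(ask("How do I enable format on save?"))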
What makes this pattern work:
"Answer based ONLY on this context"- constrains the LLM to the verified source material. Without this instruction, the LLM blends retrieved context with its parametric memory, which reintroduces hallucination risk.- Source URLs included in context - the LLM can cite them naturally in its response without extra logic on your side.
- Explicit
successcheck - prevents silently passing empty results to the LLM. results: 10- gives the LLM enough material to reason over, especially for multi-part answers.
6. Handle Errors Explicitly
The API returns a success boolean in every response. Always check it before accessing the results array - do not assume success.
data = response.json()
if not data["success"]:
error = data.get("error", "unknown error")
# Handle error - do not proceed to LLM
return f"Could not retrieve context: {error}"
results = data["results"]
Common error scenarios and responses:
| HTTP Status | Cause | Action |
|---|---|---|
| 401 | Invalid API key | Check the X-API-Key header value |
| 422 | Invalid parameters | Question empty or over 128 chars, results outside 1-50, or collection over 64 chars |
| 429 | Rate limit exceeded (30/min per IP) | Implement exponential backoff |
| 503 | Service warming up | Retry with exponential backoff |
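For the 429 and 503 cases, a small retry wrapper is usually enough. The delay values here are illustrative, not prescribed by the API:
import time
import requests
def search_with_retry(payload, api_key, max_attempts=4):
    delay = 1.0  # initial backoff in seconds, doubled on each retry
    for _ in range(max_attempts):
        response = requests.post(
            "https://api.ragionex.com/v1/knowledge/search",
            headers={"X-API-Key": api_key},
            json=payload,
            timeout=10
        )
        if response.status_code not in (429, 503):
            response.raise_for_status()
            return response.json()
        # Rate limited or still warming up - back off and try again
        time.sleep(delay)
        delay *= 2
    response.raise_for_status()  # all attempts hit 429/503 - surface the last error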
Rate limiting pattern for batch use:
The API allows 30 requests per minute per IP. For real-time applications this is rarely a constraint. For batch processing, implement a simple token bucket:
import time
import requests
from collections import deque
class RateLimitedClient:
def __init__(self, requests_per_minute=25): # Leave headroom below the 30/min limit
self.rpm = requests_per_minute
self.timestamps = deque()
def search(self, question, collection, results=10):
now = time.time()
# Remove timestamps older than 60 seconds
while self.timestamps and now - self.timestamps[0] > 60:
self.timestamps.popleft()
# Wait if at limit
if len(self.timestamps) >= self.rpm:
sleep_time = 60 - (now - self.timestamps[0])
if sleep_time > 0:
time.sleep(sleep_time)
self.timestamps.append(time.time())
response = requests.post(
"https://api.ragionex.com/v1/knowledge/search",
headers={"X-API-Key": "YOUR_API_KEY"},
json={"question": question, "results": results, "collection": collection},
timeout=10
)
response.raise_for_status()
return response.json()
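A batch job then just loops over the client - the questions here are placeholders:
client = RateLimitedClient()
questions = ["How to format code?", "How to configure the debugger?"]
for q in questions:
    data = client.search(q, collection="vscode-docs")
    if data["success"]:
        print(f"{q} -> {len(data['results'])} results")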
7. Use the source and id Fields
Every result includes two fields that most integrations ignore.
source - the URL of the original source document. Surface this to your users. An answer with a clickable source link is trusted more than one that appears from nowhere. It also gives users the full document context when they need more detail than the retrieved result provides. Note: the URL points to the raw source file - if you want to show a rendered documentation page, transform it accordingly before displaying.
for result in data["results"]:
    answer = filter_media_blocks(result['answer'])  # strip <media-description> blocks before display, as in pattern 4
source = result['source']
print(f"Answer: {answer[:300]}...")
print(f"Read more: {source}")
print("---")
id - a stable identifier for each result. Log it alongside every query. When a user reports an inaccurate answer, the id lets you look up exactly which result was returned and why - without having to reproduce the query and guess which result caused the problem.
import logging
logger = logging.getLogger(__name__)
def search_and_log(question, collection):
data = requests.post(...).json()
if data["success"]:
result_ids = [r["id"] for r in data["results"]]
logger.info("query=%r collection=%s retrieved=%s", question, collection, result_ids)
return data
This makes debugging bad answers tractable at scale. "Result ID X7KQ2P came back for this query" is a solvable problem. "The answer was wrong" without any identifiers is not.
Summary
| Practice | Impact |
|---|---|
| Query in English | High - non-English queries reduce semantic similarity scores |
| One topic per question | High - compound questions dilute result relevance |
| results: 10 for LLM integration | Medium - more context enables better LLM reasoning |
| Keep <media-description> for LLM | High - enables visual Q&A from images and videos |
| Strip <media-description> for display | Required - prevents raw prose appearing in your UI |
| Check success field | Required - silent failures lead to confusing LLM responses |
| Log id field | Medium - essential for debugging incorrect answers |
These practices take ten minutes to implement. The ones that matter most - handling media descriptions correctly and asking focused single-topic questions - are invisible in simple demos but show up immediately when real users ask real questions.
The Developer Preview API is live with a VS Code documentation knowledge base. Try it free at ragionex.com - the API key is on the homepage.
Related reading: for the architectural rationale behind these patterns, see Why We Don't Call an LLM at Query Time. Building agents that also need persistent memory across sessions? Add Persistent Memory to Cursor, Claude Code, and Windsurf in 3 Lines covers the same integration shape for the Memory product.
Have questions? Join the Ragionex Discord or email [email protected].