
Persistent memory without the vector database.

The default advice for ‘give my agent memory’ in 2026 is operationally heavier than the problem deserves. A three-way comparison of self-hosted vector stores, knowledge graphs, and managed semantic memory APIs - with a recommendation for which to reach for first.

If you have read three blog posts about AI agent memory in the last six months, you have read three variations of the same paragraph. Provision a vector database. Pick a retrieval model. Build an ingestion pipeline. Configure your index. Decide how to slice your documents. Manage your refresh cadence. The advice is technically correct - those are the moving parts of a self-hosted retrieval stack - but as default advice for “I just want my agent to remember the user's tone preferences” it is wildly overspecified. It is the architectural equivalent of being told to lay down a foundation slab before you can hang a picture frame.

The backlash, when it came, was loud. In March 2026, VentureBeat covered a Google PM open-sourcing an "always on memory agent" that ditched vector databases entirely in favour of a flatter file-based store. A Towards Data Science writeup of memweave - markdown plus SQLite, no vector DB required - hit the front page of the major aggregators inside forty-eight hours. The Hacker News response to both was a long, mostly-agreeing thread whose top comment is now part of the cultural record: "The vector database is the most overprovisioned dependency in modern AI infra." For a class of problems that is true. The question is which class. For most teams, the answer is a managed semantic memory API - the shape Ragionex exposes through POST /v1/memory/write and POST /v1/memory/search, in place of the self-hosted retrieval stack.

What “give my agent memory” actually means in production

Before recommending an architecture, it is worth being precise about the workload. When a developer says they want their agent to remember things, the requirements decompose into roughly three buckets, in descending order of frequency.

The most common case, by a wide margin, is preference and decision recall. The user told the agent on Tuesday that they prefer Postgres to MongoDB. Six weeks later, on a different machine, in a different repo, the agent should not have to ask again. The store is small (kilobytes per user), updates are infrequent, queries are by meaning rather than by exact key, and latency matters but not at the millisecond level. This is what most real agents need most of the time. It is also the bucket the "just stand up Pinecone" advice serves worst, because the operational floor of a self-hosted vector stack sits several orders of magnitude above what the workload demands.

The second case is document recall over a corpus the team controls. The agent needs to answer questions about a five-thousand-page manual, a customer's product docs, a research library. Volume is real, refresh cadence is non-trivial, the team has opinions about ranking. This is the case where a self-hosted vector stack starts to earn its keep, and where the tradeoff between latency, recall quality, and operational ownership becomes a real decision. We covered this side of the design space at length in RAG vs Context Engine; the short version is that document recall and agent memory are different workloads even when they look superficially similar.

The third case is relational reasoning - the agent needs to know that the user works at Acme, that Acme deploys on Cloudflare Workers, that Workers has a 50ms CPU limit, and to chain those facts together to answer a question that touches none of them directly. Similarity search struggles here in a way that experienced teams know about and beginners discover the hard way. Knowledge graphs are the right primitive for this, even though the schema design is a serious upfront investment.

Three workloads, three architectures, no single right answer. The mistake the industry made for two years was pretending that the answer to all three was “a vector database with a retrieval pipeline bolted on.” It is the right answer to one of them and an operational tax on the other two.

The honest three-way comparison

Here is the decision matrix the way I would draw it on a whiteboard, with the tradeoffs spelled out the way someone shipping production agents actually weighs them.

Self-hosted vector store (Pinecone, Weaviate, Qdrant, pgvector)

You stand up the index, you choose the retrieval model, you own the ingestion pipeline, you control the ranking. The ceiling on quality is high - if your team is willing to invest in tuning, recall can be excellent. The trade is operational: you are running a stateful service, you are managing a vector schema, and you have inherited a pile of decisions about document slicing, hybrid search, and re-ranking that you will be making opinionated bets on for as long as the system lives.

Reach for this when at least two of the following are true: a sub-50ms latency budget is a hard product requirement, recall ranking is differentiated value (not table stakes), the team has DBA-grade bandwidth to operate a stateful service, and the dataset is large enough that tuning actually matters.
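
To make the surface area concrete, here is a minimal pgvector sketch, assuming Postgres with the pgvector extension installed. The database name, table shape, dimension, and index choice are illustrative, not recommendations - and the embedding step, which you own forever, is reduced to a placeholder:

psql memories_db <<'SQL'
-- One-time setup: the schema and the index are yours to size and maintain.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE memories (
  id        bigserial PRIMARY KEY,
  content   text NOT NULL,
  embedding vector(1536)  -- dimension must match your embedding model
);
CREATE INDEX ON memories USING hnsw (embedding vector_cosine_ops);
SQL

# Every write runs through your ingestion pipeline: call the embedding
# model, then insert the vector alongside the text. Every read embeds
# the query the same way and ranks by cosine distance. The '[...]' is
# a stand-in for a full 1536-dimension vector literal.
psql memories_db -c \
  "SELECT content FROM memories ORDER BY embedding <=> '[...]'::vector LIMIT 3;"

None of this is hard on day one; the cost is that every line above is now a production dependency you operate.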

Knowledge graph (Letta-style, Zep's Graphiti, Neo4j-backed)

If your agent's value is in connecting facts rather than recalling them in isolation, the graph is the right shape. "User works at Acme" and "Acme runs on Cloudflare Workers" are two edges in a graph, and the question "does the user need to worry about Workers CPU limits" is a multi-hop traversal that no similarity score on its own will reliably find. Letta has built its identity around this argument, and the argument is sound for the workloads it targets.
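
To make the shape concrete, here is that example as a Cypher sketch against a Neo4j-backed store. The labels, relationship types, and properties are invented for illustration - in a real system they come out of an extraction pipeline, not out of a blog post:

cypher-shell <<'CYPHER'
// Three extracted facts, stored as nodes and edges.
MERGE (u:User {id: 'u1'})
MERGE (a:Org {name: 'Acme'})
MERGE (w:Platform {name: 'Cloudflare Workers', cpu_limit_ms: 50})
MERGE (u)-[:WORKS_AT]->(a)
MERGE (a)-[:DEPLOYS_ON]->(w);

// "Does the user need to worry about Workers CPU limits?" is a
// two-hop traversal; no single stored fact answers it directly.
MATCH (:User {id: 'u1'})-[:WORKS_AT]->(:Org)-[:DEPLOYS_ON]->(p:Platform)
RETURN p.name, p.cpu_limit_ms;
CYPHER

A similarity search over the three raw sentences would score each as weakly related to the question and none as an answer; the traversal composes them.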

The cost is the schema. Graph quality lives or dies by the entity extraction pipeline that turns raw text into nodes and edges, and that pipeline is not free. You are signing up for a long-term investment in entity resolution, edge-type taxonomy, and graph maintenance. For agents that need relational reasoning the investment pays off. For agents that need to remember that the user prefers JetBrains Mono, it is shooting an aphid with an artillery piece.

Managed semantic memory API

The third option, which only became a serious category in 2026, is to treat memory the way you treat email delivery or analytics: an HTTP endpoint that handles the storage, indexing, and retrieval as a managed service. Your agent makes one POST to write a fact and one POST to recall by meaning. There is no index to provision, no retrieval pipeline to tune, no ranker to maintain. The trade is the inverse of the self-hosted case: you give up some control over the ranking knobs in exchange for the rest of your team's afternoon.

Reach for this when team time is the scarce resource, when the workload is the "preference and decision recall" case described above, when you want one HTTP integration instead of a stateful service to operate, and when the operational floor of a self-hosted retrieval stack exceeds the actual problem you are solving. For most agents being built in 2026, that is simply an honest description of the workload.

The default architecture should match the median workload. The median workload is not a thousand-page manual. It is a hundred kilobytes of preferences and decisions per user.

Why this debate keeps coming back

The conflation of memory with vector databases happened because retrieval-augmented generation became the dominant pattern in 2023, and the industry reused the same architecture for an adjacent problem without checking whether the architecture actually fit. RAG over a customer's documentation is fundamentally a search problem over a large, slowly-changing corpus. Agent memory is fundamentally a small, append-frequently, query-by-meaning workload. Forcing both through the same architecture leaves both badly served.

This matters operationally because the cost curves are different. A self-hosted vector index running 24/7 is a fixed cost regardless of whether anyone is querying it. A managed memory API only charges when there is a write or a read. For most agent workloads the duty cycle is low - a handful of writes per session, a handful of reads per turn - and the fixed-cost model is paying for capacity that is mostly idle.
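
To put hypothetical numbers on it (made up for illustration, not quoted prices): a small always-on index at, say, $70 a month costs $70 whether it serves one query or a million. An agent making five writes per session and three reads per turn, across two hundred sessions a month, generates a few thousand calls - at fractions of a cent per call, that is single-digit dollars, and it falls to zero when usage does. The numbers are invented; the shape of the comparison is not.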

It also matters for failure modes. When your self-hosted vector index goes down, your agent's memory disappears. When your ingestion pipeline gets a malformed document, your retrieval quality degrades silently for hours. When your retrieval model gets deprecated upstream, you have a re-ingestion project on the calendar. None of those failure modes are unique to memory-as-a-managed-service - the managed service has its own list - but the surface area is different, and the team's ability to debug and recover is different. The security model is also different, and worth thinking about explicitly.

What the integration actually looks like

If you have only ever done memory the self-hosted way, the managed-API version feels suspiciously small. It should. Here is the entire integration to give an agent persistent semantic recall scoped to a project:

# Write: persist one fact, scoped to a project.
curl -X POST https://api.ragionex.com/v1/memory/write \
  -H "X-API-Key: rgx_memory_..." \
  -H "Content-Type: application/json" \
  -d '{
    "content": "User prefers TypeScript strict mode on every new project. Has explicitly rejected loose mode three times in the past month.",
    "project": "user-preferences"
  }'

# Recall: query by meaning rather than by key.
curl -X POST https://api.ragionex.com/v1/memory/search \
  -H "X-API-Key: rgx_memory_..." \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How does the user feel about TypeScript configuration strictness?",
    "scope": "segment",
    "results": 3,
    "project": "user-preferences"
  }'

Two HTTP calls, no infrastructure to operate, no index to tune. The store handles the indexing in the background; on the agent side, write returns a memory id immediately and search returns matching memories ranked by relevance. If you need to reach across projects - the user's preference about TypeScript probably applies on every repo - you omit the project field on search and the API queries the global pool for that user.
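
That cross-project variant is the same search call with the project field omitted; everything else matches the scoped example above:

curl -X POST https://api.ragionex.com/v1/memory/search \
  -H "X-API-Key: rgx_memory_..." \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How does the user feel about TypeScript configuration strictness?",
    "scope": "segment",
    "results": 3
  }'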

When you actually do need a vector database

This is not a hit piece on Pinecone or Weaviate. They are excellent products that are appropriate for a class of problem they are designed for. If you are running retrieval over a multi-million-document corpus where ranking quality is differentiated value, you should not be reading a managed-memory blog post. You should be reading the Pinecone documentation, hiring someone who understands hybrid search, and budgeting for the ongoing operational cost of a stateful retrieval system. That is what those products are for, and they are good at it.

The argument here is narrower. The default advice for “my agent should remember things” should not be “stand up a vector database.” It should be “use a managed memory API for the small workload, and graduate to a self-hosted index when you have evidence that the small workload is not the workload you actually have.” The graduation point is a real engineering decision based on real data, not a cargo-culted starting condition.

The recommendation, unhedged

For 80% of agents being built in 2026, the right starting architecture is a managed semantic memory API with a free tier large enough to validate the use case. If, after a few weeks of real traffic, your workload turns out to be the multi-million-document case, you graduate to a self-hosted vector store with the data you have collected to justify it. If your workload is fundamentally relational - entity-entity facts that compose into multi-hop questions - you graduate to a knowledge graph and accept the schema-design cost. The mistake to avoid is the inverse: standing up Pinecone for a workload that is two megabytes of user preferences, then spending six weeks on operational scaffolding for a problem you do not have.

The amnesia problem most teams are actually solving is small. The default solution should be small to match. The vector database is not wrong as a primitive; it is wrong as a default. The deeper reframe of what is actually being asked here is worth its own treatment, but for the architectural decision in front of you today, the saner default is one HTTP call.

Ready to try it?

Free API key. No credit card. Start building in seconds.

Get Started