Memory for AI Agents

Introduction

Ask most developers what makes an AI agent useful and they will describe its reasoning capability — the model’s ability to break down a problem, use tools, and produce a coherent output. That reasoning is real, but it is stateless. Every session, the agent starts over. It does not know that you prefer concise answers. It does not remember that you already tried the approach it is about to suggest. It does not know that the project it is helping you with changed direction last month.

Without memory, an agent is capable but contextless. It can reason well, but it cannot improve with use, adapt to the individual, or maintain continuity across time. Memory is what bridges the gap — between a tool you have to re-explain yourself to every session and a collaborator that actually knows you.

This article is about the mechanics of that bridge: not just what memory types exist (that was covered in the state management article), but how memory actually works across its full lifecycle. How experiences become stored memories. How memories are structured for retrieval. How an agent decides what to surface from its memory store at inference time. How memories stay accurate as facts change and time passes. And what goes wrong when any of these pieces are handled carelessly.

Memory vs. State — Where the Line Is

Before going further, it is worth being precise about how memory differs from state — because the two are related but not the same, and conflating them leads to architectural mistakes.

State is what an agent carries within a task. It is scoped to a single run: the conversation so far, the results of tool calls, the current step in a workflow. When the task ends, state is discarded. It is transient by design.

Memory is what persists beyond a task. It survives session boundaries and accumulates over time. The next time the agent works with the same user, memory is what it draws on to not start from zero.

The practical consequence: state and memory need different storage backends, different update frequencies, and different retrieval patterns. State is written continuously during a task and read at every step, so it needs low latency at high frequency. Memory is typically written at session boundaries or by a background process, read when a new session or task begins, and maintained asynchronously between sessions, so durability and retrieval quality matter more than raw speed.

The four memory types — in-context, episodic, semantic, and procedural — were introduced in the state management article. This article does not repeat that taxonomy in full, but builds on it: the focus here is on how each type is formed, stored, retrieved, and maintained, not just what it is.

1. How Memory Forms

Not everything an agent encounters becomes a memory. If it did, memory stores would grow enormous within days, full of noise and redundancy that would make retrieval worse, not better. Something has to decide what is worth preserving — and how to represent it in a form that will be useful when retrieved later.

The Extraction Problem

The raw material of agent memory is unstructured: conversations, tool outputs, documents, events. Useful memories are structured: discrete facts, typed records, named entities with relationships. The process of converting the former into the latter is called extraction, and it is one of the hardest problems in agent memory.

Consider a conversation where a user says: “I usually work in Python, though sometimes Go for performance-critical stuff. I find verbose explanations annoying — just give me the code and a short note on why.” That sentence contains at least three distinct, retrievable facts: a primary language preference, a secondary language preference, and a communication style preference. None of them are labeled. None are in a structured format. Extracting them requires understanding the sentence, identifying what is preference-relevant, and committing each fact as a separate, typed memory record.

This is why memory extraction is expensive — and why most systems that skip it end up with memory stores full of raw conversation summaries that are hard to query precisely.

Three Formation Triggers

Explicit saves are the simplest case: the user or agent deliberately commits something to memory. “Remember that I prefer bullet points.” “Note that this project uses Postgres, not MySQL.” Explicit saves are reliable — the user intended to create a memory — but rare. Most important context is never explicitly flagged.

Event-driven writes happen at defined moments: the end of a task, the close of a session, the resolution of a long-running workflow. At these points, a memory extraction process reviews what happened and decides what is worth preserving in long-term storage. This is a natural checkpoint: the task is done, there is no time pressure, and the context for deciding what matters is clear.

Continuous background extraction runs as an ongoing process parallel to the agent’s main work. A background component monitors recent interactions and extracts facts, preferences, and notable events without waiting for a task to complete. This approach captures memory in near-real time, at the cost of running more LLM inference calls and requiring careful deduplication to avoid creating redundant records.

Extraction Approaches

Summarization compresses a session or conversation into a prose narrative: a paragraph describing what was discussed, what was decided, what remains open. It is fast and readable. The limitation is retrieval precision — a summary captures the gist but buries the details. Searching for “user’s preferred Python testing framework” in a summary store is much harder than searching a structured record store where that fact exists as a discrete entry.

Fact extraction uses an LLM to identify and pull discrete, structured facts from conversation. The output is not a paragraph but a list of records: {type: "preference", content: "prefers bullet points over prose", confidence: 0.9, scope: "user"}. This is harder to generate and costs more upfront, but it pays back significantly at retrieval time — individual facts can be updated, superseded, or deleted without reprocessing entire conversations.

Entity extraction pulls named entities — people, organizations, projects, technologies — and the relationships between them. It forms the basis of graph-based memory systems, where memories are stored not just as facts but as a network of connected knowledge. Entity extraction is the most structured approach and enables a form of retrieval that the other two cannot: traversing relationships to find facts that are connected to what you’re looking for, even if they do not directly match your query.
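As a rough illustration of the fact extraction approach, the sketch below assumes the LLM has already been prompted to return a JSON array of records in the shape shown above. The `Fact` class, the `parse_facts` helper, the confidence cutoff, and the sample payload are all hypothetical; no real model is called.

```python
import json
from dataclasses import dataclass

@dataclass
class Fact:
    type: str
    content: str
    confidence: float
    scope: str

def parse_facts(llm_json: str, min_confidence: float = 0.5) -> list[Fact]:
    """Turn an LLM's JSON output into typed records, dropping weak or malformed ones."""
    facts = []
    for item in json.loads(llm_json):
        try:
            fact = Fact(
                type=str(item["type"]),
                content=str(item["content"]).strip(),
                confidence=float(item["confidence"]),
                scope=str(item["scope"]),
            )
        except (KeyError, TypeError, ValueError):
            continue  # skip records that are missing a field or have a bad value
        if fact.content and fact.confidence >= min_confidence:
            facts.append(fact)
    return facts

# Hypothetical model output for the Python / Go / terse-answers sentence quoted earlier.
sample = json.dumps([
    {"type": "preference", "content": "primary language is Python", "confidence": 0.9, "scope": "user"},
    {"type": "preference", "content": "uses Go for performance-critical work", "confidence": 0.8, "scope": "user"},
    {"type": "preference", "content": "wants code plus a short rationale", "confidence": 0.9, "scope": "user"},
    {"type": "preference", "content": "might like Rust", "confidence": 0.2, "scope": "user"},
    {"content": "record with missing fields"},
])
facts = parse_facts(sample)
```

Each surviving record is an independent unit, which is what makes later updates and deletions cheap compared with rewriting a summary.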

The Storage-Time vs. Retrieval-Time Tradeoff

A key insight from production memory systems: most implementations do too much work at retrieval time and not enough at storage time. If memories are stored as raw summaries, the system has to do heavy LLM processing at retrieval time to figure out what is relevant. If memories are stored as structured, deduplicated, relationship-linked facts, retrieval can be fast and high-quality — the hard work was done once when the memory was created.

The principle: invest in extraction quality at write time. It pays back on every subsequent inference call.

Deduplication and Merging

When the same fact is encountered multiple times — across sessions, or within a single long conversation — the memory store needs to decide: create a new record or update the existing one? Without deduplication, stores fill up with redundant memories that dilute retrieval results and inflate storage costs.

The simplest approach: before writing a new memory, check for semantically similar existing records. If a match is found above a similarity threshold, update the existing record rather than creating a new one. More sophisticated systems maintain a supersession chain — the old record is marked as superseded, and the new one points back to it — so the history is preserved even as the current fact is updated.
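A minimal sketch of the supersession-chain variant, assuming embeddings are already available as plain lists of floats. The `Record` class, the `write_memory` helper, the 0.9 threshold, and the toy vectors are illustrative assumptions, not values from any particular system.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    id: int
    content: str
    embedding: list[float]
    supersedes: Optional[int] = None   # id of the record this one replaces
    superseded: bool = False

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def write_memory(store: list[Record], content: str, embedding: list[float],
                 threshold: float = 0.9) -> Record:
    """Append a record; if a live record is similar enough, supersede it instead of duplicating."""
    live = [r for r in store if not r.superseded]
    match = max(live, key=lambda r: cosine(r.embedding, embedding), default=None)
    new = Record(id=len(store) + 1, content=content, embedding=embedding)
    if match is not None and cosine(match.embedding, embedding) >= threshold:
        match.superseded = True        # old record stays in the store for history
        new.supersedes = match.id
    store.append(new)
    return new

store: list[Record] = []
write_memory(store, "prefers bullet points", [1.0, 0.0, 0.1])
write_memory(store, "uses Postgres", [0.0, 1.0, 0.0])
latest = write_memory(store, "prefers bullet points over prose", [0.98, 0.05, 0.12])
```

The third write lands close to the first in embedding space, so it replaces it as the current fact while the original remains available for audit.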

2. How Memory Is Stored

Different memory types need different storage backends, and the choice of backend determines what retrieval strategies are available.

Storage Backends by Memory Type

In-context memory lives in the model’s context window and is not persisted anywhere. It is the agent’s immediate working surface, available instantly, and gone when the inference call ends. Everything else in this section is about memory that outlasts the context window.

Episodic memory — records of what happened, when, and with what outcome — is best stored in a relational database. Postgres or SQLite provide structured querying, time-range filtering, and strong consistency. A record of a past agent run, a conversation summary, or a significant user interaction belongs here.

Semantic memory (factual knowledge retrieved by meaning) is best served by a vector store. Pinecone, Weaviate, and Qdrant are dedicated vector databases, and pgvector adds vector search to Postgres; all are viable backends. Semantic memory is retrieved by embedding similarity: the current query is embedded, and the most similar stored facts are returned. This backend is the right choice for “what do I know about X?” queries.

Procedural memory — instructions, tool definitions, behavioral rules — lives in configuration files, system prompt templates, or a prompt management store. It is retrieved by lookup, not similarity, and changes infrequently.

Graph-based memory stores entities and relationships as nodes and edges in a knowledge graph. Neo4j, FalkorDB, and Amazon Neptune are common backends. Graph storage uniquely represents how facts relate to each other — not just that “the user works on a fintech product” and “the user’s team uses Postgres,” but that these two facts are connected through the same project entity. This enables relational retrieval that vector stores cannot provide.
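To make the relational idea concrete, here is a small sketch that uses an in-memory adjacency dictionary in place of a real graph database. The entity names, relation labels, and the `related_entities` helper are hypothetical; a production system would issue the equivalent query to its graph backend.

```python
from collections import deque

# Hypothetical entity graph: node -> list of (relation, neighbour).
graph = {
    "user:alex": [("works_on", "project:fintech-app")],
    "project:fintech-app": [("built_by", "team:payments")],
    "team:payments": [("uses", "tech:postgres")],
    "tech:postgres": [],
    "project:blog": [("uses", "tech:sqlite")],
    "tech:sqlite": [],
}

def related_entities(graph: dict, start: str, max_hops: int = 2) -> list[str]:
    """Breadth-first walk returning entities within max_hops of the start node."""
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for _relation, neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append((neighbour, depth + 1))
    return order

context = related_entities(graph, "user:alex", max_hops=3)
```

The Postgres fact is reached through the project and team nodes even though nothing links it to the user directly, which is the kind of connection a pure similarity search has no way to express.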

What a Well-Structured Memory Record Contains

The fields in a memory record are not arbitrary — each one enables a specific capability downstream:

content — the memory itself, as a structured fact or short narrative
type — episodic, semantic, or procedural; determines which retrieval strategy applies
scope — global (applies to all users), user-specific, session-specific, or agent-specific
source — where the memory came from: conversation, tool result, explicit user input, external document
created_at and updated_at — when the fact was first stored and last modified
valid_from and valid_until — for facts that are true for a period and then superseded; a fact without a valid_until is assumed to be currently true
confidence — how reliable this memory is, based on how it was extracted and from what source
retrieval_count — how many times this memory has been retrieved; used for importance scoring and decay

The valid_from / valid_until pair deserves special attention. Most memory systems store facts as if they are permanently true once written. Reality is messier — people change jobs, projects change direction, preferences evolve. A memory schema that supports temporal validity windows can represent that “the user worked at Company X as of January” without asserting it is still true in December.
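One way to express this schema, sketched as a Python dataclass. The field names follow the list above; the `is_valid_at` method and the example dates are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class MemoryRecord:
    content: str
    type: str                                 # "episodic", "semantic", or "procedural"
    scope: str
    source: str
    created_at: datetime
    updated_at: datetime
    valid_from: datetime
    valid_until: Optional[datetime] = None    # None means "assumed currently true"
    confidence: float = 1.0
    retrieval_count: int = 0

    def is_valid_at(self, when: datetime) -> bool:
        """True if the fact's validity window covers the given moment."""
        if when < self.valid_from:
            return False
        return self.valid_until is None or when < self.valid_until

# Hypothetical dates for the Company X example.
jan = datetime(2024, 1, 15, tzinfo=timezone.utc)
jul = datetime(2024, 7, 1, tzinfo=timezone.utc)
dec = datetime(2024, 12, 1, tzinfo=timezone.utc)

job = MemoryRecord(
    content="user works at Company X", type="semantic", scope="user",
    source="conversation", created_at=jan, updated_at=jul,
    valid_from=jan, valid_until=jul,
)
```

With the window closed in July, a December query can still see that the fact existed without treating it as current.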

Memory Layering in Practice

Most production agent systems use a layered storage architecture rather than a single backend:

Redis for current-session working state — sub-millisecond access, volatile by design
Postgres for structured episodic records — timestamped, queryable, durable
A vector store for semantic memory — similarity search, scalable to millions of records
A graph database (for more sophisticated systems) — relational knowledge, temporal facts, entity networks

The tradeoff between a managed memory service (Mem0, Zep) and a custom stack is control vs. operational burden. Managed services handle extraction, deduplication, and retrieval; custom stacks give full control over every layer at the cost of building and maintaining each component.

3. How Memory Is Retrieved

If memory formation is the write path, retrieval is the read path — and it is where most memory systems either succeed or fail in production. The challenge is not finding relevant memories; it is finding the right memories, for the right moment, within a fixed token budget, with low latency.

Bad retrieval is actively worse than no retrieval. Injecting irrelevant or contradictory memories into the context window adds noise that the model has to reason around, and it is not always good at ignoring it.

The Retrieval Window

At inference time, the agent has a fixed number of tokens it can allocate to injected memories. This budget is typically small relative to the full memory store — a few hundred to a few thousand tokens. Retrieval must identify the most relevant memories and compress them into that budget without losing what matters.

This constraint drives everything about retrieval design: what signals to use, how many memories to retrieve, how to rank them, and whether to summarize or inject verbatim.
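A simple way to respect the budget is greedy packing by score. The sketch below assumes memories arrive already scored; `estimate_tokens` is a crude stand-in (about four characters per token) for a real tokenizer, and both helper names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    # Rough approximation only; swap in the model's actual tokenizer in practice.
    return max(1, len(text) // 4)

def pack_memories(scored: list[tuple[float, str]], budget: int) -> list[str]:
    """Greedily keep the highest-scoring memories that still fit in the token budget."""
    chosen, used = [], 0
    for _score, text in sorted(scored, key=lambda pair: pair[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return chosen

candidates = [
    (0.91, "User prefers concise answers with code first."),
    (0.40, "User once asked about the history of the Python packaging ecosystem in detail."),
    (0.85, "Project uses Postgres, not MySQL."),
]
selected = pack_memories(candidates, budget=20)
```

Greedy packing is not optimal in general, but it is predictable and keeps the highest-ranked items, which is usually what matters for injection.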

Three Core Retrieval Strategies

Semantic search embeds the current query and retrieves memories whose embeddings are most similar by vector distance. It is the most common retrieval approach and works well for conceptual relevance — finding memories about a topic even when the exact words differ. Its weaknesses: it can miss exact matches (a specific project name, a precise technical term), it has no native sense of time, and it degrades as the vector store grows without curation.

Keyword search (BM25) uses term frequency and inverse document frequency to rank results by lexical match. It is fast, requires no LLM calls, and catches things semantic search misses: proper nouns, version numbers, specific identifiers. The weakness: it misses conceptual relevance — searching for “preferred output format” will not find a memory about “liking bullet points” unless those exact words appear.

Graph traversal follows entity relationships in a knowledge graph to surface connected facts. If the current task involves a project the agent has worked on before, graph traversal can pull not just memories explicitly about that project but also memories about the team, the tech stack, and the constraints — because those are connected entities in the graph. This is the retrieval strategy that most closely mirrors how humans recall related context.
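The keyword strategy is easy to show in a few lines. This is a compact sketch of Okapi BM25 with the commonly used smoothed IDF; whitespace tokenization, the `k1` and `b` defaults, and the sample memories are simplifying assumptions.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term that does not appear contributes nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

memories = [
    "project runs on postgres 15.4",
    "user likes bullet points",
    "user prefers python for scripting",
]
exact = bm25_scores("postgres 15.4", memories)
conceptual = bm25_scores("preferred output format", memories)
```

The version-number query hits the first memory cleanly, while the conceptual query scores zero everywhere because no term overlaps, which is exactly the gap semantic search is meant to cover.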

Hybrid Retrieval

The most reliable production approach combines all three strategies: semantic search for conceptual relevance, BM25 for exact matches, graph traversal for relational context. The results are merged, re-ranked, and filtered before injection.
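The text does not say how the merge is done. One common option is reciprocal rank fusion, shown here as an assumption rather than as what any specific product uses. The memory ids and the constant `k = 60` are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; items ranked highly in more lists rise to the top."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda m: scores[m], reverse=True)

# Hypothetical result lists from the three retrievers, best match first.
semantic_hits = ["m3", "m1", "m7"]
keyword_hits = ["m1", "m9"]
graph_hits = ["m1", "m3"]

merged = reciprocal_rank_fusion([semantic_hits, keyword_hits, graph_hits])
```

Rank-based fusion avoids having to put cosine similarities, BM25 scores, and graph distances on a common scale, since only positions are compared.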

Zep’s Graphiti implementation is a frequently cited example of this approach. It combines vector search, BM25, and graph traversal with no LLM calls at retrieval time, with reported results of 94.7% accuracy on the LoCoMo benchmark at roughly 155ms P95 latency. The absence of LLM calls at retrieval time is deliberate: it is what makes the latency low and predictable enough for production use.

Temporal Retrieval

A fact retrieved without temporal context may be true, outdated, or superseded. Good retrieval systems are temporally aware: they prefer recent memories over old ones when recency is relevant, they respect valid_until fields, and they surface when a memory is old enough to warrant verification rather than treating all memories as equally current.

This is especially important for facts that change: the user’s current role, the status of an ongoing project, a technology choice that may have been revisited. A memory system without temporal retrieval will surface these facts as if they are still true indefinitely.
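A minimal sketch of a temporal filter, assuming each memory is a dict with `updated_at` and `valid_until` keys. The `temporal_filter` helper, the 180-day verification threshold, and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def temporal_filter(memories: list[dict], now: datetime,
                    verify_after: timedelta = timedelta(days=180)) -> list[dict]:
    """Drop expired memories and tag old-but-valid ones as needing verification."""
    kept = []
    for m in memories:
        if m.get("valid_until") is not None and m["valid_until"] <= now:
            continue  # superseded or expired
        needs_check = now - m["updated_at"] > verify_after
        kept.append({**m, "needs_verification": needs_check})
    # Most recently updated first.
    return sorted(kept, key=lambda m: m["updated_at"], reverse=True)

utc = timezone.utc
now = datetime(2024, 12, 1, tzinfo=utc)
memories = [
    {"content": "user works at Company X",
     "updated_at": datetime(2024, 1, 10, tzinfo=utc),
     "valid_until": datetime(2024, 9, 1, tzinfo=utc)},
    {"content": "user works at Company Y",
     "updated_at": datetime(2024, 9, 1, tzinfo=utc),
     "valid_until": None},
    {"content": "user prefers Python",
     "updated_at": datetime(2024, 2, 1, tzinfo=utc),
     "valid_until": None},
]
current = temporal_filter(memories, now)
```

The stale employer record is dropped outright, while the old preference is kept but flagged so the agent can confirm it instead of asserting it.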

Importance-Weighted Retrieval

A memory that has been retrieved and acted on many times is likely to be relevant again. A memory that has never been retrieved despite existing for months is probably not useful. Importance scoring combines semantic similarity, recency, and retrieval history into a single ranking signal — preferring memories that are relevant and have a proven track record of usefulness.
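One possible shape for such a score, sketched below. The weights, the 30-day half-life, and the saturating usage term are assumptions chosen for illustration; real systems tune these against their own retrieval data.

```python
import math

def importance(similarity: float, age_days: float, retrieval_count: int,
               half_life_days: float = 30.0,
               weights: tuple[float, float, float] = (0.6, 0.25, 0.15)) -> float:
    """Blend similarity, recency, and retrieval history into one ranking signal."""
    recency = 0.5 ** (age_days / half_life_days)               # 1.0 when fresh, halves each half-life
    usage = 1.0 - 1.0 / (1.0 + math.log1p(retrieval_count))    # 0.0 when never used, approaches 1.0
    w_sim, w_rec, w_use = weights
    return w_sim * similarity + w_rec * recency + w_use * usage

fresh_unused = importance(similarity=0.8, age_days=0, retrieval_count=0)
old_unused = importance(similarity=0.8, age_days=30, retrieval_count=0)
old_popular = importance(similarity=0.8, age_days=30, retrieval_count=20)
```

Because the usage term saturates, a heavily retrieved memory gets a boost without being able to outrank a far more relevant one on popularity alone.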

The Injection Step

Retrieved memories are inserted into the context window before the model generates its response. Two common patterns:

A dedicated memory section in the system prompt: a block of structured facts prepended to the system message, presented as established context the model should take as given. Clean, predictable, easy to inspect.

Woven into the conversation: memories injected as assistant turns early in the conversation history, simulating a prior discussion. More natural-feeling, harder to audit.

The system prompt approach is generally preferable for production systems because it is explicit and debuggable. When something goes wrong, you can read the system prompt and see exactly what the model was told.
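The system prompt pattern can be as simple as the sketch below. The section heading text and the `build_system_prompt` helper are hypothetical; the point is only that the injected block is plain, inspectable text.

```python
def build_system_prompt(base_prompt: str, memories: list[str]) -> str:
    """Append retrieved memories to the system prompt as a clearly delimited block."""
    if not memories:
        return base_prompt
    lines = "\n".join(f"- {m}" for m in memories)
    return f"{base_prompt}\n\nKnown context about this user:\n{lines}"

base = "You are a coding assistant."
prompt = build_system_prompt(base, [
    "User prefers concise answers",
    "Project uses Postgres, not MySQL",
])
```

Logging the final prompt string alongside each response gives a direct record of what the model was told when a memory-related bug is being investigated.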

Retrieval Failure Modes

Context pollution — too many memories retrieved, filling the token budget with marginally relevant content that dilutes the signal
Retrieval gaps — the right memory exists but is not retrieved because the query embedding was not close enough, the exact keyword was not matched, or the memory was ranked below the cutoff
Stale retrieval — an outdated fact is retrieved and treated as current; no temporal filter caught it
Contradictory retrieval — two memories that conflict with each other are both retrieved and injected; the model is left to decide which to believe, and it may not choose correctly

4. Memory Across Sessions

Cross-session memory is what makes an agent feel like it actually knows you. It is the difference between a conversation that starts from scratch every time and one that picks up where the last one left off.

What Should Cross Session Boundaries

User preferences and communication style — how the user likes to receive information, their level of expertise in relevant domains, their formatting preferences. These change slowly and are broadly applicable.

Established context — the user’s role, the projects they work on, the team they are part of, the tech stack they operate in. This is the background knowledge that makes the agent’s responses relevant without requiring re-explanation.

Past decisions and their outcomes — what the user tried, what worked, what did not. An agent that can say “you tried approach A last month and ran into problem X — that might not be the right direction here” is qualitatively more useful than one that can only reason from the current conversation.

Open threads — tasks that were started but not completed, questions that were deferred, follow-ups the user mentioned. These are easily lost without explicit cross-session tracking.

What Should Not Cross Session Boundaries

Task-specific working state — the intermediate results, tool outputs, and scratchpad content from a completed task. Once the task is done, this is no longer useful and adds noise to future retrieval.

Sensitive one-time context — a password mentioned in passing, a private detail that came up incidentally. This should never be committed to long-term memory.

Time-bound information — current news, a deadline that has passed, a metric as of a specific date. Storing these as if they are permanently true creates stale memories that mislead the agent months later.

Session Summarization: Narrative vs. Fact Extraction

At the end of a session, the agent needs to decide what is worth carrying forward. Two approaches:

Narrative summarization produces a paragraph describing what happened: topics covered, decisions made, questions resolved. It is fast to generate and easy to read. The limitation is retrieval — a paragraph about a session is hard to query precisely. Asking “what did the user say about their testing framework?” requires reading the whole summary to find out.

Fact extraction produces a list of structured records distilled from the session. Each record is a discrete, queryable fact: scoped, typed, timestamped. This is harder to generate (it requires more LLM inference), but each fact is independently retrievable, updatable, and expirable. When the user’s preferences change, only the relevant records need to be updated — not the whole summary rewritten.

In practice, many systems use both: a narrative summary for human-readable session history, and extracted facts for machine retrieval.

The Cold Start Problem

A new user has no memory history. The agent has no personalization to draw on and no established context to reference. Strategies for handling this gracefully:

Explicit onboarding — ask the user a small number of high-value questions at the start of the first session (role, domain, preferences) and immediately commit the answers to memory. Even three or four facts dramatically reduce the cold start effect.

Fast inference from early interactions — treat the first few exchanges of a new session as particularly information-rich. Watch for signals: the vocabulary the user uses, the questions they ask, the level of detail they provide. Commit inferences to memory quickly, with lower confidence, and update them as they are confirmed or contradicted.

Layered retrieval — even with no user-specific memory, the agent can draw on global semantic memory (domain knowledge, world facts, general expertise) while user-specific memory is still sparse. The personalization layer builds over time; the knowledge layer is available from day one.

5. Forgetting and Memory Management

Memory without management becomes a liability. Stale facts mislead the agent. Contradictory memories produce inconsistent behavior. Redundant records degrade retrieval quality by diluting relevant results with noise. An ever-growing memory store that is never curated is worse than a small, well-maintained one.

Forgetting is a feature, not a failure. The goal is not to remember everything — it is to remember what is true, relevant, and useful.

TTL and Expiry

Every memory should have a defined lifespan, and that lifespan should be calibrated to the type of fact:

Task checkpoints and working state: hours
Session summaries: days to weeks
Preferences and communication style: months
Core user context (role, domain, recurring projects): indefinite, but subject to periodic review
Time-bound facts (a deadline, a current metric, a status update): set to expire at the relevant date

TTL is the simplest and most reliable management strategy because it requires no judgment at eviction time — the memory simply expires and is removed or archived. The hard part is calibrating the right TTL for each memory type, which requires observing how quickly facts in each category actually become stale in your specific domain.
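A sketch of a per-type TTL policy follows. The category names and the specific durations are illustrative picks within the ranges listed above, not recommendations; `expires_at` and `is_expired` are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative lifespans only; calibrate these for your own domain.
TTL_BY_KIND = {
    "task_checkpoint": timedelta(hours=6),
    "session_summary": timedelta(days=14),
    "preference": timedelta(days=180),
    "core_context": None,            # indefinite, subject to periodic review
}

def expires_at(kind: str, created_at: datetime,
               explicit_expiry: Optional[datetime] = None) -> Optional[datetime]:
    """Return when a memory should expire, or None if it has no hard expiry."""
    if explicit_expiry is not None:  # time-bound facts carry their own date
        return explicit_expiry
    ttl = TTL_BY_KIND[kind]
    return None if ttl is None else created_at + ttl

def is_expired(expiry: Optional[datetime], now: datetime) -> bool:
    return expiry is not None and now >= expiry

created = datetime(2024, 1, 1, tzinfo=timezone.utc)
summary_expiry = expires_at("session_summary", created)
context_expiry = expires_at("core_context", created)
```

Keeping the policy in one table makes it easy to adjust a category's lifespan once you have observed how quickly its facts actually go stale.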

Decay Scoring

Decay scoring does not delete memories — it de-prioritizes them. Memories that are never retrieved gradually score lower in retrieval rankings. Memories that are retrieved frequently stay prominent. This mirrors how useful information stays accessible in human memory while disused knowledge fades.

A reasonable production policy: combine TTL for hard expiry of clearly time-bound facts, with LRU-style decay for everything else. TTL bounds storage costs; decay bounds retrieval noise. Together, they handle most of the memory management problem without requiring active curation of individual records.

Active Supersession

When a new fact contradicts an existing memory, the old memory must be explicitly superseded — not just overwritten. Overwriting loses the history; leaving both creates contradiction. Supersession marks the old record as no longer current while preserving it in the store with a valid_until timestamp and a pointer to its replacement.

The three-step process for every memory write:

Detect: compare the incoming fact against existing memories using semantic similarity and entity matching. Does this new fact conflict with anything already stored?
Adjudicate: determine whether the new fact contradicts the existing one or merely adds to it. Not all overlap is conflict. “User works at Company X” and “User is a senior engineer at Company X” are compatible; “User works at Company X” and “User works at Company Y” are not.
Resolve: if it is a contradiction, mark the old record as superseded (valid_until = now), write the new record with a supersedes pointer, and preserve both for audit purposes.
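The three steps can be sketched as follows. To stay self-contained, detection here matches on entity and attribute instead of embeddings, and adjudication relies on a hypothetical `SINGLE_VALUED` set of attributes that can hold only one value at a time; both are simplifying assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Fact:
    id: int
    entity: str
    attribute: str
    value: str
    valid_until: Optional[datetime] = None
    supersedes: Optional[int] = None

SINGLE_VALUED = {"employer"}   # hypothetical: attributes where a new value replaces the old

def write_fact(store: list[Fact], entity: str, attribute: str, value: str,
               now: datetime) -> Fact:
    # Detect: live facts about the same entity and attribute.
    overlapping = [f for f in store
                   if f.valid_until is None and f.entity == entity and f.attribute == attribute]
    new = Fact(id=len(store) + 1, entity=entity, attribute=attribute, value=value)
    for old in overlapping:
        if old.value == value:
            return old                     # exact duplicate, nothing to write
        # Adjudicate: differing values conflict only for single-valued attributes.
        if attribute in SINGLE_VALUED:
            # Resolve: close the old record and link the new one back to it.
            old.valid_until = now
            new.supersedes = old.id
    store.append(new)
    return new

now = datetime(2024, 12, 1, tzinfo=timezone.utc)
store: list[Fact] = []
write_fact(store, "user", "employer", "Company X", now)
write_fact(store, "user", "title", "senior engineer", now)   # compatible addition
latest = write_fact(store, "user", "employer", "Company Y", now)
```

Both employer records remain in the store afterwards; only the validity window and the `supersedes` pointer distinguish the current fact from its history.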

Summarization and Consolidation

As episodic memory grows — dozens of session records, hundreds of extracted facts — the retrieval signal degrades. Many granular memories about the same topic can be merged into a single higher-level record that captures the essential information without the noise.

Consolidation is typically done asynchronously: a background process identifies clusters of related memories, generates a consolidated summary, writes it as a new memory with higher confidence, and marks the constituent records as archived. The consolidated record is what gets retrieved in future sessions; the originals are kept for audit purposes but de-prioritized.

The Problem of Slow Forgetting

The most common real-world failure mode in production memory systems is not forgetting too aggressively — it is remembering outdated facts too long.

An agent that learned “user works at Company X” in January and stores it without a validity window will still be referencing it confidently in December, long after the user changed roles. The agent is not wrong because it forgot; it is wrong because it never forgot. The outdated memory is retrieved with the same confidence as one written yesterday.

The fix is not complicated, but it requires deliberate policy: TTLs on facts that have natural expiries, periodic review of high-importance memories, and a mechanism for users to see and correct what the agent believes about them.

6. Memory for Personalization

The most visible payoff of long-term memory — and the one most users notice — is an agent that adapts to the individual over time without being explicitly retrained.

What Personalization Memory Looks Like

Communication preferences: how the user wants information delivered. Prefers bullet points or prose? Wants code examples or conceptual explanations? Finds caveats useful or annoying? These preferences are low-cost to store and high-value to retrieve — they apply to almost every response.

Domain expertise: what the user already knows. An agent that pitches its explanations at the right level — not over-explaining to an expert, not under-explaining to a newcomer — is qualitatively more useful than one that calibrates fresh every session. Expertise levels often differ by domain: a user might be a senior engineer in Python but a beginner in machine learning.

Recurring context: the persistent background of the user’s work. The projects they are involved in. The team they collaborate with. The tech stack they operate in. The constraints they work under. This context applies across many different tasks and should not need to be re-established in every conversation.

Past decisions and outcomes: what the user tried, what they chose, what worked, and what did not. An agent with this memory can avoid recommending approaches the user already rejected, reference decisions they made together, and build on prior work rather than starting from scratch.

The Feedback Loop

Personalization memory improves through use — but only if the agent has a mechanism for converting user feedback into durable memory.

Explicit feedback is the clearest signal: the user says “that was too long,” “I already know this,” or “please stop adding disclaimers.” Each of these should trigger an extraction that stores a preference record and adjusts future behavior. The loop is: feedback → extraction → stored preference → adjusted behavior.

Implicit signals are subtler but richer. The user regularly edits a specific part of the agent’s output — that is a preference signal. The user never uses the agent’s recommended tool and always substitutes a different one — that is a pattern worth storing. Capturing implicit signals requires deliberate instrumentation: logging what the user does with the agent’s output, not just what the agent produced.

Outcome tracking closes the loop over longer timescales: the agent records what advice it gave, and where observable, what happened as a result. An agent that can reason about the outcomes of its past recommendations — and update its approach when they were wrong — is a qualitatively different kind of system.

Risks of Memory-Driven Personalization

Entrenchment: if the agent forms a wrong assumption about a user early on, every subsequent interaction may reinforce rather than correct it. The model retrieves the wrong memory, acts on it, and generates another interaction that confirms the wrong pattern. Breaking out of this requires either the user explicitly correcting the memory, or the agent having a mechanism to question its own stored assumptions.

Staleness: people change. A communication preference from a year ago may no longer reflect who the user is today. A domain expertise level stored when the user was a beginner may still be retrieved long after they became an expert. Memory without expiry or periodic review is a snapshot of who the user was, not who they are.

Over-personalization: an agent tuned too tightly to a user’s past preferences may stop offering alternatives, challenging assumptions, or suggesting approaches outside the established pattern. Personalization should adapt the agent’s behavior, not constrain it.

User Control Over Memory

Users should be able to see what the agent remembers about them. This should not be a raw database dump but a readable summary of stored preferences, established context, and notable past interactions. They should be able to correct individual memories, mark preferences as no longer current, and delete what they do not want retained. This is both a good product design principle and, in many jurisdictions, a legal requirement: the GDPR, for example, gives individuals rights to access, correct, and erase personal data held about them.

7. Memory in Multi-Agent Systems

When multiple agents collaborate on a task, memory becomes a coordination problem. The question is no longer just what the agent remembers — it is what each agent in the system can see, what they share, and what happens when they write conflicting information to a shared store.

Private vs. Shared Memory

Private episodic memory belongs to a single agent and records its own past interactions and experiences. A customer support agent and a research agent working in the same system should not have access to each other’s interaction history — their episodic memories are their own.

Shared semantic memory is a common knowledge base that all agents in the system read from. Company policies, product documentation, user preferences at the task level, domain facts — these are global and should be consistently available to every agent that needs them.

Shared working context — the current task state, the outputs of prior steps — is passed explicitly as part of the handoff between agents, not stored in long-term memory. It is scoped to the task, not the agent.

Memory Handoff Patterns

When one agent delegates to another, the receiving agent needs sufficient context to do its job. Three patterns:

Summary injection: the orchestrating agent summarizes the relevant context and includes it in the task input to the sub-agent. Clean, controlled, and explicit — the sub-agent receives exactly what it needs and nothing it does not. The limitation is that the summary may miss context the orchestrator did not anticipate being relevant.

Shared memory access with scoped retrieval: both agents have access to the same memory store, but each queries it independently with their own retrieval logic. Flexible, but requires trust that the sub-agent’s retrieval will surface the right context.

Memory pass-through objects: a structured context object is built at the start of the task and passed through the agent pipeline, each agent adding its outputs as it goes. This is essentially state management applied to memory handoffs — predictable but requires a well-designed schema.
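The pass-through pattern might look like the sketch below. The `TaskContext` schema and the two toy agents are hypothetical stand-ins; real agents would call a model where these functions build strings.

```python
from dataclasses import dataclass, field

@dataclass
class TaskContext:
    task_id: str
    goal: str
    user_facts: list[str] = field(default_factory=list)        # memories loaded at task start
    step_outputs: dict[str, str] = field(default_factory=dict)  # one entry per agent

def research_agent(ctx: TaskContext) -> TaskContext:
    ctx.step_outputs["research"] = f"notes on: {ctx.goal}"
    return ctx

def writer_agent(ctx: TaskContext) -> TaskContext:
    # Downstream agents read both the injected memories and earlier outputs.
    style = "bullets" if "prefers bullet points" in ctx.user_facts else "prose"
    ctx.step_outputs["draft"] = f"{style} draft from {ctx.step_outputs['research']}"
    return ctx

ctx = TaskContext(task_id="t-1", goal="compare queue libraries",
                  user_facts=["prefers bullet points"])
for agent in (research_agent, writer_agent):
    ctx = agent(ctx)
```

Because every agent reads and writes the same typed object, the handoff is explicit and the full task history can be inspected at any point in the pipeline.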

The Shared-Write Problem

If multiple agents write to the same long-term memory store, one agent’s incorrect extraction can corrupt the information that all other agents retrieve. An agent that confidently stores a wrong fact — extracted from a misunderstood tool output, or hallucinated — creates a poison record that spreads through the system.

The safest mitigation: route all memory writes through a single trusted memory manager agent. No agent writes directly to long-term memory; instead, agents submit memory candidates to the manager, which validates, deduplicates, and applies contradiction checking before committing. This adds overhead, but it substantially reduces the risk of silent memory corruption in a multi-agent pipeline.
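A minimal sketch of that single write path might look like the following. Everything here is an assumption made for illustration: the `Candidate` fields, the confidence threshold, the exact-match deduplication, and the in-memory `store`. A production manager would likely use an LLM or embedding comparison for the duplicate and contradiction checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    subject: str        # e.g. "user:42"
    attribute: str      # e.g. "preferred_language"
    value: str
    source_agent: str
    confidence: float   # extractor's self-reported confidence, 0..1


class MemoryManager:
    """Single write path: validate, deduplicate, check contradictions, commit."""

    def __init__(self, min_confidence: float = 0.7):
        self.min_confidence = min_confidence
        self.store: dict[tuple[str, str], Candidate] = {}
        self.review_queue: list[tuple[Candidate, Candidate]] = []

    def submit(self, cand: Candidate) -> str:
        # Validation: drop empty or low-confidence candidates.
        if not cand.value.strip() or cand.confidence < self.min_confidence:
            return "rejected"
        key = (cand.subject, cand.attribute)
        existing = self.store.get(key)
        if existing is None:
            self.store[key] = cand
            return "committed"
        # Deduplication: same value (ignoring case and whitespace) is a no-op.
        if existing.value.strip().lower() == cand.value.strip().lower():
            return "duplicate"
        # Contradiction: hold for review instead of overwriting silently.
        self.review_queue.append((existing, cand))
        return "conflict"


manager = MemoryManager()
results = [
    manager.submit(Candidate("user:42", "preferred_language", "Python", "support_agent", 0.9)),
    manager.submit(Candidate("user:42", "preferred_language", "python ", "research_agent", 0.8)),
    manager.submit(Candidate("user:42", "preferred_language", "Rust", "research_agent", 0.95)),
    manager.submit(Candidate("user:42", "preferred_language", "Go", "research_agent", 0.3)),
]
print(results)
```

Queuing conflicts for review rather than letting the newest write win is a conservative choice; a system that trusts recency more could supersede the old value instead.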

Memory Isolation in Multi-Tenant Systems

In systems where multiple users’ agents share the same infrastructure, memory isolation is a correctness and security requirement. One user’s memories must never be retrievable by another user’s agent — not through sloppy scoping, not through embedding collision, not through a misconfigured namespace.

Memory scoping fields (user_id, session_id, agent_id, tenant_id) must be applied as hard filters on every retrieval query, not as soft ranking signals. A memory store that returns results across scope boundaries — even low-scoring ones — has a fundamental isolation failure.
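The difference between a hard filter and a soft ranking signal can be shown with a small in-memory sketch. The `MemoryRecord` shape and the `retrieve` function are hypothetical; in a real vector store the same idea would be expressed as a metadata filter applied before similarity search.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    tenant_id: str
    user_id: str
    text: str
    embedding: tuple[float, ...]


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(records, query_embedding, *, tenant_id: str, user_id: str, k: int = 3):
    # Hard filter first: out-of-scope records are never scored at all,
    # so they cannot appear in results no matter how similar they are.
    in_scope = [r for r in records if r.tenant_id == tenant_id and r.user_id == user_id]
    ranked = sorted(in_scope, key=lambda r: cosine(r.embedding, query_embedding), reverse=True)
    return ranked[:k]


records = [
    MemoryRecord("acme", "alice", "Prefers concise answers", (1.0, 0.0)),
    MemoryRecord("acme", "bob", "Prefers detailed answers", (1.0, 0.0)),
    MemoryRecord("globex", "alice", "Works on billing", (0.9, 0.1)),
]
hits = retrieve(records, (1.0, 0.0), tenant_id="acme", user_id="alice")
print([h.text for h in hits])
```

Making `tenant_id` and `user_id` required keyword-only arguments means a caller cannot issue an unscoped query by accident.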

8. Tooling

Mem0

Mem0 is a purpose-built memory layer for LLM applications. It handles the full memory lifecycle behind a single API: extracting facts from conversations, storing them at user, session, and agent scopes, deduplicating on write, and retrieving relevant memories at inference time. It integrates with LangChain, LlamaIndex, and direct API calls, and is maintained as an active open-source project.

The right choice for teams that want production-grade memory without building extraction, deduplication, and retrieval from scratch. Less control over the internals; significantly less work to get to a working system.

Zep / Graphiti

Zep’s underlying engine, Graphiti, implements a temporal knowledge graph architecture. Facts are stored as nodes and edges with explicit validity windows: when a fact became true, and when (if ever) it was superseded. Retrieval combines semantic vector search, BM25 keyword search, and graph traversal with no LLM calls at retrieval time, which is what makes the sub-200ms P95 retrieval latency Zep reports achievable.

The temporal knowledge graph approach is particularly good at representing how facts relate to each other and how those facts change over time. It is more infrastructure-intensive than Mem0, since it requires a graph database backend, but it provides capabilities that flat vector stores cannot: relational retrieval, temporal filtering, and supersession chains.

Well-suited for agents that need to reason about changing facts and entity relationships, or for enterprise deployments where auditability of memory state is important.
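To make validity windows and supersession chains concrete, here is a small sketch of the idea in plain Python. It is not Graphiti’s API or data model; `Fact`, `TemporalFacts`, `assert_fact`, `as_of`, and `history` are hypothetical names, and the sketch assumes facts are asserted in chronological order.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means the fact is still current


class TemporalFacts:
    def __init__(self):
        self.facts: list[Fact] = []

    def assert_fact(self, subject: str, predicate: str, value: str, at: datetime) -> None:
        # Supersede rather than delete: close the open window, keep the record.
        for f in self.facts:
            if (f.subject, f.predicate) == (subject, predicate) and f.valid_to is None:
                f.valid_to = at
        self.facts.append(Fact(subject, predicate, value, valid_from=at))

    def as_of(self, subject: str, predicate: str, when: datetime) -> Optional[str]:
        # Temporal filtering: return the value whose window contains `when`.
        for f in self.facts:
            if (f.subject, f.predicate) != (subject, predicate):
                continue
            if f.valid_from <= when and (f.valid_to is None or when < f.valid_to):
                return f.value
        return None

    def history(self, subject: str, predicate: str) -> list[str]:
        # Supersession chain, oldest first.
        chain = [f for f in self.facts if (f.subject, f.predicate) == (subject, predicate)]
        return [f.value for f in sorted(chain, key=lambda f: f.valid_from)]


kb = TemporalFacts()
kb.assert_fact("alice", "employer", "Acme", datetime(2023, 1, 1))
kb.assert_fact("alice", "employer", "Globex", datetime(2024, 6, 1))
print(kb.as_of("alice", "employer", datetime(2023, 12, 1)))
print(kb.history("alice", "employer"))
```

Because old facts are closed rather than removed, the store can answer both "what is true now" and "what was true then", which is the property that supports auditability.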

LangMem

LangMem is LangChain’s memory management library, integrated directly with LangGraph. It provides in-thread memory (within a single agent run) and cross-thread memory (across sessions) with configurable extraction and namespace scoping. It is the natural choice for teams already building on LangGraph who want memory to feel like a native part of the framework rather than an external service.

Letta (formerly MemGPT)

Letta is built around explicit, programmatic control over the context window. Rather than treating the context window as a passive container, Letta gives the agent agency over what enters and exits it — the agent can request that specific memories be loaded or unloaded as the task evolves. This is the right choice when context window management is the primary engineering challenge — long tasks, large knowledge bases, agents that need to actively reason about what they know and what they need to know.

Custom Implementations

For teams with specific requirements — strict data residency, unusual retrieval patterns, tight latency requirements — a custom stack remains viable. The typical composition: Redis for current-session state, Postgres for episodic records and structured facts, a vector store (pgvector, Qdrant, Pinecone, Weaviate) for semantic retrieval, and optionally a graph database (Neo4j, FalkorDB) for relational knowledge.

The tradeoff: full control over every component, at the cost of building and maintaining extraction, deduplication, contradiction resolution, and decay logic yourself. These are solved problems in dedicated tools — rebuilding them from scratch is only warranted when the constraints genuinely require it.

9. Challenges to Keep in Mind

Hallucinated Memories

An agent may “remember” things that were never stored — confabulating past interactions with the same confidence it expresses when retrieving real ones. This is not a retrieval failure; it is a model behavior failure. The model generates a plausible-sounding memory that does not exist in the store.

The mitigation is architectural: treat retrieved memories as evidence to reason from, not as ground truth to assert. When the agent cites a memory, that memory should be traceable to a stored record and auditable. If it cannot be traced, the claim should be flagged as uncertain rather than stated as fact.
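One way to make cited memories auditable is to have the agent tag each recalled claim with a memory ID and verify those IDs after generation. The sketch below assumes a hypothetical `[mem:ID]` citation convention and a plain dictionary as the store; neither comes from a specific framework. It only checks claims the agent actually tagged, so untagged confabulations need a separate policy.

```python
import re

# Hypothetical citation format: the agent is prompted to tag recalled facts as [mem:<id>].
CITATION = re.compile(r"\[mem:([A-Za-z0-9_-]+)\]")


def audit_citations(answer: str, store: dict[str, str]) -> dict[str, list[str]]:
    """Split the memory IDs cited in an answer into verified and unverified."""
    cited = CITATION.findall(answer)
    return {
        "verified": [m for m in cited if m in store],
        "unverified": [m for m in cited if m not in store],
    }


store = {"m1": "User prefers concise answers"}
answer = "You asked for short replies [mem:m1] and you dislike tables [mem:m9]."
report = audit_citations(answer, store)
print(report)
```

Any ID in `unverified` points at a claim that cannot be traced to a stored record and is a candidate for being flagged as uncertain.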

Stale Memories

Facts that were true when stored may no longer be true. The agent has no built-in mechanism to detect this, so it will retrieve and act on a stale fact with exactly the same confidence as a current one. Temporal validity windows, TTLs, and periodic review are the structural defenses; for high-stakes memory stores, human audits add a further check. The failure mode, confidently acting on outdated information, is subtle and hard to detect without deliberate monitoring.

Contradictory Memories

Two stored facts can conflict, and without active contradiction resolution on writes such conflicts accumulate silently. The agent may give different answers on different days depending on which contradictory memory is retrieved, producing behavior that is inconsistent and hard to debug. Contradiction resolution at write time is significantly cheaper than trying to manage it at retrieval time.

Privacy and the Right to Be Forgotten

Long-term user memory is sensitive data. It must be encrypted at rest, access-controlled, and — critically — deletable. When a user requests deletion of their data, the deletion must be complete: not just de-indexed from the vector store, but removed from every layer of the storage architecture. Soft deletion (de-indexing without removal) leaves the data accessible to anyone with direct database access.

In jurisdictions with data protection regulations (GDPR, CCPA), the right to erasure applies to stored agent memories as much as to any other personal data. Build deletion into the architecture from the start; retrofitting it is significantly harder.
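A sketch of what "removed from every layer" can mean in code: each storage layer exposes a delete and a count operation, and erasure only succeeds if a follow-up count finds nothing. The `MemoryLayer` protocol, `InMemoryLayer`, and `erase_user` are hypothetical names, and the layers are in-memory stand-ins for a vector index, relational tables, caches, and so on.

```python
from typing import Protocol, Sequence


class MemoryLayer(Protocol):
    name: str

    def delete_user(self, user_id: str) -> int: ...
    def count_user(self, user_id: str) -> int: ...


class InMemoryLayer:
    """Stand-in for a real backend; rows are (user_id, payload) pairs."""

    def __init__(self, name: str, rows: list[tuple[str, str]]):
        self.name = name
        self.rows = rows

    def delete_user(self, user_id: str) -> int:
        before = len(self.rows)
        self.rows = [r for r in self.rows if r[0] != user_id]  # hard delete
        return before - len(self.rows)

    def count_user(self, user_id: str) -> int:
        return sum(1 for r in self.rows if r[0] == user_id)


def erase_user(user_id: str, layers: Sequence[MemoryLayer]) -> dict[str, int]:
    """Delete a user's records from every layer, then verify none remain."""
    removed = {layer.name: layer.delete_user(user_id) for layer in layers}
    leftovers = {l.name: l.count_user(user_id) for l in layers if l.count_user(user_id)}
    if leftovers:
        raise RuntimeError(f"Erasure incomplete: {leftovers}")
    return removed


layers = [
    InMemoryLayer("vector_index", [("alice", "emb-1"), ("bob", "emb-2")]),
    InMemoryLayer("episodic_table", [("alice", "chat-1"), ("alice", "chat-2")]),
]
receipt = erase_user("alice", layers)
print(receipt)
```

Registering every layer behind one interface is the point of the design: a new storage backend cannot be added without also supplying its deletion path.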

Scale and Retrieval Degradation

Retrieval quality does not improve monotonically as the memory store grows. A large, unmanaged store will return noisier, less relevant results than a smaller, well-curated one. Without active management (TTLs, decay scoring, consolidation), the signal-to-noise ratio in the store degrades over time, and the agent’s behavior degrades with it.

Memory management (forgetting, consolidating, superseding) is an ongoing operational requirement for any long-lived agent system. It is not something you set up once and forget.
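Decay scoring can be sketched as an exponential half-life applied to the retrieval similarity. The 30-day half-life, the pruning floor, and the `Memory` fields below are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Memory:
    text: str
    similarity: float         # relevance to the current query, 0..1
    last_accessed: datetime


def decayed_score(m: Memory, now: datetime, half_life_days: float = 30.0) -> float:
    # The score halves for every `half_life_days` since the memory was last used.
    age_days = (now - m.last_accessed).total_seconds() / 86400
    return m.similarity * 0.5 ** (age_days / half_life_days)


def prune(memories: list[Memory], now: datetime, floor: float = 0.05) -> list[Memory]:
    # Keep only memories whose decayed score is still above the floor.
    return [m for m in memories if decayed_score(m, now) >= floor]


now = datetime(2025, 1, 31)
fresh = Memory("Switched the project to Postgres", 0.6, now)
old = Memory("Project uses SQLite", 0.9, now - timedelta(days=60))
ancient = Memory("Evaluating databases", 0.9, now - timedelta(days=365))

ranked = sorted([old, fresh, ancient], key=lambda m: decayed_score(m, now), reverse=True)
print([m.text for m in ranked])
print([m.text for m in prune([old, fresh, ancient], now)])
```

Here the 60-day-old memory scores 0.9 × 0.25 = 0.225 and ranks below the fresh one despite its higher raw similarity, while the year-old memory falls under the floor and is pruned.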

Trust Calibration

Should the agent trust its own memories unconditionally? The answer is no — and the reasoning matters. A memory extracted from a conversation may have been extracted incorrectly. A fact stored months ago may have changed. A memory that has never been verified may be based on a misunderstanding.

In high-stakes contexts, retrieved memories should be surfaced to the user for confirmation before being acted on, especially for consequential decisions. An agent that says “based on what you told me last month, I recommend X — is that still accurate?” is more trustworthy than one that acts as if all its memories are authoritative facts.
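A confirmation policy along these lines could be sketched as a small gate in front of any memory-driven action. The `RecalledFact` fields, the 30-day freshness limit, and the wording of the prompts are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecalledFact:
    text: str
    stored_at: datetime
    verified: bool  # True if the user has explicitly confirmed this fact before


def needs_confirmation(
    fact: RecalledFact,
    now: datetime,
    *,
    high_stakes: bool,
    max_age: timedelta = timedelta(days=30),
) -> bool:
    # Low-stakes actions proceed without interrupting the user.
    if not high_stakes:
        return False
    # High-stakes actions require a fact that is both verified and recent.
    return (not fact.verified) or (now - fact.stored_at > max_age)


def phrase(fact: RecalledFact, now: datetime, *, high_stakes: bool) -> str:
    if needs_confirmation(fact, now, high_stakes=high_stakes):
        return f"Based on what you told me earlier ({fact.text}), is that still accurate?"
    return f"Proceeding on the basis that {fact.text}."


now = datetime(2025, 3, 1)
old_verified = RecalledFact("you deploy on Fridays", datetime(2025, 1, 1), verified=True)
fresh_verified = RecalledFact("you deploy on Fridays", datetime(2025, 2, 20), verified=True)
fresh_unverified = RecalledFact("you deploy on Fridays", datetime(2025, 2, 20), verified=False)
print(phrase(old_verified, now, high_stakes=True))
```

The thresholds are the part that needs tuning per application: asking too often makes the agent tiresome, and asking too rarely defeats the purpose of the check.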

Conclusion

Memory is what gives agents continuity — across sessions, across tasks, and across time. Without it, every conversation starts from zero. With it, an agent builds a working model of who it is collaborating with and what they are trying to accomplish, and that model gets more accurate the more it is used.

The implementation challenge is not storage — the databases and frameworks for that are mature and well-understood. The challenge is the full lifecycle: deciding what is worth remembering, extracting it into a form that can be precisely retrieved, surfacing the right context at the right moment, and maintaining the accuracy of the memory store as facts change and time passes.

Every stage of this lifecycle can fail in ways that are invisible without deliberate monitoring: stale memories retrieved as current, contradictions that accumulate silently, extraction that creates noise rather than signal, retrieval that surfaces the wrong context. The mitigations are known — temporal validity windows, contradiction resolution on write, hybrid retrieval, active decay and expiry — but they require deliberate engineering, not just the default behavior of whatever framework you are using.

Done well, memory makes an agent feel qualitatively different. Less like a tool you have to re-explain yourself to every time you open it. More like a collaborator that actually knows what you are working on, remembers what you have tried, and gets better the more you work together.