AI Agent Memory: A Complete Guide
An AI agent without memory starts every task from zero. It can't recall yesterday's customer, reuse last week's debugging insight, or notice that this is the third time the same bug has surfaced. Most agent failures in production trace back to this — not to the model, but to the system around it.
This guide covers the landscape of AI agent memory end-to-end: what the pieces are, how they fit together, how to build or buy one, and where each approach breaks. Each section orients; the linked deep-dives go further. If you want a narrative on-ramp instead, our short AI memory in five scenes post sketches the same territory with concrete vignettes.
What is AI agent memory?
AI agent memory is the infrastructure that lets an agent retain and recall information across a conversation, a session, or a lifetime. It spans three connected concerns: what the agent has just seen (context window), what it has seen before (past interactions), and what it has structurally learned about its domain (knowledge).
The word "memory" gets used loosely in agent literature. In practice it covers at least four distinct mechanisms — context windows, vector-indexed recall, structured knowledge graphs, and feedback-driven learning — each solving a different problem. Understanding which one you need is the first step to building an agent that doesn't start from zero every morning. We've written separately on why agent memory breaks when these layers are conflated.
The four kinds of memory an agent might use
Most working agent systems combine several of these. Each has a job.
Short-term memory is the current context window — everything the model can "see" right now. Useful for coherent conversations, useless after the session ends. Scales with token budget, not intelligence.
Long-term memory is persistent storage that survives across sessions. In most implementations it's a vector database holding chunks of past conversations, retrieved by similarity on the next question. This is what people usually mean when they say their agent "has memory."
Long-term knowledge is long-term memory with structure — entities, relationships, versions. It answers questions that require connecting facts across sources, not just recalling a paragraph. The distinction between memory and knowledge is the single most important one in this guide, and it has its own post on why memory alone isn't enough for reasoning agents.
Semantic, episodic, and procedural memory is a cognitive-science taxonomy applied to AI. Semantic = facts ("the customer's plan is Pro"). Episodic = past interactions ("last Tuesday we resolved a Redis timeout"). Procedural = learned behaviors ("when this error appears, roll back the config"). A mature agent memory system represents all three; we go deeper on the framing in cognitive architectures for language agents and LLM memory and cognitive architectures.
| Memory type | Storage | Retrieval | Lifetime |
|---|---|---|---|
| Short-term memory | Context window | Implicit (in-prompt) | Single session |
| Long-term memory | Vector DB | Similarity search | Persistent, flat |
| Long-term knowledge | Graph + vector hybrid | Graph traversal + similarity | Persistent, versioned |
| Feedback / procedural | Weighted graph / feedback log | Reinforcement on past outcomes | Continuously updated |
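To make the taxonomy concrete, here is a minimal sketch of how the three cognitive types could be represented as typed records. The classes and fields are hypothetical illustrations, not the schema of any particular memory library.

```python
# Illustrative typing of the three cognitive memory kinds from the
# table above; names and fields are hypothetical, not a real schema.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SemanticFact:          # facts: "the customer's plan is Pro"
    subject: str
    predicate: str
    value: str

@dataclass
class EpisodicEvent:         # past interactions, anchored in time
    summary: str
    occurred_at: datetime
    session_id: str

@dataclass
class ProceduralRule:        # learned behaviors, weighted by outcomes
    trigger: str
    action: str
    success_count: int = 0

memory = {
    "semantic": [SemanticFact("acme-corp", "plan", "Pro")],
    "episodic": [EpisodicEvent("resolved a Redis timeout", datetime(2025, 6, 3), "s-142")],
    "procedural": [ProceduralRule("config error E42", "roll back the config", success_count=3)],
}
```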
Memory vs. knowledge
Memory stores what was said. Knowledge captures what it means.
An agent with long-term memory can recognize a repeated question. An agent with long-term knowledge can answer a new one by connecting facts it never saw together in a single conversation. The difference is structure: chunks versus typed entities and relationships.
This distinction matters most on multi-hop questions — the kind that require stitching information across sources. For the full argument with concrete examples and code, read why structured knowledge outperforms flat memory on reasoning tasks.
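A toy contrast makes the structural difference visible. The data below is invented for illustration: flat recall can only surface chunks that resemble the question, while typed edges can join facts that never appeared together in a single chunk.

```python
# Invented example data. Flat memory stores chunks; structured
# knowledge stores typed (subject, predicate, object) edges.

chunks = [
    "Ticket #812: billing bug reported on the Acme account.",
    "Dana closed ticket #812 after patching the invoice job.",
]

# Flat recall: return chunks that look like the question
# (naive keyword overlap standing in for similarity search).
def flat_recall(question: str) -> list[str]:
    q = set(question.lower().split())
    return [c for c in chunks if q & set(c.lower().split())]

# Structured recall: traversal answers "who resolved the Acme billing
# bug?" even though no single chunk states it outright.
edges = [
    ("ticket:812", "reported_on", "account:acme"),
    ("ticket:812", "concerns", "topic:billing"),
    ("person:dana", "resolved", "ticket:812"),
]

def who_resolved(account: str, topic: str) -> list[str]:
    tickets = {s for s, p, o in edges if p == "reported_on" and o == account}
    tickets &= {s for s, p, o in edges if p == "concerns" and o == topic}
    return [s for s, p, o in edges if p == "resolved" and o in tickets]

print(who_resolved("account:acme", "topic:billing"))  # ['person:dana']
```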
Why a bigger context window isn't memory
The default response to "my agent forgets" is to throw more context at it. It works poorly for three reasons that compound.
Cost scales linearly with tokens. A million-token context, used seriously across many turns, is an expensive way to pretend you have memory.
Latency scales too. Agents that load a bloated context on every turn are noticeably slower, and the slowdown compounds on multi-step tasks.
Context windows read; they don't learn. Nothing is consolidated between turns, nothing is deduplicated, nothing improves over time. Rereading the same transcript a hundred times is not the same as understanding it once. We've argued this case at length in the context-engineering era.
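To put a number on the cost point, here is a back-of-envelope calculation. The price is an assumed illustrative rate, not any specific provider's; substitute your model's actual pricing.

```python
# Back-of-envelope cost of "context as memory". The rate is an
# assumption for illustration, not a specific provider's price.
price_per_million_input_tokens = 3.00   # USD, assumed
context_tokens = 1_000_000              # the whole "memory", resent every turn
turns = 50

cost = context_tokens / 1_000_000 * price_per_million_input_tokens * turns
print(f"${cost:.2f} to reread the same context {turns} times")  # $150.00
```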
A context window is working memory. Persistent memory is a different system.
RAG, vector search, and where they fall short
Retrieval-augmented generation pulls the top-k most similar chunks from a vector store and hands them to the model. It's a solid default for questions that match a paragraph verbatim — FAQ lookups, documentation search, knowledge-base retrieval. We covered the underlying mechanics in vector databases explained and vectors and graphs in practice.
It falls short on multi-hop reasoning. Ask a RAG system "who resolved the last billing bug on this account?" and you get back paragraphs that mention billing, leaving the model to stitch together a guess about the relationships. If three tickets describe the same incident from different angles, RAG has no concept that they're about the same thing.
The fix isn't replacing vector search — it's pairing it with something that tracks structure. Hybrid approaches retrieve chunks for fuzzy matches and graph edges for structural queries, using the right one per question. Cognee's GraphRAG approach is one concrete take on the hybrid pattern.
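In pseudocode, the routing idea looks something like the sketch below. `vector_search` and `graph_query` are hypothetical stand-ins for whatever stores you run, not cognee's API.

```python
# A sketch of the hybrid pattern: route structural questions to the
# graph, fuzzy ones to the vector store, and merge when unsure.
STRUCTURAL_CUES = ("who", "which", "when did", "how many", "last")

def retrieve(question: str, vector_search, graph_query) -> list[str]:
    fuzzy_hits = vector_search(question, top_k=5)
    if question.lower().startswith(STRUCTURAL_CUES):
        # Structural question: let graph edges answer the relationship,
        # keeping a couple of chunks for phrasing and evidence.
        return graph_query(question) + fuzzy_hits[:2]
    return fuzzy_hits
```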
What belongs in an AI agent knowledge base
A knowledge base for a human is a searchable pile of articles. A knowledge base for an AI agent is different: a structured, queryable record of what the agent needs to reason — not a library it has to re-read every turn.
Three layers of content belong in one. Reference data is the stable-ish domain record (customers, products, schemas). Operational data is the running history of what the agent has seen, decided, or done. Feedback data is the quality signal that tells the system which recalls were useful. Skip any layer and the agent plateaus.
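As a rough illustration, records in the three layers might be shaped like these. Field names are invented for the example; the linked deep-dive covers the real design.

```python
# Hypothetical record shapes for the three layers of an agent
# knowledge base; field names are illustrative only.
reference = {"customer": "acme-corp", "plan": "Pro", "schema_version": 4}

operational = [
    {"turn": 17, "action": "looked_up_invoice", "result": "found",
     "ts": "2025-06-03T10:12:00Z"},
]

feedback = [
    {"recall_id": "r-981", "was_useful": True, "signal": "user_thumbs_up"},
]
```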
For the full architecture, what actually goes into each layer, and how ingestion works, read about the three-layer structure of an agent knowledge base.
How memory systems compare: benchmarks and evidence
Every vendor claims their memory approach works. The honest question is: does it hold up when measured?
Cognee published a paper (Markovic et al., 2025) across three multi-hop QA benchmarks — HotPotQA, TwoWikiMultiHop, MuSiQue — showing that tuning the interface between a knowledge graph and an LLM produces consistent gains on correctness and F1. A separate head-to-head against Mem0, Graphiti, and LightRAG on a HotPotQA subset placed cognee's graph-completion retriever ahead on correctness. The paper's own caveat is honest: gains are consistent but not uniform across datasets. We've also written up earlier evaluation rounds in our AI memory tools evaluation and the August 2025 evals post.
For the full methodology, the head-to-head numbers, the honest caveats, and reproduction code, read about how memory systems compare on multi-hop QA.
Build patterns: frameworks, libraries, and homegrown
TL;DR: the ecosystem has split into three camps — block-style memory (Mem0, Letta), graph-based memory (cognee, Graphiti, Zep, LightRAG), and framework-bundled memory (LangChain, LangGraph, LlamaIndex). Pick based on how much structure your questions need.
The agent-memory ecosystem has matured into a few distinct camps. None is universally best.
Block-style memory. Mem0 and Letta (formerly MemGPT) store conversation blocks and retrieve them by similarity, sometimes with summary/consolidation layers. Simple integration, fast to ship, narrower ceiling on multi-hop reasoning.
Graph-based memory. Cognee, Graphiti, Zep, and LightRAG build typed entities and relationships, then retrieve via graph traversal in combination with vector search. Higher setup effort, better recall on questions that span sources.
Framework-bundled memory. LangChain, LangGraph, and LlamaIndex ship memory modules as part of a broader orchestration stack. Fine for single-stack teams, less flexible when you outgrow the abstractions.
Homegrown. Plenty of teams build on top of Postgres with pgvector plus some bespoke entity extraction. Works if you have the engineering capacity and specific needs no framework covers.
Pick based on the shape of your problem — how much structure your questions require, how much the data changes, and how much time you have to maintain your own plumbing.
When simpler memory solutions are enough
Not every agent needs a knowledge graph. Single-turn chatbots where the full context fits in the prompt don't. Short-lived agents with no cross-session state don't. FAQ bots over a static doc set don't. Small domains that fit in a hundred documents and rarely change usually don't.
If the agent can get away with "find the closest chunk and hand it to the model," it should. Structured memory earns its complexity on systems that accumulate state, reason across sources, or need to distinguish what was true last quarter from what's true now. For everything else, a well-tuned vector store and a clear retrieval prompt are cheaper and easier to maintain.
Getting started with cognee
Cognee is an open-source knowledge engine for AI agents — the tool this guide has been referencing throughout. It runs the Extract → Cognify → Load pipeline over whatever data you feed it, builds a typed knowledge graph paired with a vector index, and exposes a three-verb API. We walked through the architecture in detail in how cognee builds AI memory.
Install and run the loop:
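The sketch below shows the loop using the three documented verbs: add, cognify, search. Exact signatures can differ between versions, so check the docs linked below before copying.

```python
# pip install cognee
# Minimal sketch of the add -> cognify -> search loop; consult the
# docs for the exact signatures in your installed version.
import asyncio
import cognee

async def main():
    # Extract: feed raw data in.
    await cognee.add("Acme Corp upgraded to the Pro plan in June 2025.")
    # Cognify: build the knowledge graph and vector index.
    await cognee.cognify()
    # Query: ask a question against the graph + index.
    results = await cognee.search("What plan is Acme Corp on?")
    print(results)

asyncio.run(main())
```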
Star cognee on GitHub, read the docs, or check the benchmarks page for the full evaluation data.
FAQ
What is AI agent memory? AI agent memory is the infrastructure that lets an agent retain and recall information across turns, sessions, or a lifetime. It spans short-term (context window), long-term (persistent storage), structured knowledge (graph-based), and feedback (quality signals from past decisions).
What's the difference between short-term and long-term memory in AI agents? Short-term memory is the active context window — limited to the current session. Long-term memory persists across sessions, usually in a vector database. Long-term knowledge is long-term memory with structure — typed entities, relationships, versions — enabling multi-hop reasoning that flat retrieval can't handle.
Is RAG the same as AI agent memory? RAG is one retrieval strategy an agent memory system can use. It's not the whole thing. A production-grade memory system typically combines RAG with graph traversal, session caching, and feedback signals. Pure RAG is brittle on multi-hop questions and doesn't track change over time.
Do I need a knowledge graph for my AI agent? Not always. If your agent's domain is small and static, a well-tuned vector store is enough. Knowledge graphs pay off when the data changes, when answers require stitching across multiple sources, or when a wrong fact today comes from confidently quoting what was true last quarter.
How do AI agent memory frameworks compare — Mem0 vs. Letta vs. Zep vs. cognee? Roughly, Mem0 and Letta sit in the block-style camp (store and retrieve conversation blocks by similarity), while Zep, Graphiti, and cognee sit in the graph-based camp (typed entities and relationships plus vector search). On multi-hop benchmarks, graph-based approaches generally lead on correctness. The right fit depends on how structured your questions are and how much the data changes.
How do I give my AI agent persistent memory? Pick a memory system that matches your problem shape, integrate it at your agent's boundary (usually a remember call after each turn and a recall call before each response), and treat ingestion of new data as a standing process, not a one-time load. For a code-level walkthrough, our post on making your agent remember across sessions has a three-line example.
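In pseudocode, the boundary pattern looks like this; `memory.recall` and `memory.remember` are placeholder names for whatever system you picked, not a specific library's API.

```python
# Generic integration sketch: recall before responding, remember after.
def handle_turn(memory, llm, user_message: str) -> str:
    context = memory.recall(user_message)        # before: fetch relevant past
    reply = llm(user_message, context=context)   # respond with context in prompt
    memory.remember(user_message, reply)         # after: persist this turn
    return reply
```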
What are the types of long-term memory in AI agents? The common taxonomy, borrowed from cognitive science, lists semantic memory (facts), episodic memory (past interactions), and procedural memory (learned behaviors). A mature agent memory system represents all three, usually by typing entities and edges in a knowledge graph.
How do I evaluate an AI agent memory system? Run it on multi-hop QA benchmarks (HotPotQA, TwoWikiMultiHop, MuSiQue are common) with multiple metrics — exact match, F1, and an LLM-judge correctness score. Report both train and hold-out numbers. For a worked example, we wrote up the full methodology in our head-to-head evaluation against Mem0, Graphiti, and LightRAG.
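For reference, here is a simplified version of the token-level metrics; benchmark implementations additionally normalize punctuation and articles.

```python
# Simplified exact match and token-level F1, as used on
# HotPotQA-style QA benchmarks.
def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred: str, gold: str) -> float:
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Redis timeout", "a Redis timeout"))  # ~0.667
```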
Further reading
The cluster articles in this guide, each going deep on one subtopic:
- Why memory alone isn't enough for reasoning agents — the argument for the memory-vs-knowledge distinction, with a runnable code example.
- How memory systems compare on multi-hop QA — full methodology and numbers from cognee's paper and the head-to-head against Mem0, Graphiti, and LightRAG.
- The three-layer structure of an agent knowledge base — what actually goes in: reference, operational, and feedback data, and how ingestion works.
Related background reading on this site:
- How cognee builds AI memory — the Extract → Cognify → Load pipeline and the three-verb API in detail.
- Vectors and graphs in practice — when each store earns its place, and how cognee combines them.
- Cognitive architectures for language agents — the semantic / episodic / procedural framing applied to working systems.
- Why agent memory breaks — the failure modes that make this whole guide necessary.
- AI memory tools evaluation — an earlier head-to-head across the memory ecosystem.
External sources:
- Cognee research paper (Markovic et al., 2025) — hyperparameter optimization across three multi-hop QA benchmarks.
- Cognee benchmarks page — full evaluation data and reproduction instructions.
- Cognee on GitHub — source code and examples.
- Cognee documentation — installation, configuration, and integration guides.
Last updated: January 2026.