Jan 7, 2026

AI Agent Memory Benchmarks: Cognee vs. Mem0, Graphiti, and LightRAG

Vasilije Markovic, Co-Founder / CEO

Part of our complete guide to AI agent memory.

Every memory-for-agents vendor claims their approach makes agents smarter. Nobody agrees on what "smarter" means or how to measure it. So we published the measurements.

This post walks through two benchmark exercises the cognee team ran – one peer-reviewable paper across three multi-hop QA datasets, and one head-to-head against Mem0, Graphiti, and LightRAG on HotPotQA. The numbers, the methodology, and the caveats. Code is open; reproduce whatever you want to check. For the broader competitive landscape we covered earlier, see our AI memory tools evaluation and the August 2025 evals post.

If you want the concept side – what long-term knowledge is and why memory alone falls short – read about why memory alone isn't enough for reasoning agents first. This post assumes you know what a knowledge graph is and want to see whether it actually helps.

What we benchmarked and how

Two exercises, same codebase, different questions.

Exercise 1 – the paper (Markovic et al., 2025): can we tune the interface between a knowledge graph and an LLM to produce better answers on multi-hop QA? Multi-hop means the answer isn't in a single document – you have to connect facts across two or three sources. These are the questions knowledge graphs should be good at.

Three benchmarks: HotPotQA, TwoWikiMultiHop, MuSiQue. Parameters tuned across chunking, graph construction, retrieval strategy, and prompting. The optimization itself runs on Dreamify, our hyperparameter framework. Each configuration scored with exact match, F1, and DeepEval's LLM-based correctness metric.
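To make the tuning concrete, here is the general shape of such a sweep. Everything below is illustrative: the parameter names and the `evaluate` stub are stand-ins, not Dreamify's API, and the dummy score is only there so the sketch runs on its own.

```python
import itertools
import random

# Illustrative search space over the interface parameters the paper tunes.
SEARCH_SPACE = {
    "chunk_size": [256, 512, 1024],
    "retriever": ["graph_completion", "graph_completion_cot", "chunks"],
    "prompt": ["plain", "chain_of_thought"],
}

def evaluate(config: dict) -> float:
    """Stand-in for the real pipeline: build the graph with this config,
    answer the dev questions, score with EM / F1 / DeepEval correctness.
    Returns a dummy number here so the sketch is self-contained."""
    return random.random()

def grid(space: dict):
    keys = list(space)
    for combo in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, combo))

best_config = max(grid(SEARCH_SPACE), key=evaluate)
print("best (dummy) config:", best_config)
```

The real sweep runs through Dreamify rather than a hand-rolled loop like this, but whatever the search strategy, the shape is the same: propose a configuration, build the graph, answer the dev questions, keep the best scorer.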

Exercise 2 – the head-to-head: how does cognee's best configuration compare to other memory systems on the same question set? Subset of 24 HotPotQA questions, 45 repeated runs per system on Modal to absorb LLM-judge variance, same DeepEval scoring. Systems tested: Cognee, Mem0 (OpenAI-backed memory QA), Graphiti (LangChain + Neo4j), LightRAG (Falkor GraphRAG-SDK).
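The repeat protocol is easy to sketch. Again, everything is illustrative: `run_one_pass` stands in for sending the question set through one system and grading the answers with the DeepEval judge, and the dummy score exists only so the snippet runs.

```python
import random
from statistics import mean, stdev

N_RUNS = 45  # repeated passes per system, to absorb LLM-judge variance

def run_one_pass(system: str, questions: list[str]) -> float:
    """Stand-in for: answer every question with `system`, grade each answer
    with the DeepEval correctness judge, return the mean for this pass."""
    return random.gauss(0.8, 0.05)  # dummy score for illustration

def benchmark(system: str, questions: list[str]) -> tuple[float, float]:
    scores = [run_one_pass(system, questions) for _ in range(N_RUNS)]
    return mean(scores), stdev(scores)

avg, spread = benchmark("cognee", ["Which magazine was started first?"])
print(f"mean correctness over {N_RUNS} runs: {avg:.3f} (stdev {spread:.3f})")
```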

All code, configs, and datasets are in cognee/evals. The benchmarks page has the visualizations.

Finding 1: tuning the interface moves the needle

Baseline cognee runs use default settings – sensible, but not optimized. The paper applied hyperparameter search and measured the lift.

| Benchmark | Metric | Baseline | Optimized | Gain |
|---|---|---|---|---|
| HotPotQA | Correctness | 0.476 | 0.815 | +71% |
| HotPotQA | F1 | 0.169 | 0.840 | +397% |
| HotPotQA | Exact Match | 0.042 | 0.667 | +1,496% |
| TwoWikiMultiHop | Correctness | 0.348 | 0.582 | +67% |
| TwoWikiMultiHop | F1 | 0.148 | 0.625 | +322% |
| MuSiQue | Correctness | 0.414 | 0.674 | +63% |
| MuSiQue | F1 | 0.145 | 0.654 | +351% |

The takeaway here isn't that cognee magically produces correct answers; it's that the interface between a knowledge graph and an LLM has a lot of tunable surface – chunking size, retrieval strategy, prompt shape – and tuning it makes a meaningful difference. The F1 jumps in the table are not rounding errors. We pulled the same threads from a different angle in the art of intelligent retrieval.

The paper's own framing is worth quoting: "Future progress will depend not only on architectural advances but also on clearer frameworks for optimization and evaluation in complex, modular systems." Memory systems are systems, and they reward engineering.

Finding 2: the gains survive on unseen data

Training-set wins are easy to fake. Real systems have to generalize. The paper reports hold-out numbers on data the optimizer never saw.

| Benchmark | Metric | Train Set | Hold-Out |
|---|---|---|---|
| HotPotQA | Correctness | 0.815 | 0.715 |
| HotPotQA | F1 | 0.840 | 0.819 |
| MuSiQue | Correctness | 0.674 | 0.596 |
| TwoWikiMultiHop | F1 | 0.625 | 0.704 |

Correctness drops somewhat on HotPotQA and MuSiQue – that's expected; optimizers fit their training data to some degree. F1 on TwoWikiMultiHop actually increases on the hold-out, which is a sign the optimization picked up real structure, not artifacts. Either way, the gains over baseline survive.

This matters because memory systems are rarely evaluated with a proper train/hold-out split. "We got 0.8 on HotPotQA" without a hold-out number is suggestive, not conclusive.
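The mechanics aren't complicated: carve off a slice of questions before tuning and score it exactly once at the end. A minimal sketch (the split fraction and seed are illustrative, not the paper's):

```python
import random

def split_holdout(questions: list[dict], holdout_frac: float = 0.3, seed: int = 42):
    """Shuffle once with a fixed seed, then carve off a hold-out slice
    that the optimizer never sees during tuning."""
    rng = random.Random(seed)
    shuffled = list(questions)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]  # (train, hold_out)

train, hold_out = split_holdout([{"q": f"question {i}"} for i in range(100)])
print(len(train), len(hold_out))  # tune on 70, report once on 30
```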

Finding 3: structured recall beats chunk retrieval head-to-head

On the 24-question HotPotQA subset, cognee was run with the configuration that won Exercise 1 – the GRAPH_COMPLETION_COT retriever (graph traversal plus chain-of-thought prompting) on the tuned chunking and prompt settings from the paper. Each competitor was run on its own published defaults – the same setup a developer would get out of the box. Each system was repeated 45 times to absorb LLM-judge variance.

| System | Configuration | Human-like Correctness | DeepEval Correctness | F1 |
|---|---|---|---|---|
| Cognee | GRAPH_COMPLETION_COT, tuned | 0.93 | 0.85 | 0.84 |
| Graphiti | LangChain + Neo4j, default | 0.88 | 0.74 | 0.70 |
| LightRAG | default | 0.96 | 0.67 | 0.09 |
| Mem0 | OpenAI memory QA, default | 0.72 | 0.54 | 0.12 |

For comparison, the Exercise 1 paper measured baseline (untuned) cognee on full HotPotQA at 0.476 DeepEval Correctness – well below the optimized 0.815 number on the same dataset. The head-to-head subset isn't directly comparable to the full-dataset paper numbers, but the gap between baseline and tuned cognee gives a sense of how much of the 0.85 here comes from the framework versus the configuration. Most of the gap to Mem0 isn't tuning – it's structure.

Two things worth being careful about. First, the configuration disclosure above matters: if a reader's takeaway is "cognee beats Graphiti by 11 points on correctness," they should know cognee was running tuned and Graphiti wasn't. We didn't run a tuning sweep on the competitors – that would be its own paper. Run your own.

Second, LightRAG's Human-like Correctness is 0.96, just above cognee's 0.93. Looks close. But LightRAG's F1 is 0.09. That combination means verbose, approximate answers that happen to read naturally while being mostly wrong on specifics. It's a real LLM-judge artifact – LLM judges reward fluency more than they should – and it's why reporting multiple metrics matters.

The honest caveats

Three, straight from the paper and the eval README.

This is cognee's benchmark, not an independent study. We built the harness and we published the numbers. That is useful but not neutral. The mitigation is that everything – datasets, code, configs – is open and the team publishes reproduction instructions. Run it yourself on a different LLM or a different dataset and see what you get.

The gains are consistent but not uniform. The paper is explicit: "performance varying across datasets and metrics." HotPotQA saw a 1,496% EM jump; TwoWikiMultiHop saw smaller lifts. A single stat on a single benchmark rarely generalizes. The three-benchmark picture does.

LLM-as-judge has variance. DeepEval's correctness metric uses an LLM to grade answers. The judge has its own biases – length preference, fluency preference, quirks with numeric answers. The 45-cycle repeat protocol absorbs some of this; the honest answer is that no metric is clean and you should look at several.

What this actually means for building agents

Read the tables plainly.

For multi-hop questions – the kind that require connecting facts across documents – structured recall (cognee, Graphiti) beats chunk-based recall (Mem0) by a wide margin on correctness. If your agent's job involves reasoning over history, the benchmarks say structure matters. The case for why structure matters lives in why memory alone isn't enough for reasoning agents.

For single-hop questions – "what did the user say about X" – the difference is smaller and the overhead of a knowledge graph may not pay off. A well-tuned vector store is simpler.

Between structured approaches, tuning the interface matters. Cognee's 63–71% correctness gains over its own baseline across HotPotQA, TwoWikiMultiHop, and MuSiQue suggest that picking a framework and running it on defaults leaves a lot on the table. The choice of retriever (graph completion vs. chunk-level vs. graph-summary), the chunking strategy, and the prompt template all move the score meaningfully.
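If you want to see where those knobs live in practice, the sketch below follows cognee's documented add / cognify / search pattern. The exact import path, enum members, and whether GRAPH_COMPLETION_COT is selectable this way depend on your cognee version, so treat it as a starting point and check the repo.

```python
import asyncio
import cognee
from cognee import SearchType  # import path and enum members vary by version

async def main():
    # Ingest a couple of documents and build the knowledge graph with defaults.
    await cognee.add("Arthur's Magazine was first published in 1844.")
    await cognee.add("First for Women was launched in 1989.")
    await cognee.cognify()

    # The retriever is one of the knobs discussed above; graph completion is
    # the graph-traversal family the winning configuration in this post uses.
    answer = await cognee.search(
        query_type=SearchType.GRAPH_COMPLETION,
        query_text="Which magazine was started first, Arthur's Magazine or First for Women?",
    )
    print(answer)

asyncio.run(main())
```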

Reproduce it yourself

The paper is at arxiv.org/abs/2505.24478. The benchmarks page with full visualizations is at cognee.ai/research-and-evaluation-results.

If you'd rather skip the numbers and understand what cognee does conceptually, that's the long-term knowledge post. For the architecture under the hood, read how cognee builds AI memory. If you want to try cognee directly:

Star cognee on GitHub. Run the benchmarks. Come back and tell us what you found.


FAQ

Why HotPotQA, TwoWikiMultiHop, and MuSiQue specifically? All three are multi-hop QA benchmarks – answers require connecting information across multiple documents. These are the questions knowledge graphs should be good at; if a structured memory system can't beat baseline on multi-hop, it doesn't deserve the complexity.

Why only 24 questions in the head-to-head? A full HotPotQA run is expensive and noisy. The head-to-head uses a carefully selected 24-question subset with 45 repeated runs per system (over 1,000 evaluations total) to absorb LLM-judge variance. The full-dataset paper numbers are in Exercise 1.

What does "DeepEval Correctness" actually measure? It's an LLM-graded correctness score – a judge model reads the predicted answer and the gold answer and decides whether they are materially the same. It's less brittle than exact match for open-ended answers, but carries judge-model bias. The benchmark reports it alongside F1, EM, and human-like correctness so readers can triangulate.
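For reference, this is roughly what an LLM-graded correctness metric looks like in DeepEval. The criteria string below is a generic example, not necessarily the one the cognee harness uses; see cognee/evals for the real configuration.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# A judge-model metric: an LLM compares the actual answer to the gold answer.
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent "
             "with the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="Which magazine was started first, Arthur's Magazine or First for Women?",
    actual_output="Arthur's Magazine",
    expected_output="Arthur's Magazine",
)
correctness.measure(test_case)  # requires an API key for the judge model
print(correctness.score, correctness.reason)
```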

Why does LightRAG score 0.96 human-like but 0.09 F1? LLM judges prefer fluent, verbose answers. LightRAG's outputs read naturally but miss specifics. Low F1 confirms the approximate quality; high human-likeness reveals the judge's bias. The pairing is informative, which is why both metrics are reported.
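The mismatch is easy to reproduce with the standard token-overlap F1 (a simplified version below; the real SQuAD-style metric also strips punctuation and drops articles). The two example answers are made up for illustration.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1: harmonic mean of precision and recall over
    whitespace-separated, lowercased tokens."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A terse, correct answer scores 1.0; a fluent paragraph that never names
# the entity scores 0.0, no matter how natural it reads to an LLM judge.
print(token_f1("Arthur's Magazine", "Arthur's Magazine"))  # 1.0
print(token_f1(
    "Both publications have long histories, and the earlier of the two "
    "appears to be the one launched in the nineteenth century.",
    "Arthur's Magazine",
))  # 0.0
```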

Are these numbers comparable to other published benchmarks? Partially. Same datasets, same broad methodology (DeepEval, multi-hop, 45-run protocol). But LLM-judge scores depend on the judge model and version – absolute numbers don't transfer cleanly across benchmarks run by different teams. Relative rankings within a single run are the safest comparison.

Does cognee always beat Mem0? On this benchmark, yes – across correctness, F1, and EM. On a different benchmark with different question shapes, results might differ. Run your own evaluation on questions that match your production use case before deciding.


Last updated: January 2026.

Cognee is the fastest way to start building reliable AI agent memory.
