cognee on BEAM: SOTA Results Without a Benchmark-Specific Memory System
TL;DR: We beat the reported SOTA on BEAM's 100K-token setting (0.79 against 0.735) and achieved on-par performance with the current SOTA on BEAM 10M, without building a custom memory architecture for the benchmark. This post is about how we did it, where the caveats are, and what we think the result actually means.
Long-context benchmarks usually ask whether a model can find the right answer somewhere in a very large input. Because BEAM asks whether a system can keep track of a long conversation as it changes, we found it a more suitable and useful test for agent memory.
Since cognee is all about AI memory, context management, and information retrieval, we decided to run it against BEAM and see how we would fare.
Turns out we did pretty well. The result itself is only part of the story, though. As always, understanding how we got there is the more interesting part.
About BEAM
Here at cognee, we have mixed feelings about academic benchmarks. We're no strangers to them — in the early days, we ran them for hyperparameter optimization and even wrote a paper about it. However, they can contain noisy corpus text, overstate their purpose, and feature golden question-answer pairs of varying quality. BEAM is no exception, and I'll get to its quirks later.
That said, BEAM takes a more principled approach than most. The dataset is fully synthetic but built with real structure: conversation arcs first, then session interactions and topics, then user-assistant turns. This gives the creators meaningful control over how information is introduced, revised, contradicted, and distributed across the conversation.
It comes in sizes ranging from 100K to 10M tokens, and the questions cover ten memory abilities, from fact retrieval and event ordering to abstention and contradiction resolution. Each question comes with a gold-standard answer and a rubric, and answers are scored by an LLM using publicly available evaluation code.
Anyone can ingest a conversation, ask the questions, generate answers, and run the scoring. So that's what we did.
How we ingested the data — and what we didn't do
We ingested the conversations using cognee's default settings and standard open-source features available in our codebase, mostly through the remember operation. We did not train custom models, build BEAM-specific graph builders, or create specialized ingestion pipelines for the benchmark.
I want to be explicit about this because cognee can support much more customized, domain-specific pipelines for enterprise use cases. We didn't do any of that here. The goal was to see how our standard memory architecture handled BEAM, not to build a benchmark-shaped system around it.
On the ingestion side, the only BEAM-specific work was light data processing before ingestion.
Organization, Chunking, and Preprocessing
Each BEAM conversation consists of batches, each of which corresponds to one continuous exchange between a user and the assistant. At ingestion time, we treated each batch as a single document.
cognee operates by splitting documents into chunks and extracting knowledge graphs from them, so we treated each user-assistant turn as one chunk.
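To make the per-turn chunking concrete, here is a minimal sketch of how a flat list of messages could be grouped into one chunk per user-assistant turn. The `Message` type and the `chunk_by_turn` helper are hypothetical names for illustration only. They are not part of cognee's API, and the actual preprocessing may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Message:
    role: str      # assumed to be "user" or "assistant"
    content: str


def chunk_by_turn(messages: Iterable[Message]) -> list[str]:
    """Group messages into one text chunk per user-assistant turn."""
    chunks: list[str] = []
    current: list[str] = []
    for msg in messages:
        # A new user message closes the previous turn, if one is open.
        if msg.role == "user" and current:
            chunks.append("\n".join(current))
            current = []
        current.append(f"{msg.role}: {msg.content}")
    if current:
        # Flush the final turn.
        chunks.append("\n".join(current))
    return chunks
```

Keeping the user message and the assistant reply in the same chunk means a question and its answer are never separated at a chunk boundary.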
A small fraction of turns (less than 1%) contained noisy or repetitive text. We ran a single generic cleanup pass before ingestion, which wasn't tuned against BEAM's questions or answers.
Global Context Index
We also built the global context index during ingestion. This cognee feature is enabled by default and creates a tree-like structure over the ingested content.
The tree structure makes the index effective for BEAM, because many questions depend on temporal changes and on evidence spread across long spans of context.
Both of those features come in handy when dealing with long conversations.
Sessions and Knowledge Distillation
BEAM conversation batches correspond roughly to QA sessions in cognee. These are used in production conversation QA systems, where they preserve conversational continuity and distill the resulting knowledge into the graph for use in future conversations.
They do so by producing session learnings: synthesized memory items inferred from the session as a whole rather than copied directly from individual turns.
In a real cognee-powered system, memory would guide the assistant's responses as the conversation unfolds, which probably would have saved everyone a few turns, but then we'd have less benchmark to benchmark.
Because BEAM provides a completed conversation, we couldn't use session memory in the same way cognee would during a live interaction. Instead, we used the knowledge extraction and distillation system from the codebase to infer learnings from the finished conversation and write them back into memory.
BEAM quirks and musings on the fine art of benchmarking
BEAM is an ambitious dataset. Building something like it takes a lot of effort, and the many design decisions behind it inevitably introduce some quirks.
Corpus noise
A small number of user-assistant turns contained repeated random strings, nonsensical text, or leaked fragments of the plan they were generated from, with long assistant turns disproportionately affected. We cleaned these cases with a small LLM call for compression.
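As an illustration of how heavily repeated text could be flagged before any cleanup call, here is a rough sketch based on compression ratio. This is not what we ran: the `looks_repetitive` helper, its threshold, and its minimum length are assumptions made for the example. It only catches repetition, not nonsensical text or leaked plan fragments.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; lower means more repetitive."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0
    return len(zlib.compress(raw)) / len(raw)


def looks_repetitive(text: str, threshold: float = 0.2, min_chars: int = 200) -> bool:
    """Flag long texts that compress unusually well.

    The threshold and min_chars defaults are illustrative guesses, not tuned values.
    """
    return len(text) >= min_chars and compression_ratio(text) < threshold
```

Short turns are skipped on purpose, since compression ratios are unreliable on very small inputs.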
BEAM's golden answers also include references to supporting conversation turns. At this scale those references contain some errors and inconsistencies, so they are useful but too noisy to serve as the sole proxy for retrieval quality.
Rubrics and scoring
The rubrics' wording dictates not only what an answer should contain but also the form and format it should take, which can affect the final score.
Since BEAM doesn't make those requirements known upfront, others in the community have used custom prompts and retrieval strategies by question type. We did the same to make our numbers directionally comparable: we created separate prompts per question type and explored both overall and per-question hyperparameter optimization of our one-shot hybrid retriever on the 100K conversations.
Finally, because BEAM answers are judged by an LLM against a rubric, the scores carry additional noise, especially with smaller judging models.
From 100K to 10M tokens
We started with the 100K conversations to learn the benchmark before doing anything at the 10M scale.
100K
We ingested several randomly selected conversations, used a subset to experiment with prompts and retrieval parameters, and used the rest to check whether the changes transferred. When the results felt stable, we ran a single 100K conversation end to end.
With one-shot retrieval and custom prompts per question type, we reached a score of 0.79, above the reported SOTA of 0.735. The score was stable across as few as four repeated rounds of question answering and evaluation against a fixed ingestion.
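The tune-then-check procedure can be sketched as a simple reproducible split. The `split_conversations` helper and the seed are hypothetical, shown only to illustrate the idea of tuning on one subset and checking transfer on the rest.

```python
import random


def split_conversations(conversation_ids, n_tune, seed=0):
    """Shuffle conversation IDs and split them into tuning and hold-out sets."""
    if not 0 < n_tune < len(conversation_ids):
        raise ValueError("n_tune must leave at least one conversation on each side")
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    shuffled = list(conversation_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_tune], shuffled[n_tune:]
```

Prompts and retrieval parameters would be adjusted only on the first set, and the second set would be used purely to see whether the gains carry over.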
Using a routing strategy that applies different retriever parameters to different question types consistently led to scores above 0.8.
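A routing strategy of this kind can be sketched as a lookup from question type to retriever parameters. Everything below is hypothetical: `RetrieverParams`, the `top_k` and `use_graph` fields, and the values in the table are placeholders, not cognee's configuration or the settings we used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrieverParams:
    top_k: int         # hypothetical: how many items to retrieve
    use_graph: bool    # hypothetical: whether to include graph context


# Fallback for question types without a dedicated entry.
DEFAULT_PARAMS = RetrieverParams(top_k=10, use_graph=True)

# Placeholder routing table; the keys borrow BEAM ability names, the values are invented.
ROUTES = {
    "event_ordering": RetrieverParams(top_k=30, use_graph=True),
    "abstention": RetrieverParams(top_k=5, use_graph=False),
}


def route(question_type: str) -> RetrieverParams:
    """Return the retriever parameters for a question type, or the default."""
    return ROUTES.get(question_type, DEFAULT_PARAMS)
```

The appeal of a table like this is that each question type can be tuned independently while unknown types still fall back to a sensible default.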
10M
With those learnings, we ingested a single 10M-token conversation. As expected, the score dropped with the same settings.
We experimented with retrieval parameters and routing strategies on the 10M conversation, then independently reran the stronger configurations to reduce the chance of reporting a lucky pass.
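One way to guard against a lucky pass is to compare configurations by their average over independent reruns rather than by their best single run. The helpers below are a minimal sketch of that idea under our own naming, not the evaluation code we used.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(scores), stdev(scores)


def pick_config(runs_by_config):
    """Choose the configuration with the best mean score across reruns."""
    return max(runs_by_config, key=lambda name: mean(runs_by_config[name]))
```

A configuration with one high outlier and several weaker runs loses to a configuration that is slightly lower at its peak but consistent.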
We reached scores as high as 0.67, against a reported SOTA of 0.641.
The margin is small, and the number should be taken with a grain of salt. This result involved more parameter exploration on the target conversation than we'd like, which is worth being honest about. We also carried over insights from the 100K experiments and applied them at 10M. However, the goal was not to build something highly customized for one particular benchmark just to show SOTA results. It was to show how easy it is to get very good results using cognee's open-source features.
Our confidence that this (or a marginally lower) score is defensible comes from observing how parameter choices transferred across the smaller ingestions, but we're not trying to overstate the meaning of a single 10M run.
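For reference, the margins implied by the scores quoted in this post can be worked out directly. The snippet uses only the four reported numbers; the `margins` helper itself is just for illustration.

```python
def margins(score, baseline):
    """Return (absolute margin in points, relative margin in percent)."""
    absolute_points = (score - baseline) * 100
    relative_percent = (score - baseline) / baseline * 100
    return round(absolute_points, 1), round(relative_percent, 1)


print(margins(0.79, 0.735))   # 100K: (5.5, 7.5)
print(margins(0.67, 0.641))   # 10M:  (2.9, 4.5)
```

The 10M margin is roughly half the 100K margin in absolute terms, which is part of why we treat it cautiously.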
We also found that multi-turn agentic retrieval strategies pushed the score higher, but that felt outside the spirit of the benchmark. For now, this is as far as we wanted to take this (though we will have more to say about agentic retrieval in other contexts).
What to make of this
BEAM is a principled attempt to evaluate long-form conversational retrieval, and it is clearly resonating with the community. Still, as with any benchmark, using it to assess general quality beyond a narrow scope is often unreliable, and optimizing for it is not guaranteed to translate into real-world improvements.
What we wanted to find out was how open-source cognee would do without being built for it. The results above come from a handful of iterations to understand the benchmark, establish a baseline, and see how far we could get without overfitting. Reaching state-of-the-art in that process was not a dramatic result, but a real and rewarding one.
The numbers, of course, aren't a definitive measure of anything. But they're a decent directional signal that the foundational approach we've taken to AI memory and retrieval is moving in the right direction.
That's the part I find genuinely satisfying: seeing cognee's memory architecture hold up in a benchmark that actually stresses memory.
— Vasilije Marković, Founder and CEO, cognee