Jun 26, 2026
4 min read

Behind the Viral Numbers: How We Got 7x Cheaper and 145% Better

Vasilije Markovic, Co-Founder / CEO

TL;DR: Our LinkedIn and X videos put two numbers on screen — 7x cheaper than chat and 145% better than the best alternative. Here's exactly where each one came from, so you can check the math yourself.

The videos are short by design, but the numbers behind them aren't arbitrary, so this post unpacks each one briefly.

"7x cheaper than chat"

This is the cost number, and it depends on how many questions you ask over the same corpus, not on any single query.

Full-context retrieval — the "just put everything in context every time" approach — re-stuffs the entire corpus into the model on every single question, so its token cost grows linearly with the number of questions. Persistent memory works differently: it pays a one-time cost up front to build memory with remember(), then each recall() query pulls only a small, relevant slice of context. Memory therefore starts out behind, and catches up as queries accumulate.
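A minimal sketch of this cost model, assuming both approaches cost tokens linearly per question. The function name, parameters, and numbers below are hypothetical placeholders chosen for illustration, not cognee's measured values, and the sketch does not call cognee's remember() or recall().

```python
def cumulative_tokens(num_queries: int, corpus_tokens: int,
                      ingest_tokens: int, recall_tokens: int) -> tuple[int, int]:
    """Return (full_context_total, memory_total) after num_queries questions.

    Hypothetical linear model for illustration only.
    """
    # Full context re-sends the whole corpus with every question.
    full_context = num_queries * corpus_tokens
    # Memory pays a one-time build cost, then a small retrieval slice per question.
    memory = ingest_tokens + num_queries * recall_tokens
    return full_context, memory


if __name__ == "__main__":
    # Placeholder values: 100k-token corpus, 2.4M tokens to build memory, 2k per recall.
    for q in (1, 10, 50, 200):
        full, mem = cumulative_tokens(q, 100_000, 2_400_000, 2_000)
        print(f"{q:>4} queries  full={full:>11,}  memory={mem:>11,}")
```

With these placeholder numbers, memory is far behind after one question and well ahead by 50; where exactly the two totals cross is the break-even question.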

Our token report measures exactly where the lines cross. Break-even — where building memory once becomes cheaper than re-sending the corpus — lands at roughly 23–26 repeated queries over a stable corpus. Past that point, every additional question widens the gap: full-context keeps paying for the whole corpus, while memory keeps paying for a small retrieval.
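Under the same linear assumption, the break-even point has a closed form. Here C is the corpus size in tokens (re-sent on every full-context question), I is the one-time ingestion cost, and r is the tokens per recall. These symbols are ours, and the token report's cost model may include terms this sketch leaves out.

```latex
\begin{aligned}
T_{\text{full}}(q) &= qC, \qquad T_{\text{mem}}(q) = I + qr \\
I + q^{*}r = q^{*}C \;&\Longrightarrow\; q^{*} = \frac{I}{C - r} \quad (r < C) \\
T_{\text{full}}(q) - T_{\text{mem}}(q) &= (q - q^{*})(C - r)
\end{aligned}
```

The last line is why the gap keeps widening: past q*, each extra question adds another C − r tokens of savings.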

So when do you reach 7x cheaper — memory spending only ~14.4% of the tokens? That's a milestone further along the same curve. In our measurements it arrives at roughly 170+ queries over a 100k-token corpus, sooner for larger corpora and later for smaller or more information-dense ones. The "7x" is what a realistic, repeated-query workload over a stable corpus converges to — not a claim about a single question.
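As a back-of-envelope check (not the token report's method), the same linear model gives the query count at which memory uses 1/k of the full-context tokens. It assumes the ingestion cost I can be recovered from the break-even point, and that each recall costs a fixed fraction f = r/C of the corpus; the function name is hypothetical.

```python
def queries_for_savings(breakeven_queries: float, recall_fraction: float,
                        factor: float) -> float:
    """Queries needed before memory uses 1/factor of full-context tokens.

    Hypothetical helper. Assumes T_full = q*C and T_mem = I + q*r,
    with recall_fraction = r / C and I inferred from break-even.
    """
    if not 0 <= recall_fraction < 1 / factor:
        raise ValueError("recall slice must be smaller than corpus / factor")
    # Break-even q0 = I / (C - r) gives I = q0 * C * (1 - f).
    # Savings condition I + q*f*C <= q*C/factor rearranges to
    # q >= q0 * (1 - f) / (1/factor - f).
    return breakeven_queries * (1 - recall_fraction) / (1 / factor - recall_fraction)


if __name__ == "__main__":
    for q0 in (23, 24.5, 26):
        print(q0, round(queries_for_savings(q0, 0.0, 7)))
```

With f = 0 the milestone is simply k times the break-even count, so a break-even of 23 to 26 queries puts 7x at roughly 161 to 182 queries, consistent with the ~170+ figure; any nonzero recall slice pushes it later.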

We've now published a dedicated token report that breaks this down end to end — where the tokens actually go (ingestion vs. retrieval), the cost model, the exact break-even math, and how the gap keeps widening as queries accumulate: Understanding the Token Cost of Persistent AI Memory →.

"145% better than Opus"

This is the quality number, and it comes straight out of our BEAM work.

On BEAM's 100k-token setting:

  • A standard RAG baseline on Llama-4: 32.3%
  • cognee: 79%

That's +46.7 percentage points (79 − 32.3), or 145% better in relative terms (46.7 ÷ 32.3 ≈ 1.45): the number you saw in the video.
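Percentage points and relative improvement are easy to blur, so here is the arithmetic with the two BEAM scores above; the helper name is ours.

```python
def improvement(new_score: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = new_score - baseline
    relative = points / baseline * 100
    return points, relative


if __name__ == "__main__":
    # cognee vs. the standard RAG baseline on Llama-4, BEAM 100k setting
    points, relative = improvement(79.0, 32.3)
    print(f"+{points:.1f} points, {relative:.0f}% relative")  # +46.7 points, 145% relative
```

Equivalently, cognee's score is about 2.45x the baseline's (79 ÷ 32.3), which is the same 145% gain expressed as a ratio.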

"100 billion token window"

This number is about scale, not a single benchmark run. It estimates the token count you'd get by converting the volume of data cognee can ingest into tokens.

cognee ingests data at scale, parallelized and distributed, so it isn't bound by what a single model's context window can hold. In practice it can take in a terabyte of data or more. The 100 billion figure is a conservative reference point on that scale, not a ceiling.

Here's the math, using 500 GB as the worked example:

  • 500 GB ≈ 500 × 10⁹ = 5 × 10¹¹ bytes
  • Clean English text tokenizes at roughly 1 token per 4 characters, which is about 4 bytes per token for ASCII text. For real-world mixed files, with formatting, markup, and structure, a conservative estimate is about 5 bytes per token.
  • 5 × 10¹¹ bytes ÷ 5 bytes/token = 1 × 10¹¹ = 100 billion tokens
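The same conversion as a small function, so the bytes-per-token assumption can be varied. The 5 bytes/token default mirrors the conservative estimate above; the function name is hypothetical, and real token counts depend on the tokenizer and the data.

```python
def estimate_tokens(size_bytes: int, bytes_per_token: float = 5.0) -> int:
    """Rough token count for size_bytes of data at an average bytes-per-token rate."""
    return round(size_bytes / bytes_per_token)


GB = 10**9   # decimal gigabyte, matching the 500 GB ≈ 5 × 10¹¹ bytes step
TB = 10**12  # decimal terabyte

if __name__ == "__main__":
    print(estimate_tokens(500 * GB))       # 100000000000 (conservative mixed-file rate)
    print(estimate_tokens(500 * GB, 4.0))  # 125000000000 (clean-English rate)
    print(estimate_tokens(1 * TB))         # 200000000000
```

At the conservative rate, 1 TB comes out to about 200 billion tokens, which is why 100B reads as a floor rather than a limit.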

So 500 GB of ingested data lands around a 100 billion token equivalent. Since cognee scales past a terabyte, that 100B window is the floor of what it can hold in memory, not the limit.

Where to dig in

The quality side is documented in full in our BEAM write-up — the ingestion, the caveats, and what the results actually mean:

Read the BEAM report →

And the cost side, the "7x cheaper" number, gets its own deep dive in the token report, which walks through exactly how that number is derived: Understanding the Token Cost of Persistent AI Memory →.

— The cognee team
