Jun 26, 2026

12 minutes read

Technical Note: Understanding the Token Cost of Persistent AI Memory

Vasilije Markovic, Co-Founder / CEO

TL;DR: Persistent memory cuts query-time tokens by retrieving a small context instead of resending the whole corpus on every question — but it pays an upfront ingestion cost first. We measure exactly where the tokens go, model the trade-off, and find break-even at roughly 23–26 repeated queries for the corpora tested, after which the gap keeps widening.

Abstract

Persistent AI memory can reduce query-time token usage by replacing repeated full-corpus prompts with retrieval over a reusable memory representation. This reduction comes with an upfront cost: the corpus must first be processed, summarized, and structured during ingestion.

This technical note isolates that trade-off. Using cognee, we compare two strategies for repeated question answering over a stable corpus: sending the corpus with every query, and building memory once before retrieving smaller contexts at query time. The goal is to explain where tokens are spent and measure when the upfront ingestion cost is recovered.

1. Scope

This note considers repeated question answering over a fixed corpus and compares two strategies for providing that corpus to a language model.

The first strategy is full-context prompting. Each query includes the corpus together with the question and a small instruction wrapper. This requires little preprocessing, but repeats most of the same token cost for every query.

The second strategy is persistent memory. The corpus is processed once with cognee.remember() to create a reusable memory representation. Subsequent queries use cognee.recall() to retrieve and pack only the relevant context for the answering model, shifting part of the token cost from query time to ingestion.

The analysis measures the language-model token cost of these two approaches. Specifically, it compares the one-time ingestion cost of building memory with the cumulative cost of repeatedly supplying the full corpus.

This is one way to study the economics of persistent memory rather than a complete model of production systems. In practice, deployments combine multiple techniques—including prompt caching, retrieval, context management, and incremental updates—to reduce inference cost. Those techniques are intentionally left outside the scope of this note so that the contribution of persistent memory can be examined in isolation.

2. Where the Tokens Go

The token cost measured in this note comes from two operations exposed by cognee:

cognee.remember(), which builds persistent memory from a corpus.
cognee.recall(), which retrieves context from that memory.

The majority of language-model tokens are consumed during cognee.remember(). This is expected: before a corpus can be queried efficiently, it must first be analyzed and transformed into a reusable representation.

2.1 Memory construction

During ingestion, documents are split deterministically into chunks. Each chunk is then processed by two independent LLM calls.

The first generates a summary that becomes part of the stored memory. The second extracts typed entities and relationships using a structured Pydantic schema, producing graph data that can be stored directly.

Because both operations are performed for every chunk, their token cost grows approximately linearly with corpus size and dominates the one-time cost of building memory.
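The per-chunk structure described above can be sketched as a small accounting model. This is an illustrative sketch, not cognee's implementation: `ChunkCost` and `ingestion_tokens` are hypothetical names, and the token figures are placeholders rather than measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkCost:
    """Token usage for one chunk (hypothetical structure, for illustration only)."""
    summary_prompt: int
    summary_output: int
    graph_prompt: int   # assumed to include the schema sent with the request
    graph_output: int

    @property
    def total(self) -> int:
        return (self.summary_prompt + self.summary_output
                + self.graph_prompt + self.graph_output)


def ingestion_tokens(chunks: list[ChunkCost]) -> int:
    """One-time cost: two LLM calls per chunk, summed over all chunks."""
    return sum(chunk.total for chunk in chunks)


# Placeholder numbers, not measurements. Every chunk costs the same here,
# so doubling the number of chunks doubles the ingestion cost.
chunk = ChunkCost(summary_prompt=300, summary_output=150,
                  graph_prompt=600, graph_output=700)
small = ingestion_tokens([chunk] * 10)
large = ingestion_tokens([chunk] * 20)
print(small, large)
```

Real chunks vary in size and density, which is why the growth is only approximately linear.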

2.2 Retrieval

Once memory has been built, cognee.recall() assembles a compact context for the answering model.

Rather than returning only raw text, the retrieved context combines relevant source chunks, summaries, extracted entities, and facts derived from graph relationships. Under a fixed retrieval configuration, the size of this context remains comparatively stable even as the corpus grows.

The measurements in the following sections compare the one-time cost of constructing this representation with the cumulative cost of repeatedly supplying the original corpus.

3. Cost model

The previous section showed where the tokens are spent. We can now describe the cumulative cost of each strategy with a simple model.

For full-context prompting, the same corpus is sent with every query, so the cumulative token cost grows linearly with the number of questions.

full_context_cost(queries) =
    queries × (corpus_tokens + query_overhead)

where:

corpus_tokens is the size of the corpus in tokens,
query_overhead is the token count of the instruction wrapper and question,
queries is the number of queries.

For persistent memory, the corpus is processed only once. The cumulative cost is therefore the one-time ingestion cost plus the retrieved context for each query.

memory_cost(queries) =
    ingestion_tokens + queries × retrieved_context_tokens

where:

ingestion_tokens is the total token cost of cognee.remember(),
retrieved_context_tokens is the mean size, in tokens, of the context returned by cognee.recall().

The break-even point is the number of queries at which the cumulative token cost of the two approaches becomes equal.

break_even_queries =
    ingestion_tokens /
    (corpus_tokens + query_overhead - retrieved_context_tokens)

Each quantity in these equations is measured directly during the experiments rather than estimated. Embeddings are computed locally and are therefore excluded from the reported language-model token counts.
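For readers who want the intermediate step, the break-even formula follows from setting the two cumulative costs equal and solving for the number of queries. The shorthand symbols F, r, I, and q are introduced here only for this derivation.

```latex
\begin{align*}
F &= \text{corpus\_tokens} + \text{query\_overhead}, \qquad
r = \text{retrieved\_context\_tokens}, \qquad
I = \text{ingestion\_tokens} \\
q\,F &= I + q\,r \\
q\,(F - r) &= I \\
q^{*} &= \frac{I}{F - r} \qquad (F > r)
\end{align*}
```

The condition F > r matters: if the retrieved context were as large as the full-context prompt, memory would never recover its ingestion cost under this model.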

4. Experimental Setup

A deterministic synthetic corpus is used so that every experiment can be reproduced exactly.

The corpus is intentionally information-dense: each record packs many entities and relationships into a small amount of text, which stresses the graph-extraction step of ingestion. To quantify this, we sampled graph-extraction output across corpora using 30 chunks per corpus, measured with both openai/gpt-5-mini and anthropic/claude-sonnet-4-5. On ordinary prose, represented by excerpts from War and Peace, graph extraction produced about 1.2 output tokens per input token. On the dense synthetic records used here, it produced roughly 8–10 output tokens per input token. The ingestion costs reported below should therefore be read in the context of this deliberately dense corpus: they reflect a setting where many structured facts are available to extract from each source token. Less dense text would generally produce less graph output per input token, reducing ingestion cost and moving break-even earlier.

The corpus is processed once with cognee.remember(). Every language-model call made during ingestion is instrumented and its prompt and completion tokens recorded.
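One way to picture this instrumentation is a ledger that accumulates prompt and completion tokens per operation. The sketch below is an assumption about how such bookkeeping could look, not the instrumentation used for this note. `TokenLedger` is a hypothetical helper, and the counts are placeholders that in a real run would presumably come from the usage data returned with each model response.

```python
from collections import defaultdict
from typing import Optional


class TokenLedger:
    """Accumulates prompt and completion tokens per operation (hypothetical helper)."""

    def __init__(self) -> None:
        self._rows = defaultdict(lambda: {"calls": 0, "prompt": 0, "completion": 0})

    def record(self, operation: str, prompt_tokens: int, completion_tokens: int) -> None:
        """Add one language-model call to the running totals for an operation."""
        row = self._rows[operation]
        row["calls"] += 1
        row["prompt"] += prompt_tokens
        row["completion"] += completion_tokens

    def total(self, operation: Optional[str] = None) -> int:
        """Total tokens for one operation, or for all operations if none is given."""
        if operation is None:
            rows = list(self._rows.values())
        else:
            rows = [self._rows[operation]] if operation in self._rows else []
        return sum(row["prompt"] + row["completion"] for row in rows)


ledger = TokenLedger()
# Placeholder counts, not measurements.
ledger.record("chunk_summarization", prompt_tokens=335, completion_tokens=183)
ledger.record("graph_extraction", prompt_tokens=675, completion_tokens=655)
ledger.record("graph_extraction", prompt_tokens=670, completion_tokens=650)
print(ledger.total("graph_extraction"), ledger.total())
```

Keeping prompt and completion tokens separate per operation is what makes a breakdown by operation possible in the ingestion tables.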

Queries are then executed with cognee.recall(), measuring the size of the retrieved context before answer generation.

The baseline constructs a prompt containing the complete corpus together with a fixed instruction and the user query. Completion tokens are excluded from the comparison because both approaches ultimately require answer generation; the measurement focuses only on the context delivered to the answering model.

Representative prompts, dataset details, and instrumentation are included in the appendices.

5. Results

Results are reported for two reproducible runs of different corpus sizes, referred to below as the 10k run (16,075 measured corpus tokens, processed with openai/gpt-5-mini) and the 100k run (201,987 measured corpus tokens, processed with anthropic/claude-sonnet-4-5). Each run uses its model's native tokenizer, so absolute token counts are not directly comparable between the two runs; the multipliers and ratios, however, are. Both runs use the high-density corpus described in Section 4, so the costs below reflect that deliberately dense input.

As a reference point for full-context cost at larger scale, we also ran the full-context baseline alone over an ~853k-token corpus (openai/gpt-5.5). There each query carries the entire corpus—about 853k tokens—for a cumulative 17.1M tokens across 20 queries, while still answering all 20 questions correctly. We did not build memory over this corpus, so it illustrates only how full-context cost grows with scale and is not used to compute a break-even.

5.1 Ingestion token usage

The majority of language-model tokens are consumed during cognee.remember(), where every chunk is processed independently for summarization and structured graph extraction.

The following tables break down the measured ingestion cost by operation for each run.

10k run (openai/gpt-5-mini)

Operation	Calls	Prompt tokens (incl. schema)	Output tokens	Total tokens	Share of ingestion
Chunk summarization	214	71,715	39,161	110,876	28.0%
Structured graph extraction	214	144,475	140,151	284,626	72.0%
Health check / other	1	12	1	13	<0.1%
Total	429	216,202	179,313	395,515	100%

100k run (anthropic/claude-sonnet-4-5)

Operation	Calls	Prompt tokens (incl. schema)	Output tokens	Total tokens	Share of ingestion
Chunk summarization	1,751	727,287	485,827	1,213,114	26.0%
Structured graph extraction	1,751	1,301,615	2,144,503	3,446,118	74.0%
Health check / other	1	13	1	14	<0.1%
Total	3,503	2,028,915	2,630,331	4,659,246	100%

As expected, chunk summarization and structured graph extraction account for the overwhelming majority of ingestion tokens, and graph extraction is the single largest contributor (~72–74% of ingestion). Structured graph extraction also carries additional prompt overhead: its prompt token counts include the Pydantic schema that the model's structured output must conform to.
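The totals and shares in the two tables can be recomputed from the per-operation prompt and output counts. The snippet below is a small consistency check that uses only figures copied from the tables; the names `RUNS` and `ingestion_shares` are illustrative.

```python
# (prompt tokens, output tokens) per operation, copied from the two tables above.
RUNS = {
    "10k": {
        "chunk_summarization": (71_715, 39_161),
        "graph_extraction": (144_475, 140_151),
        "other": (12, 1),
    },
    "100k": {
        "chunk_summarization": (727_287, 485_827),
        "graph_extraction": (1_301_615, 2_144_503),
        "other": (13, 1),
    },
}


def ingestion_shares(run):
    """Return (total ingestion tokens, {operation: percent of total})."""
    totals = {op: prompt + output for op, (prompt, output) in run.items()}
    grand_total = sum(totals.values())
    return grand_total, {op: 100 * t / grand_total for op, t in totals.items()}


for name, run in RUNS.items():
    grand_total, percent = ingestion_shares(run)
    print(name, grand_total, {op: round(p, 1) for op, p in percent.items()})
```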

5.2 Ingestion multiplier

To compare runs of different corpus sizes, we express the ingestion cost relative to the size of the source corpus.

Quantity	10k run	100k run
Corpus tokens (`T`)	16,075	201,987
Ingestion tokens (`I`)	395,515	4,659,246
Ingestion multiplier (`I / T`)	24.6×	23.1×

Across the two runs, the ingestion multiplier remains relatively stable (24.6× versus 23.1×) despite a roughly 12.6× increase in corpus size. This suggests that ingestion cost grows approximately linearly with corpus size under the evaluated workflow.
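The multipliers and growth ratios can be reproduced directly from the measured values in the table. The variable names below are illustrative, and the cross-run ratios carry the caveat from Section 5 that the two runs use different tokenizers.

```python
# Measured values copied from the table above.
corpus_tokens = {"10k": 16_075, "100k": 201_987}
ingestion_tokens = {"10k": 395_515, "100k": 4_659_246}

# Ingestion cost expressed in "corpus-equivalents" (I / T).
multiplier = {run: ingestion_tokens[run] / corpus_tokens[run] for run in corpus_tokens}

# How much each quantity grew between the two runs.
corpus_growth = corpus_tokens["100k"] / corpus_tokens["10k"]
ingestion_growth = ingestion_tokens["100k"] / ingestion_tokens["10k"]

for run, value in multiplier.items():
    print(f"{run}: ingestion multiplier = {value:.1f}x")
print(f"corpus grew {corpus_growth:.1f}x, ingestion grew {ingestion_growth:.1f}x")
```

Ingestion growing by about the same factor as the corpus is what a roughly constant multiplier means in practice.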

As noted in Section 4, these multipliers reflect the high-density corpus. On less dense text—where graph extraction produces roughly 1.2 output tokens per input token rather than the 8–10 measured here—the ingestion multiplier, and the break-even points below, would be lower as well.

The ingestion multiplier is also a useful practical metric. It expresses the cost of building memory in units of "corpus-equivalents" and gives an intuitive sense of how many repeated full-context queries are needed before preprocessing begins to pay for itself: in these runs the break-even point in Section 5.4 is close to the ingestion multiplier.

5.3 Retrieval size

Once memory has been built, each query retrieves only a small amount of context compared with the full corpus.

Quantity	10k run	100k run
Queries	20	20
Mean query tokens	5.0	5.4
Mean retrieved context	1,118	1,864
Minimum retrieved context	723	1,076
Maximum retrieved context	1,841	3,855
Corpus size	16,075	201,987

Although the corpus grows by roughly 12.6× between experiments, the mean retrieved context grows by only about 1.7× (from 1,118 to 1,864 tokens) under a fixed retrieval configuration (top_k = 10). As a result, the incremental token cost of each additional query is determined primarily by the retrieval configuration rather than by corpus size.

5.4 Break-even

The cumulative cost curves follow directly from the model introduced in Section 3.

The full-context cost per query is the measured corpus size plus a small fixed wrapper (instruction and question, ~32 tokens), consistent with the baseline instrumentation. The per-query memory cost is the mean retrieved context. Applying the break-even formula from Section 3:

Quantity	10k run	100k run
Corpus tokens	16,075	201,987
Ingestion tokens	395,515	4,659,246
Mean retrieved context	1,118	1,864
Full-context cost per query	16,107	202,019
Break-even queries	~26	~23
Queries for a 7× reduction (example)	~334	~173

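Both runs break even at roughly the same point, approximately 23–26 repeated queries, even though the corpus sizes differ by roughly an order of magnitude. This follows from the cost model: the query overhead and the retrieved context are small relative to the corpus, so the denominator of the break-even formula is close to corpus_tokens, and the break-even point in queries is close to the ingestion multiplier (Section 5.2). Because ingestion cost scales approximately linearly with corpus size, that multiplier, and with it the break-even point, stays roughly constant.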
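The break-even figures in the table can be reproduced by applying the Section 3 formula to the measured values. This is a minimal sketch: the function and variable names are illustrative, and the 32-token overhead is the approximate wrapper size stated above.

```python
def break_even_queries(ingestion_tokens, corpus_tokens, query_overhead,
                       retrieved_context_tokens):
    """Queries at which cumulative memory cost equals cumulative full-context cost."""
    saved_per_query = corpus_tokens + query_overhead - retrieved_context_tokens
    if saved_per_query <= 0:
        raise ValueError("retrieved context is not smaller than the full-context prompt")
    return ingestion_tokens / saved_per_query


QUERY_OVERHEAD = 32  # approximate instruction + question size reported above

runs = {
    "10k": dict(ingestion_tokens=395_515, corpus_tokens=16_075,
                retrieved_context_tokens=1_118),
    "100k": dict(ingestion_tokens=4_659_246, corpus_tokens=201_987,
                 retrieved_context_tokens=1_864),
}

break_even = {
    name: break_even_queries(query_overhead=QUERY_OVERHEAD, **values)
    for name, values in runs.items()
}
for name, queries in break_even.items():
    multiplier = runs[name]["ingestion_tokens"] / runs[name]["corpus_tokens"]
    print(f"{name}: break-even at {queries:.1f} queries "
          f"(ingestion multiplier {multiplier:.1f}x)")
```

Printing the ingestion multiplier next to each result makes the closeness of the two quantities easy to check.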

Past break-even, the gap widens as queries accumulate. As one reference point, we also calculate when the cumulative token cost of memory becomes seven times lower than full-context prompting. Under this model, that point occurs after roughly 173 queries for the 100k run and 334 queries for the 10k run. This is included as an example of how the curves continue to separate after break-even, not as a universal threshold.
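The 7× milestone can be derived the same way as break-even. Setting memory_cost(q) equal to full_context_cost(q) / k and solving for q gives q = ingestion_tokens / (full_cost_per_query / k − retrieved_context_tokens). The sketch below applies this to the measured values; `queries_for_reduction` is an illustrative name, not part of cognee.

```python
from typing import Optional


def queries_for_reduction(ingestion_tokens: float, full_cost_per_query: float,
                          retrieved_context_tokens: float,
                          factor: float) -> Optional[float]:
    """Queries q at which memory_cost(q) == full_context_cost(q) / factor.

    Returns None when the factor is unreachable, i.e. when it is at least
    full_cost_per_query / retrieved_context_tokens (the limit as q grows).
    """
    margin = full_cost_per_query / factor - retrieved_context_tokens
    if margin <= 0:
        return None
    return ingestion_tokens / margin


# Measured values from the break-even table, with factor = 7.
q_10k = queries_for_reduction(395_515, 16_107, 1_118, factor=7)
q_100k = queries_for_reduction(4_659_246, 202_019, 1_864, factor=7)
print(round(q_10k), round(q_100k))

# Upper bound on the achievable reduction for each run under this model.
print(16_107 / 1_118, 202_019 / 1_864)
```

With factor = 1 the function reduces to the break-even formula. The upper bound shows why the smaller corpus needs more queries to reach 7×: its full-context prompt is only about 14 times larger than its retrieved context, so 7× sits much closer to its ceiling.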

These figures come from the high-density corpus described in Section 4. For less dense text, graph extraction produces less structured output per input token, so the ingestion multiplier would generally be lower and the same milestones would be reached with fewer queries.

Across both measured runs, the same pattern emerges: ingestion introduces a large one-time cost, while subsequent queries add only a comparatively small amount of retrieved context. The break-even point depends on the balance between those two quantities.

6. Discussion

The measurements describe one specific cost model for persistent AI memory.

Within this model, preprocessing shifts language-model token usage from query time to ingestion. The dominant ingestion costs come from chunk summarization and structured graph extraction, while query-time cost is determined primarily by the size of the retrieved context.

The reported break-even points should therefore be interpreted within the assumptions of the experiment. Different retrieval strategies, prompt caching, evolving corpora, or different memory construction pipelines will change the absolute numbers, but the same accounting framework can be applied to measure those systems as well.

7. Summary

This note measures the language-model token cost of building and querying persistent memory.

For the evaluated workflow, most tokens are spent once during cognee.remember(), primarily through chunk summarization and structured graph extraction. Subsequent queries reuse that work by retrieving a compact context assembled from stored summaries, entities, relationships, and source text.

Whether preprocessing is beneficial depends on how often the corpus is queried. The measurements presented here provide one quantitative reference point for reasoning about that trade-off.

Appendix A. Dataset

Corpus generation, corpus size, tokenizer, representative records.

Appendix B. Representative Prompts

Chunk summarization
Structured graph extraction
Retrieval output

Appendix C. Detailed Token Accounting

Existing ingestion tables and raw measurements.

Appendix D. Reproduction

Environment, configuration, and measurement methodology.
