Cut Cognee's Vector Memory by 8x with Qdrant's TurboQuant

Jun 1, 2026

6 minutes read

David Myriel, AI Researcher

The community Qdrant adapter for cognee just shipped TurboQuant support. Set one environment variable, run cognify, and every vector cognee stores in Qdrant takes about 8x less space, at a recall cost of roughly one to three percentage points on most embedding models.

It works on any embedding model. There's nothing to retrain, no codebook to ship, no per-dataset tuning. If you've been watching your cognee memory footprint grow, this is the cheapest reduction you can make.

Cognee writes a lot of vectors

A flat RAG pipeline puts one vector in the store per chunk of text. Cognee is denser by design. When you call cognify(), it splits documents into chunks and embeds each one. It pulls entities out of those chunks and embeds them. It extracts the relationships between entities and embeds those too. Then it generates summaries at the chunk and document level and embeds those. Every one of those embeddings lives in the vector store, because every one is a hook the agent uses when reasoning across the graph.

The result is that a document producing three vectors in flat RAG often produces fifteen or more in cognee. That's the architecture working. It's also what makes the memory bill grow faster than people expect.

Figure 1: Same document, very different vector footprint.

FLAT RAG                          COGNEE
--------                          ------

document.pdf                      document.pdf
    |                                 |
    v                                 v
  chunk ---> vector                chunk ---> vector
  chunk ---> vector                    | ---> entity vector
  chunk ---> vector                    | ---> entity vector
                                       | ---> relationship vector
                                       | ---> summary vector
  3 vectors                         chunk ---> vector
                                       | ---> entity vector
                                       | ---> ...

                                  ~15 vectors per document
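
The fan-out in Figure 1 can be sketched as a simple count. The per-chunk numbers below (two entities, one relationship, one summary) are read off the figure for illustration, not measured from a real cognify run, and PerChunkCounts, flat_rag_vectors, and cognee_vectors are hypothetical helper names, not part of cognee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerChunkCounts:
    """Illustrative counts per chunk, read off Figure 1 (assumed, not measured)."""
    entities: int = 2
    relationships: int = 1
    summaries: int = 1


def flat_rag_vectors(chunks: int) -> int:
    # Flat RAG stores exactly one vector per chunk.
    return chunks


def cognee_vectors(chunks: int, per_chunk: PerChunkCounts = PerChunkCounts()) -> int:
    # Each chunk contributes its own vector plus one vector per
    # entity, relationship, and summary derived from it.
    per_chunk_total = (
        1 + per_chunk.entities + per_chunk.relationships + per_chunk.summaries
    )
    return chunks * per_chunk_total


print(flat_rag_vectors(3))  # 3
print(cognee_vectors(3))    # 15
```

Real documents will vary: the entity and relationship counts depend on the content, so treat the multiplier as something to measure on your own data.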

TurboQuant makes each vector 8x smaller

Qdrant 1.18 introduced a quantization method called TurboQuant. The version that matters here: TurboQuant 4-bit makes each stored vector 8x smaller than full precision while keeping recall roughly equal to scalar quantization at 4x. That's half the storage of the most common production setting today, and the agent doesn't notice the difference.

Figure 2: Storage cost per vector. TurboQuant 4-bit is half the size of scalar quantization at comparable recall.

float32 (full)           ############################  1.00x
scalar quant (4x)        #######                       0.25x
TurboQuant 4-bit (8x)    ###                           0.125x

What that looks like at scale: a million 1536-dim float32 vectors take about 6.1 GB at full precision and about 770 MB at TQ4. Same vectors, recall within a few points, one eighth the storage.
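
The arithmetic behind those numbers is short enough to check by hand. This is a minimal sketch that assumes 32 bits per dimension for float32 and exactly 4 bits per dimension for TQ4, and it ignores index structures, payloads, and any per-vector overhead Qdrant may add. The storage_bytes helper is a hypothetical name.

```python
def storage_bytes(num_vectors: int, dim: int, bits_per_dim: float) -> float:
    """Raw vector payload in bytes, ignoring index and metadata overhead."""
    return num_vectors * dim * bits_per_dim / 8


full = storage_bytes(1_000_000, 1536, 32)  # float32
tq4 = storage_bytes(1_000_000, 1536, 4)    # assumed 4 bits per dimension

print(f"full precision: {full / 1e9:.2f} GB")  # 6.14 GB
print(f"TQ4:            {tq4 / 1e6:.0f} MB")   # 768 MB
print(f"ratio:          {full / tq4:.0f}x")    # 8x
```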

Why cognee gets more out of it

Compression is per-vector, but the saving you actually see scales with how many vectors you have. Cognee has more of them, so the absolute saving is bigger.

Take a 50% per-vector saving, which is what TQ4 gives you over scalar quantization. A flat RAG store with fifty thousand vectors frees twenty-five thousand vectors' worth of storage. The same saving on a cognee graph with two hundred and fifty thousand vectors frees a hundred and twenty-five thousand. Same compression rate, five times the absolute payoff.

Figure 3: Same compression rate. More vectors. Bigger absolute payoff.

                per-vector saving  x  vector count  =  total saving
                -----------------------------------------------------
FLAT RAG              50%          x    50,000      ->   25,000 freed
COGNEE GRAPH          50%          x   250,000      ->  125,000 freed

This is why TurboQuant is the first thing worth turning on if you're running cognee at any scale. The architecture you already chose is the one that benefits most from the compression.

Enabling it takes one env var

If you're starting a new graph, set one variable and let cognify run:

QDRANT_QUANTIZATION=tq4

You need your Qdrant server on 1.18 or later, and cognee-community-vector-adapter-qdrant on 0.3.0 or later. From then on, every collection cognee creates comes up quantized.
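
Before the first cognify run it can help to confirm those three preconditions in one place. The sketch below is an assumption-heavy illustration, not part of the adapter: preflight and parse_version are hypothetical helpers, the accepted values are the env var values listed later in this post, and both version strings are passed in by the caller. The adapter version could be read with importlib.metadata.version("cognee-community-vector-adapter-qdrant"); how you obtain the server version is left to you.

```python
import os

MIN_QDRANT = (1, 18)
MIN_ADAPTER = (0, 3, 0)
KNOWN_SETTINGS = {"tq4", "tq2", "tq1.5", "tq1"}


def parse_version(text: str) -> tuple:
    # "1.18.2" -> (1, 18, 2); a leading "v" is tolerated.
    # Pre-release suffixes such as "-rc1" are not handled in this sketch.
    return tuple(int(part) for part in text.lstrip("v").split("."))


def preflight(qdrant_version: str, adapter_version: str, env=os.environ) -> list:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    setting = env.get("QDRANT_QUANTIZATION")
    if setting not in KNOWN_SETTINGS:
        problems.append(
            f"QDRANT_QUANTIZATION={setting!r} is not one of {sorted(KNOWN_SETTINGS)}"
        )
    if parse_version(qdrant_version) < MIN_QDRANT:
        problems.append(f"Qdrant {qdrant_version} is older than 1.18")
    if parse_version(adapter_version) < MIN_ADAPTER:
        problems.append(f"adapter {adapter_version} is older than 0.3.0")
    return problems


print(preflight("1.18.0", "0.3.0", env={"QDRANT_QUANTIZATION": "tq4"}))  # []
```

Returning a list rather than raising on the first failure means one run reports everything that needs fixing.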

If you already have a graph and don't want to re-run cognify, the adapter exposes update_quantization. Call it once per collection and Qdrant rebuilds the index in the background:

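import asyncio

from cognee.infrastructure.databases.vector import get_vector_engine


async def main():
    # With the Qdrant adapter configured, this returns the adapter instance.
    adapter = get_vector_engine()

    # One call per existing collection; Qdrant rebuilds in the background.
    for collection in ("Entity_name", "DocumentChunk_text", "TextSummary_text"):
        await adapter.update_quantization(collection)


asyncio.run(main())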

Queries keep working through the rebuild. Qdrant falls back to the full-precision vectors until the quantized index is ready, then switches over.

What you get and what you don't

TurboQuant 4-bit holds recall within roughly one to three percentage points of full precision on most embedding models. Scalar quantization lands in the same range but needs twice the storage. You aren't trading agent quality for memory in any way the agent is likely to notice.

Speed comes with it. Smaller vectors fit in CPU cache more readily, so each search runs a bit faster. For an agent that hits the vector store on every turn, that adds up.

It works on any embedding model. OpenAI, Cohere, local models, anything. There's no codebook to train per dataset, no calibration step to remember, no provider-specific handling.

It is not a magic ratio. The adapter has four settings, with steeper recall hits as you compress harder:

setting               env var    compression   recall hit vs full precision
----------------------------------------------------------------------------
TurboQuant 4-bit      tq4        8x            1-3 pp on most datasets
TurboQuant 2-bit      tq2        16x           5-15 pp
TurboQuant 1.5-bit    tq1.5      ~21x          10-30 pp
TurboQuant 1-bit      tq1        32x           15-32 pp

TQ4 is the default to reach for first. The lower settings exist for cases where memory pressure outweighs recall, and they all hold more recall than binary quantization at the same storage. Past TQ4, picking a setting is a tuning decision you make with your own data, since the recall hit depends heavily on the embedding model.
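
One way to frame that tuning decision is as a recall budget. The sketch below copies the compression factors and the upper end of each recall range from the table above, then picks the most compressed setting whose worst case still fits the budget. SETTINGS and strongest_within_budget are hypothetical names, and the table's ranges are rough guides, so measure on your own data before committing.

```python
# (env var value, compression factor, worst-case recall hit in percentage points)
# Values copied from the table above; recall figures are the upper ends of its ranges.
SETTINGS = [
    ("tq4", 8.0, 3),
    ("tq2", 16.0, 15),
    ("tq1.5", 32 / 1.5, 30),
    ("tq1", 32.0, 32),
]


def strongest_within_budget(max_recall_hit_pp: float):
    """Most compressed setting whose worst-case recall hit fits the budget, or None."""
    candidates = [s for s in SETTINGS if s[2] <= max_recall_hit_pp]
    return max(candidates, key=lambda s: s[1])[0] if candidates else None


print(strongest_within_budget(3))   # tq4
print(strongest_within_budget(20))  # tq2
print(strongest_within_budget(1))   # None
```

Using the worst case of each range is deliberately conservative: on a friendly embedding model a lower setting may fit a budget that this helper rejects.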

Turn it on

Cognee graphs aren't getting smaller. Every cognify run adds more chunks, more entities, more summaries, more relationships, each one another vector you'll be storing for as long as the agent has a memory. TurboQuant 4-bit cuts that storage to one eighth of full precision, or half of scalar quantization, without rewriting anything else.

Install the community Qdrant adapter from PyPI:

pip install "cognee-community-vector-adapter-qdrant>=0.3.0"

New to cognee? Start with the cognee repo for a getting-started walkthrough. New to Qdrant? Run the Docker image locally, or use the free tier on Qdrant Cloud.

Set QDRANT_QUANTIZATION=tq4 before your first cognify call, and your graph comes up quantized.

Further reading: Qdrant 1.18 release notes, TurboQuant in Qdrant deep-dive.
