Jun 1, 2026

Cut Cognee's Vector Memory by 8x with Qdrant's TurboQuant

David Myriel, AI Researcher

The community Qdrant adapter for cognee just shipped TurboQuant support. Set one environment variable, run cognify, and every vector cognee stores in Qdrant takes about 8x less space than at full precision, with recall staying within a few percentage points.

It works on any embedding model. There's nothing to retrain, no codebook to ship, no per-dataset tuning. If you've been watching your cognee memory footprint grow, this is the cheapest reduction you can make.

Cognee writes a lot of vectors

A flat RAG pipeline puts one vector in the store per chunk of text. Cognee is denser by design. When you call cognify(), it splits documents into chunks and embeds each one. It pulls entities out of those chunks and embeds them. It extracts the relationships between entities and embeds those too. Then it generates summaries at the chunk and document level and embeds those. Every one of those embeddings lives in the vector store, because every one is a hook the agent uses when reasoning across the graph.

The result is that a document producing three vectors in flat RAG often produces fifteen or more in cognee. That's the architecture working. It's also what makes the memory bill grow faster than people expect.
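
To make the multiplier concrete, here is a rough sketch that tallies vectors per document. The per-category counts are illustrative assumptions, not measurements from cognee; only the categories themselves (chunks, entities, relationships, summaries) come from the description above.

```python
from dataclasses import dataclass


@dataclass
class DocumentVectors:
    """Counts of embedded items for one document (illustrative numbers only)."""
    chunks: int
    entities: int
    relationships: int
    chunk_summaries: int
    document_summaries: int

    def flat_rag_total(self) -> int:
        # Flat RAG stores one vector per chunk and nothing else.
        return self.chunks

    def cognee_total(self) -> int:
        # cognee embeds every category listed above.
        return (self.chunks + self.entities + self.relationships
                + self.chunk_summaries + self.document_summaries)


# Hypothetical counts for a short document.
doc = DocumentVectors(chunks=3, entities=5, relationships=4,
                      chunk_summaries=3, document_summaries=1)
print(doc.flat_rag_total(), doc.cognee_total())
```

Plug in counts from your own graph to see how far your documents sit from the flat RAG baseline.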

Figure 1: Same document, very different vector footprint.

TurboQuant makes each vector 8x smaller

Qdrant 1.18 introduced a quantization method called TurboQuant. The version that matters here: TurboQuant 4-bit (TQ4) makes each stored vector 8x smaller than full precision while keeping recall roughly equal to scalar quantization, which compresses only 4x. That's half the storage of the most common production setting today, and the agent doesn't notice the difference.

Figure 2: Storage cost per vector. TurboQuant 4-bit is half the size of scalar quantization at comparable recall.

What that looks like at scale: a million 1536-dim vectors take about 6 GB at full precision and about 750 MB at TQ4. Same vectors, comparable recall, one eighth the storage.
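
The arithmetic behind those figures can be sketched in a few lines. This assumes full precision means 32-bit floats and scalar quantization means 8 bits per dimension, and it counts raw vector payload only. Any per-vector metadata or index overhead that Qdrant adds is ignored.

```python
def store_size_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Raw vector payload only; ignores index and metadata overhead."""
    return num_vectors * dims * bits_per_dim // 8


N, DIMS = 1_000_000, 1536
full = store_size_bytes(N, DIMS, 32)    # float32, assumed full precision
scalar = store_size_bytes(N, DIMS, 8)   # 8-bit scalar quantization
tq4 = store_size_bytes(N, DIMS, 4)      # TurboQuant 4-bit

for name, size in [("float32", full), ("scalar", scalar), ("tq4", tq4)]:
    print(f"{name:8s} {size / 1e9:.3f} GB  ({full / size:.0f}x smaller)")
```

Swap in your own vector count and embedding dimension to estimate your store.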

Why cognee gets more out of it

Compression is per-vector, but the saving you actually see scales with how many vectors you have. Cognee has more of them, so the absolute saving is bigger.

Take a 50% reduction as an example, which is the step from scalar quantization to TQ4. A flat RAG store with fifty thousand vectors frees twenty-five thousand vectors' worth of storage. The same compression on a cognee graph with two hundred and fifty thousand vectors frees a hundred and twenty-five thousand. Same compression rate, five times the absolute payoff.
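
The comparison is a single multiplication, sketched here with the vector counts from the paragraph above. The helper name is made up for illustration.

```python
def vectors_worth_freed(num_vectors: int, reduction: float) -> float:
    """Storage freed, expressed as a number of uncompressed vectors."""
    return num_vectors * reduction


flat_rag = vectors_worth_freed(50_000, 0.5)
cognee_graph = vectors_worth_freed(250_000, 0.5)

# Same rate, different absolute saving.
print(flat_rag, cognee_graph, cognee_graph / flat_rag)
```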

Figure 3: Same compression rate. More vectors. Bigger absolute payoff.

This is why TurboQuant is the first thing worth turning on if you're running cognee at any scale. The architecture you already chose is the one that benefits most from the compression.

Enabling it takes one env var

If you're starting a new graph, set QDRANT_QUANTIZATION=tq4 in the environment and let cognify run.
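
As a minimal sketch, the variable can also be set from Python before cognee creates any collection. The name and value, QDRANT_QUANTIZATION=tq4, are the ones given in this post. The assumption here is that the adapter reads the variable from the process environment when it creates a collection, so set it before that point.

```python
import os

# Name and value as given in this post. Set this before cognee creates
# any Qdrant collection so new collections pick it up.
os.environ["QDRANT_QUANTIZATION"] = "tq4"

# From here, ingest and build the graph as usual, for example with
# cognee.add(...) followed by cognee.cognify() inside an async function.
print(os.environ["QDRANT_QUANTIZATION"])
```

Exporting the variable in your shell or deployment config has the same effect and keeps it out of application code.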

You need your Qdrant server on 1.18 or later, and cognee-community-vector-adapter-qdrant on 0.3.0 or later. From then on, every collection cognee creates comes up quantized.
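
If you want to verify the adapter requirement from code, one possible check uses the standard library's package metadata. This is a sketch, not part of the adapter: the helper names are made up, and the version parser handles only simple dotted versions. The Qdrant server version has to be checked separately.

```python
import itertools
from importlib import metadata

PACKAGE = "cognee-community-vector-adapter-qdrant"
MINIMUM = (0, 3, 0)


def parse_version(text: str) -> tuple:
    """Turn '0.3.1' into (0, 3, 1), keeping only leading digits of each part."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # pad '1.18' to (1, 18, 0) so comparisons line up
    return tuple(parts)


def adapter_is_new_enough() -> bool:
    """True if the adapter is installed at 0.3.0 or later."""
    try:
        installed = metadata.version(PACKAGE)
    except metadata.PackageNotFoundError:
        return False
    return parse_version(installed) >= MINIMUM


print(adapter_is_new_enough())
```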

If you already have a graph and don't want to re-run cognify, the adapter exposes update_quantization. Call it once per collection and Qdrant rebuilds the index in the background.
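
The shape of that loop might look like the sketch below. Only the method name update_quantization comes from this post. The argument list, the async call style, and the collection names are all assumptions, and a stub stands in for the real adapter so the example runs without a Qdrant server. Check the adapter's README for the actual signature.

```python
import asyncio


class StubAdapter:
    """Stand-in for the real adapter (hypothetical signature, see above)."""

    def __init__(self):
        self.calls = []

    async def update_quantization(self, collection_name, mode):
        # The real method would ask Qdrant to rebuild the collection's index.
        self.calls.append((collection_name, mode))


async def quantize_existing(adapter, collection_names, mode="tq4"):
    # One call per collection, as described above.
    for name in collection_names:
        await adapter.update_quantization(name, mode)


adapter = StubAdapter()
# Collection names are placeholders, not cognee's real collection names.
asyncio.run(quantize_existing(adapter, ["chunks", "entities", "summaries"]))
print(adapter.calls)
```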

Queries keep working through the rebuild. Qdrant falls back to the full-precision vectors until the quantized index is ready, then switches over.

What you get and what you don't

TurboQuant 4-bit holds recall within roughly one to three percentage points of full precision on most embedding models. That's the same range scalar quantization gives you, and scalar quantization needs twice the storage. You aren't trading agent quality for memory. You're getting the memory back without paying for it in quality.
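
One way to check that figure on your own data is recall@k: run the same query against full-precision and quantized collections and measure how much of the exact top-k survives. The sketch below shows the metric only. The result ids are made up for illustration.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k that the quantized search also returned."""
    exact_top = set(exact_ids[:k])
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / k


# Hypothetical result ids for one query.
exact = [11, 42, 7, 93, 5, 18, 64, 30, 2, 77]       # full-precision search
quantized = [11, 42, 7, 5, 93, 18, 64, 30, 77, 81]  # quantized search

print(recall_at_k(exact, quantized, k=10))
```

Average the value over a representative set of queries before drawing conclusions, since a single query says little.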

Speed comes with it. Smaller vectors fit in CPU cache more readily, so each search runs a bit faster. For an agent that hits the vector store on every turn, that adds up.

It works on any embedding model. OpenAI, Cohere, local models, anything. There's no codebook to train per dataset, no calibration step to remember, no provider-specific handling.

It is not a magic ratio. The adapter has four settings, and recall drops more steeply the harder you compress.

TQ4 is the default to reach for first. The lower settings exist for cases where memory pressure outweighs recall, and they all hold more recall than binary quantization at the same storage. Past TQ4, picking a setting is a tuning decision you make with your own data, since the recall hit depends heavily on the embedding model.

Turn it on

Cognee graphs aren't getting smaller. Every cognify run adds more chunks, more entities, more summaries, more relationships, each one another vector you'll be storing for as long as the agent has a memory. TurboQuant at 4 bits hands back roughly half of that storage cost compared with scalar quantization, and seven eighths compared with full precision, without rewriting anything else.

Install the community Qdrant adapter from PyPI with pip install cognee-community-vector-adapter-qdrant.

New to cognee? Start with the cognee repo for a getting-started walkthrough. New to Qdrant? Run the Docker image locally, or use the free tier on Qdrant Cloud.

Set QDRANT_QUANTIZATION=tq4 before your first cognify call, and your graph comes up quantized.


Further reading: Qdrant 1.18 release notes, TurboQuant in Qdrant deep-dive.
