Evaluation Results

📄 Get the research paper

We present a practical approach to optimizing the interface between knowledge graphs and LLMs for complex reasoning. The paper reports systematic hyperparameter optimization using Cognee's modular framework across multiple QA benchmarks, with reproducible settings.

AI Memory Benchmark Results

Understanding how different AI memory systems retain and use context across interactions is crucial for LLM performance.

We updated our benchmark to include a comprehensive evaluation of the Cognee AI memory system alongside LightRAG, Mem0, and Graphiti (previous result).

This page provides a detailed comparison of performance metrics to help developers select the best AI memory solution for their applications.

Key performance metrics

Results for cognee

0.93

Human-like correctness

0.85

DeepEval correctness

0.84

DeepEval F1

0.69

DeepEval EM

Benchmark comparison

Optimized Cognee configurations

Cognee Graph Completion with chain-of-thought (CoT) shows significant performance improvements over the previous non-optimized version:

Human-like Correctness: +25% (0.738 → 0.925)
DeepEval Correctness: +49% (0.569 → 0.846)
DeepEval F1: +314% (0.203 → 0.841)
DeepEval EM: +1618% (0.04 → 0.687)

Comprehensive metrics comparison

Dive deeper

Read Our Blog PostInsights into our evaluation methods, implications for AI development, and a deeper analysis of results.

Access the Benchmark CodeAvailable on GitHub to replicate and validate our evaluations independently.

What's next?

We're expanding these benchmarks: adding datasets, task types, and stronger correctness/faithfulness metrics. We're also evaluating additional memory systems and publishing reproducible configurations.

Questions or need help optimizing your AI system?

Last updated: August 4th, 2025