# memory-benchmark

Here are 13 public repositories matching this topic...

The first open evaluation framework for AI continuity. 250 narrative tests, 1835 verification questions, 10 checkpoints. Benchmark for AI memory systems, stateful agents, and long-term context persistence. No LLM in the evaluation loop.

  • Updated Apr 21, 2026
  • Python

Research-grade evaluation & verification platform for LLM agents, RAG pipelines, and tool-using workflows — grading tool-choice optimality, state-transition correctness, memory hygiene, privilege safety, recovery behavior, and multi-agent coordination beyond answer scoring.

  • Updated Mar 23, 2026
  • Python
