AMB v2.3 · Published 2026-04-19

Open-source memory layer for agents
that resolves contradictions instead of compounding them.

Python library · Apache 2.0 · pip install archon-memory-core

When a user changes their mind — "My dog's name is Max," then next week "actually it's Milo" — standard retrievers keep both facts and return whichever embedding scores higher. The agent contradicts itself. AMB v2.3 injects this class of contradictory fact daily across a simulated ninety-day horizon and measures which memory systems resolve the contradiction, and which drown in it.
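The failure mode is easy to reproduce. Below is a minimal sketch using a toy bag-of-words similarity as a stand-in for a real embedding model (all names and the scoring function are illustrative, not part of archon-memory-core): a retrieval-only store keeps both versions of the fact and returns whichever scores higher.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" standing in for a real encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# A retrieval-only store keeps both versions of the fact.
memory = [
    "my dog's name is max",
    "actually my dog's name is milo",
]

query = "what is my dog's name"
q = embed(query)
# Top-1 is whichever chunk scores higher, not whichever is current:
# the shorter, stale chunk wins on cosine here.
best = max(memory, key=lambda m: cosine(q, embed(m)))
```

The newer fact never surfaces at top-1 even though it is in the store, which is exactly the class of error the benchmark injects daily.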

99.2% — top-1 at day 90: ours, with consolidation
49.2% — top-1 at day 90: same retriever, no consolidation
0.0% — every context-dump and word-overlap baseline; never recovers
Install
$ pip install archon-memory-core
Apache-2.0 · Python 3.10+ · Linux & macOS (arm64/x86_64)
AMB v2.3 — Memory accuracy across a simulated 90-day horizon. archon-memory-core with consolidation holds at 99.2%; retrieval-only collapses to 49.2% after 13 days of daily confuser injection; LangChain 32,000-token buffer has the answer in context but its top-ranked chunk is noise — 0.0% top-1 throughout; naive word-overlap flat at 0%.
Top-1 accuracy over 90 simulated days · averaged across 3 independent runs · 250 queries × 2,300 confusers injected daily · Mem0 adapter pending
The benchmark

A realistic memory, interrogated honestly.

AMB v2.3 injects semantically confusable distractors alongside every query. Top-1 accuracy is the primary metric — it matches what a real LLM does when generating from retrieved context. Every adapter runs under a real token budget with FIFO eviction. No "unlimited context" fakery.
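The token-budget rule can be sketched as follows. This is an assumption about the harness's accounting, not its actual code: a whitespace token counter stands in for a real tokenizer, and the function name is hypothetical.

```python
from collections import deque

def fit_to_budget(chunks, token_budget, count_tokens=lambda s: len(s.split())):
    """FIFO eviction: keep the newest chunks that fit, drop the oldest.

    `chunks` is ordered oldest -> newest; the whitespace token counter
    is a stand-in for a real tokenizer.
    """
    kept, used = deque(), 0
    for chunk in reversed(chunks):      # walk newest -> oldest
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            break                       # oldest entries fall off first
        kept.appendleft(chunk)
        used += cost
    return list(kept)
```

Under a tight budget, old facts are evicted whether or not they have been superseded, which is why a budget alone does not resolve contradictions.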

By day 14, two weeks of daily contradictions have accumulated in the index. Retrieval-only memory drops from 70% top-1 to 49% and never recovers — the old facts outweigh the new ones at retrieval time. Supersede-aware consolidation archives the old fact when a new one contradicts it, and holds at 99.2% through day 90. Every context-dump and word-overlap baseline sits at 0.0% from day 0 onward — they never resolve a single contradiction.
System                   Mode                      quality_v2.3   top-1 @ 90d   any @ 90d   Confuser resist
archon-memory-core       tuned · consolidator      0.807          99.2%         99.2%       0.064
archon-memory-core       stock · retrieval only    0.658          49.2%         86.4%       0.252
LangChain context-dump   tuned · 32k budget        0.471          0.0%          56.0%       0.000
LangChain context-dump   stock · 8k budget         0.221          0.0%          0.0%        0.104
Naive append-only        word-overlap retrieval    0.282          0.0%          0.8%        0.406

Mean across seeds 42/43/44 · std 0.000 on every cell (deterministic corpus structure). The quality_v2.3 lift is roughly flat across a 60× scale range (+0.61 → +0.525). Full methodology, preregistered protocol, and per-seed results: v2.3 large-scale STATUS.

How it works

Retrieve, consolidate, compound.

Memory isn't storage. An append-only store degrades as contradictions accumulate. Memory has to forget the right things, consolidate the useful things, and refuse to let contradictions coexist.

1

Retrieve

Embedding search with type-weighted salience on top. Credentials never decay. Session notes expire. Ranking considers persistence class and explicit priority, not embedding cosine alone.
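That ranking rule can be sketched like this. The half-life values, class names, and function signature are illustrative assumptions, not the library's actual defaults:

```python
# Half-lives per persistence class, in days (illustrative values).
# None means the memory never decays (e.g. credentials).
HALF_LIFE_DAYS = {
    "credential": None,
    "fact": 90.0,
    "session_note": 7.0,
}

def salience(similarity: float, mem_type: str, age_days: float,
             priority: float = 1.0) -> float:
    """Rank by similarity, persistence class, and explicit priority,
    not embedding cosine alone."""
    half_life = HALF_LIFE_DAYS[mem_type]
    decay = 1.0 if half_life is None else 0.5 ** (age_days / half_life)
    return similarity * decay * priority
```

At equal similarity, a year-old credential still outranks a week-old session note, because the credential's class never decays.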

2

Consolidate

A nightly pass clusters chunks by entity co-occurrence, compresses via a local LLM, and resolves contradictions toward the newer truth. Supersede-aware: the old fact is archived, not left to compete.
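The supersede rule itself is small; a sketch under assumptions (the record shape and names are hypothetical, and the clustering and LLM compression steps are elided):

```python
def resolve_contradictions(records):
    """Keep the newest value per subject; archive the rest, don't delete.

    Each record is (timestamp, subject, value). Archived facts stop
    competing at retrieval time but remain inspectable.
    """
    active, archived = {}, []
    for record in sorted(records):           # oldest -> newest
        _, subject, _ = record
        if subject in active:
            archived.append(active[subject]) # superseded fact steps aside
        active[subject] = record
    return active, archived
```

Run on the dog-name example, the old fact moves to the archive and only "Milo" stays active, so retrieval has exactly one candidate per subject.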

3

Compound

Each cycle archives superseded facts and compresses redundant clusters. Top-1 retrieval improves as the corpus grows instead of degrading. The benchmark above confirms this — at 250-query scale, nothing has drifted.

Submit to the leaderboard

Run your memory system against AMB v2.3.

Mem0, MemGPT, Letta, Zep, pgvector pipelines — anyone. Same harness, same hardware spec, same scenario set. Fill in the form and we'll reply within 48 hours with the eval package and run instructions.

Submission spec · We reply within 48 hours.