When a user changes their mind — "My dog's name is Max," then next week "actually it's Milo" — standard retrievers keep both facts and return whichever embedding scores higher. The agent contradicts itself. AMB v2.3 injects this class of contradictory fact daily across a simulated ninety-day horizon and measures which memory systems resolve the contradiction, and which drown in it.
AMB v2.3 injects semantically confusable distractors alongside every query. Top-1 accuracy is the primary metric — it matches what a real LLM does when generating from retrieved context. Every adapter runs under a real token budget with FIFO eviction. No "unlimited context" fakery.
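The token-budget constraint can be sketched as follows. This is a minimal illustration of FIFO eviction under a fixed budget, not the AMB harness API; the class name, the word-count tokenizer, and the budget size are all stand-ins.

```python
from collections import deque

class TokenBudgetStore:
    """Illustrative fixed-budget store: oldest entries evicted first."""

    def __init__(self, budget_tokens: int):
        self.budget = budget_tokens
        self.entries = deque()          # (text, token_count), oldest at the left
        self.used = 0

    def add(self, text: str) -> None:
        tokens = len(text.split())      # stand-in for a real tokenizer
        self.entries.append((text, tokens))
        self.used += tokens
        while self.used > self.budget:  # FIFO eviction: drop the oldest entry
            _, evicted = self.entries.popleft()
            self.used -= evicted

store = TokenBudgetStore(budget_tokens=8)
store.add("my dog's name is Max")                 # 5 tokens
store.add("actually the dog's name is Milo now")  # 7 tokens -> oldest evicted
```

Under a budget this tight, the older fact is simply gone; nothing in a plain FIFO store knows the two facts were related, which is exactly the failure mode the benchmark probes.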
By day 14, two weeks of daily contradictions have accumulated in the index. Retrieval-only memory drops from 70% top-1 to 49% and never recovers — the old facts outweigh the new ones at retrieval time. Supersede-aware consolidation archives the old fact when a new one contradicts it, and holds at 99.2% through day 90. Every context-dump and word-overlap baseline sits at 0.0% from day 0 onward — they never resolve a single contradiction.
| System | Mode | quality_v2.3 | Top-1 @ 90d | Any-hit @ 90d | Confuser resistance |
|---|---|---|---|---|---|
| archon-memory-core | tuned · consolidator | 0.807 | 99.2% | 99.2% | 0.064 |
| archon-memory-core | stock · retrieval only | 0.658 | 49.2% | 86.4% | 0.252 |
| LangChain context-dump | tuned · 32k budget | 0.471 | 0.0% | 56.0% | 0.000 |
| LangChain context-dump | stock · 8k budget | 0.221 | 0.0% | 0.0% | 0.104 |
| Naive append-only | word-overlap retrieval | 0.282 | 0.0% | 0.8% | 0.406 |
Mean across seeds 42/43/44 · std 0.000 on every cell (deterministic corpus structure). The quality_v2.3 lift holds across a 60× scale range (+0.61 at the small end, +0.525 at the large). Full methodology, preregistered protocol, and per-seed results: v2.3 large-scale STATUS.
Memory isn't storage. Storage degrades. Memory has to forget the right things, consolidate the useful things, and refuse to let contradictions coexist.
Embedding search with type-weighted salience on top. Credentials never decay. Session notes expire. Ranking considers persistence class and explicit priority, not embedding cosine alone.
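One way to picture that ranking: combine the cosine score with a per-type weight, a per-type half-life, and an explicit priority. The classes, weights, and half-lives below are illustrative assumptions, not archon-memory-core's actual configuration.

```python
# Hypothetical persistence classes. A half-life of None means the item
# never decays (credentials); session notes decay fast.
HALF_LIFE_DAYS = {"credential": None, "fact": 90.0, "session_note": 2.0}
TYPE_WEIGHT = {"credential": 1.5, "fact": 1.0, "session_note": 0.6}

def salience(cosine: float, kind: str, age_days: float, priority: float = 0.0) -> float:
    """Rank score = cosine x type weight x time decay + explicit priority."""
    half_life = HALF_LIFE_DAYS[kind]
    decay = 1.0 if half_life is None else 0.5 ** (age_days / half_life)
    return cosine * TYPE_WEIGHT[kind] * decay + priority

# A stale session note loses to a fact with the same cosine score:
assert salience(0.9, "session_note", age_days=10) < salience(0.9, "fact", age_days=10)
# Credentials never decay:
assert salience(0.9, "credential", age_days=365) == salience(0.9, "credential", age_days=0)
```

The point of the extra terms is that two chunks with identical embeddings can still rank very differently depending on what kind of memory they are and how old they are.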
A nightly pass clusters chunks by entity co-occurrence, compresses via a local LLM, and resolves contradictions toward the newer truth. Supersede-aware: the old fact is archived, not left to compete.
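The supersede step can be reduced to a toy model. This sketch assumes each fact is keyed by an (entity, attribute) pair; the real pass clusters by entity co-occurrence and resolves with a local LLM, but the invariant is the same: newer wins, and the loser is archived rather than deleted.

```python
from dataclasses import dataclass, field

@dataclass
class FactStore:
    """Toy supersede-aware store: one live value per (entity, attribute)."""
    live: dict = field(default_factory=dict)     # (entity, attr) -> (value, t)
    archive: list = field(default_factory=list)  # superseded facts, kept for audit

    def assert_fact(self, entity: str, attr: str, value: str, t: int) -> None:
        key = (entity, attr)
        if key in self.live:
            old_value, old_t = self.live[key]
            if old_t > t:                        # stale write: archive the incoming fact
                self.archive.append((key, value, t))
                return
            self.archive.append((key, old_value, old_t))  # archive the old truth
        self.live[key] = (value, t)

store = FactStore()
store.assert_fact("dog", "name", "Max", t=1)
store.assert_fact("dog", "name", "Milo", t=8)   # contradiction resolves to the newer fact
```

After the second call, retrieval only ever sees "Milo"; "Max" sits in the archive instead of competing at query time.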
Each cycle archives superseded facts and compresses redundant clusters, so top-1 retrieval improves rather than degrades as the corpus grows. The benchmark above confirms this: at 250-query scale, nothing has drifted.
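The redundant-cluster compression can be sketched with a cheap similarity stand-in. Here Jaccard word overlap plays the role of the embedding similarity the real pass would use, and the threshold is an assumed value.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity: a cheap stand-in for embedding cosine."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def compact(chunks: list[str], threshold: float = 0.8) -> list[str]:
    """Keep one representative per near-duplicate cluster."""
    kept: list[str] = []
    for chunk in chunks:
        if all(jaccard(chunk, k) < threshold for k in kept):
            kept.append(chunk)       # first occurrence is the representative
    return kept

chunks = [
    "Milo gets a walk at 7am every day",
    "milo gets a walk at 7am every day",   # near-duplicate, compressed away
    "Milo is allergic to peanuts",
]
compacted = compact(chunks)
```

Compaction is why the index can get *better* with age: each pass removes redundancy that would otherwise dilute the ranking.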
Mem0, MemGPT, Letta, Zep, pgvector pipelines — anyone. Same harness, same hardware spec, same scenario set. Fill in the form and we'll reply within 48 hours with the eval package and run instructions.