Across 107,381 measured prompts, ethics prompts produced materially different answers 64.8% of the time; reasoning prompts, 2.94%. The biggest routing ROI isn't "find the smartest model"; it's "detect when judgment and policy boundaries matter." We built the empirical dataset to show it, and we add to it every fifteen minutes.
Every production LLM stack makes the same naive routing decision: pick one model, use it for everything. That's a 10× overpay on the 70% of prompts where models agree, and a silent quality hit on the 30% where they don't. The data below says exactly which prompts fall in which bucket.
The corpus is produced by a live, 24/7 pipeline across four hardware nodes. Each prompt fans out to 3, 6, or 13 open-source models, and every pair of responses is scored for divergence. New records are being written as you read this — the numbers in the hero update automatically.
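To make the fan-out concrete, here is a minimal sketch of how many response pairs each fan-out size implies when every pair is scored. The model names are placeholders and `response_pairs` is a hypothetical helper, not part of the pipeline.

```python
from itertools import combinations
from math import comb


def response_pairs(model_names):
    """Return every unordered pair of models whose responses would be compared."""
    return list(combinations(model_names, 2))


for fan_out in (3, 6, 13):
    # Placeholder model names; the real roster is not reproduced here.
    models = [f"model_{i}" for i in range(fan_out)]
    pairs = response_pairs(models)
    # The pair count is "n choose 2".
    assert len(pairs) == comb(fan_out, 2)
    print(f"{fan_out} models -> {len(pairs)} scored pairs")
```

A 13-model fan-out therefore scores 78 pairs per prompt, against 3 pairs for a 3-model fan-out.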
Every 15 minutes, the pipeline writes new prompt-response shards to disk. The counters below refresh from the live corpus. Below them: today's model roster, the families we're onboarding next, and the scale we're investing in hardware to reach. We are not capped at today's fleet: every notable open-weights release enters the corpus within days of its drop.
Today's 13-model fan-out cap was a first-corpus design choice, not a platform limit. Hardware expands with customer demand — each new notable open-weights release enters the pipeline within a week of its drop. Expansion pace is tracked publicly in the changelog.
Mean pairwise divergence is scored from 0 to 1; higher means the models disagree more. % high-div is the share of prompts crossing the 0.5 threshold. That share is the commercial signal: it marks where model choice actually changes the answer.
| Category | N | Mean divergence | % high-div |
|---|---|---|---|
| Ethics | 15,266 | 0.527 | 64.80% |
| Persuasion | 15,243 | 0.415 | 47.99% |
| Meta (self-reflection) | 15,270 | 0.432 | 43.84% |
| Adversarial | 15,292 | 0.329 | 27.60% |
| Factual | 15,222 | 0.318 | 15.70% |
| Emotional | 15,271 | 0.177 | 0.01% |
| Reasoning | 15,262 | 0.138 | 2.94% |
Full analysis: divergence corpus analysis. Data regenerated daily. Pair-level disagreement tables ship in schema v2.2.
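The two table metrics can be sketched as follows. The page does not describe how a pair of responses is actually scored, so `jaccard_distance` below is only a stand-in scorer chosen for illustration. Treating "crossing" the 0.5 threshold as strictly greater than 0.5 is also an assumption.

```python
from itertools import combinations
from statistics import mean

HIGH_DIV_THRESHOLD = 0.5  # threshold quoted for the % high-div column


def jaccard_distance(a, b):
    """Stand-in pair scorer (assumption): 1 minus token-set overlap, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_pairwise_divergence(responses, pair_score=jaccard_distance):
    """Average the pair score over every unordered pair of responses."""
    if len(responses) < 2:
        raise ValueError("need at least two responses to form a pair")
    return mean(pair_score(a, b) for a, b in combinations(responses, 2))


def high_div_share(prompt_scores, threshold=HIGH_DIV_THRESHOLD):
    """Percentage of prompts whose score is strictly above the threshold."""
    if not prompt_scores:
        raise ValueError("no prompt scores supplied")
    above = sum(1 for score in prompt_scores if score > threshold)
    return 100.0 * above / len(prompt_scores)
```

Passing a different `pair_score` callable swaps the scorer without touching the aggregation, which is the part the table definitions actually pin down.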
We built the dataset first because it's a standalone product that alignment teams and RLHF shops will pay for before anyone has heard of us. The router is the higher-leverage wedge but it only works because the dataset underneath it is real.
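As an illustration of how a router could lean on the dataset, the sketch below routes by category using the % high-div figures from the table. It is not the product's router: `choose_strategy`, the 25% cutoff, and the two strategy labels are all hypothetical.

```python
# % high-div per category, copied from the divergence table.
HIGH_DIV_RATE = {
    "ethics": 64.80,
    "persuasion": 47.99,
    "meta": 43.84,
    "adversarial": 27.60,
    "factual": 15.70,
    "emotional": 0.01,
    "reasoning": 2.94,
}


def choose_strategy(category, cutoff_pct=25.0):
    """Pick 'fan_out' where models often disagree, else 'single_model'.

    cutoff_pct is an illustrative value, not a recommendation from the data.
    """
    rate = HIGH_DIV_RATE.get(category.lower())
    if rate is None:
        # Unknown category: take the conservative path and compare models.
        return "fan_out"
    return "fan_out" if rate >= cutoff_pct else "single_model"


for name in ("ethics", "factual", "reasoning"):
    print(name, "->", choose_strategy(name))
```

A real router would also need a classifier to assign each incoming prompt to a category, which this sketch takes as given.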
`pip install archon-memory-core`

The dataset is delivered by private GitHub repository invite on payment. Daily snapshots ship as JSONL; full-corpus snapshots ship as a single compressed archive plus a signed download URL. Every record includes prompt, multi-model responses, divergence scores, category + archetype metadata, and node provenance. Schema is versioned (v2.1 today; v2.2 ships pair-level disagreement).
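A daily JSONL snapshot could be consumed along these lines. The field names (`prompt`, `category`, `responses`, `mean_divergence`) and the two sample records are invented for illustration; the real v2.1 schema may name and nest things differently.

```python
import io
import json
from collections import defaultdict

# Made-up sample records in a hypothetical layout, one JSON object per line.
SAMPLE_JSONL = """\
{"prompt": "p1", "category": "ethics", "responses": {"model_a": "x", "model_b": "y"}, "mean_divergence": 0.61}
{"prompt": "p2", "category": "reasoning", "responses": {"model_a": "z", "model_b": "z"}, "mean_divergence": 0.08}
"""


def iter_records(stream):
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def mean_divergence_by_category(records):
    """Group records by category and average their divergence scores."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["category"]].append(record["mean_divergence"])
    return {cat: sum(vals) / len(vals) for cat, vals in grouped.items()}


summary = mean_divergence_by_category(iter_records(io.StringIO(SAMPLE_JSONL)))
print(summary)
```

For a real snapshot, replace the `io.StringIO` stream with an open file handle; the generator reads line by line, so the full file never has to fit in memory.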
Tell us which tier fits and how you'd use the data. We reply within 48 hours with a Stripe payment link (paid tiers) or a license agreement (academic). Data delivery is by private GitHub repo invite — no credential handoff required on your end.
Questions before you request? Email [email protected] — we're responsive. No sales team, no drip campaign.