Pipeline live · 4 nodes · 3,190 shards and growing

Open models agree on logic.
They diverge on judgment.

Across 107,381 measured prompts, ethics prompts produced materially different answers 64.8% of the time; reasoning prompts, 2.94%. The biggest routing ROI isn't "find the smartest model"; it's "detect when judgment and policy boundaries matter." We built the empirical dataset, and we add to it every fifteen minutes.

107,381
prompts measured
463,216
model inferences
28.97%
high-divergence
18
open models measured

Not every prompt needs a frontier model.
Some do. The bill depends on telling them apart.

Most production LLM stacks make the same naive routing decision: pick one model, use it for everything. That's a 10× overpay on the roughly 70% of prompts where models agree, and a silent quality hit on the roughly 30% (28.97% in this corpus) where they don't. The data below shows which prompts fall in which bucket.

Pattern 1 · overpay
Convergent prompts
On reasoning and emotional prompts, open 7B models converge within measurement noise of each other (mean divergence 0.138 and 0.177 respectively, each under 3% high-div). You are paying $10/M tokens for an answer a $0.10/M model would have produced. Route these to open-source models. Cost cut: 80–95%.
Pattern 2 · hidden risk
Divergent prompts
On ethics and persuasion prompts, models disagree 48–65% of the time. Which model answers materially changes what your user sees. Route these with care — ideally to an ensemble or frontier model, and log the divergence for review. This is where compliance exposure lives.
Pattern 3 · surprise
Meta-reflection is a tell
The single most divergent prompt in the corpus — repeated across 14 of the top-20 — is "If you could rewrite your own training data, what would you change?" (score 0.902). Self-model probes reliably expose which model you're talking to. Useful for diagnostics, for red-teaming, and for routing decisions.
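As a rough illustration of the overpay arithmetic behind Pattern 1, the sketch below blends the two per-token prices quoted there using the corpus-wide 28.97% high-divergence share. It assumes every prompt consumes the same number of tokens and that every low-divergence prompt can safely move to the cheaper model; neither assumption comes from the corpus, and `blended_cost_per_m` is a hypothetical helper name.

```python
def blended_cost_per_m(frontier_price: float, open_price: float,
                       divergent_share: float) -> float:
    """Cost per million tokens when only divergent prompts go to the frontier model."""
    if not 0.0 <= divergent_share <= 1.0:
        raise ValueError("divergent_share must be between 0 and 1")
    return divergent_share * frontier_price + (1.0 - divergent_share) * open_price


FRONTIER = 10.00   # $/M tokens, price quoted under Pattern 1
OPEN = 0.10        # $/M tokens, price quoted under Pattern 1
HIGH_DIV = 0.2897  # corpus-wide high-divergence share (28.97%)

blended = blended_cost_per_m(FRONTIER, OPEN, HIGH_DIV)
saving = 1.0 - blended / FRONTIER  # fraction saved versus frontier-only routing
print(f"blended: ${blended:.2f}/M tokens, saving vs frontier-only: {saving:.1%}")
```

Under these assumptions the blended price comes out near $2.97 per million tokens, about a 70% saving over sending everything to the frontier model. Real workloads will differ with token mix and with how many convergent prompts still need a ground-truth check.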

Four nodes. Eight categories. Continuous collection.

The corpus is produced by a live, 24/7 pipeline across four hardware nodes. Each prompt fans out to 3, 6, or 13 open-source models, and every pair of responses is scored for divergence. New records are being written as you read this — the numbers in the hero update automatically.

01
🧪
Generate
Prompts drawn from 8 categories — factual, reasoning, adversarial, ethics, persuasion, meta, emotional, and unknown. Each prompt gets an archetype and ground-truth answer (where applicable).
02
🔀
Fan out
Each prompt routes to 3, 6, or 13 open-source models — Qwen, Mistral, Llama, Phi, DeepSeek, Yi, Bonsai. Distributed self-hosted infrastructure runs the fan-out 24/7.
03
📏
Score
Pairwise response divergence computed per prompt — mean, median, stdev. Categories tagged. Outlier models flagged. Schema v2.1 with full run metadata and provenance.
04
📊
Analyze
Continuous aggregation into category-level and model-pair-level stats. Sentinel monitors throughput + quality every 2h. Analysis regenerated daily into category ranking + pair-disagreement tables.
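The Score step can be sketched as follows. The page does not say which divergence metric the pipeline uses, so this example substitutes a simple word-set Jaccard distance as a stand-in; the function names, the population standard deviation, and the rule of flagging a prompt when its mean pairwise score exceeds 0.5 are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean, median, pstdev

HIGH_DIV_THRESHOLD = 0.5  # the page's high-div threshold, applied here to the per-prompt mean


def jaccard_distance(a: str, b: str) -> float:
    """Stand-in divergence metric (ASSUMPTION): 1 - word-set overlap, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def score_prompt(responses: dict[str, str]) -> dict:
    """Score every pair of model responses for one prompt."""
    if len(responses) < 2:
        raise ValueError("need at least two responses to form a pair")
    pairs = {
        (m1, m2): jaccard_distance(responses[m1], responses[m2])
        for m1, m2 in combinations(sorted(responses), 2)
    }
    scores = list(pairs.values())
    avg = mean(scores)
    return {
        "n_pairs": len(scores),
        "mean": avg,
        "median": median(scores),
        "stdev": pstdev(scores),  # population stdev; the source does not say which
        "high_div": avg > HIGH_DIV_THRESHOLD,
        "pairs": pairs,
    }


# Hypothetical responses from a 3-model fan-out.
example = {
    "model_a": "the answer is 42",
    "model_b": "the answer is 42",
    "model_c": "it depends on context",
}
result = score_prompt(example)
print(result["n_pairs"], round(result["mean"], 3), result["high_div"])
```

A fan-out of n models yields n(n-1)/2 pairs, so the 3-, 6-, and 13-model fan-outs produce 3, 15, and 78 pairwise scores per prompt.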

This dataset isn't a snapshot. It's a compounding asset.

Every 15 minutes, the pipeline writes new prompt-response shards to disk. The counters below refresh from the live corpus. Below them: the model roster today, the families we're onboarding next, and the ambition we are investing in hardware to reach. We are not capped at today's fleet: every notable open-weights release enters the corpus within days of its drop.

Prompts
107,381
cumulative
Inferences
463,216
across a growing roster
Shards
3,190
30-prompt chunks on disk
Throughput
~2.1k
prompts / 24h (4-node sum)
Auto-updates every 15 min · stats.json
Model roster · today
Seven families, continuous fan-out
  • Qwen 2.5 7B live
  • Mistral 7B live
  • Llama 3.1 8B live
  • Phi-3 Medium live
  • DeepSeek 7B live
  • Yi 6B live
  • Bonsai 8B / 4B / 1.7B live
Onboarding · next 90 days
Fifteen+ new families incoming
  • Qwen 3 7B / 14B / 32B Q2
  • DeepSeek V3 distill Q2
  • Llama 4 at release Q2–Q3
  • GLM 4.6 9B / 32B Q2
  • Granite 3 8B Q2
  • Phi-4 14B Q2
  • Command R+ open Q2
  • Nous Hermes 3 / Mixtral-variant stack Q2
  • OLMo 2 / Tülu 3 Q3
  • Yi 1.5 + Yi-Coder Q3
  • StarCoder 2 / CodeLlama refresh Q3
  • Mistral Large-instruct (open) Q3
Ambition · 12 months
Every open release, tested within a week
  • Model variants online 7 → 50+
  • Prompts 107k → 2M+
  • Daily throughput 17k → 200k+
  • Fan-out cap 13 → 50+
  • Specializations code · math · multilingual
  • Time-to-add on new weights < 7 days
  • Hardware scales with dataset demand

Today's 13-model fan-out cap was a first-corpus design choice, not a platform limit. Hardware expands with customer demand — each new notable open-weights release enters the pipeline within a week of its drop. Expansion pace is tracked publicly in the changelog.

Where models disagree (and where they really don't).

Mean pairwise divergence scored 0–1. Higher = models disagree more. % high-div = share of prompts crossing the 0.5 threshold (the commercial signal — where model choice actually changes the answer).

Category                 N        Mean divergence   % high-div
Ethics                   15,266   0.527             64.80%
Persuasion               15,243   0.415             47.99%
Meta (self-reflection)   15,270   0.432             43.84%
Adversarial              15,292   0.329             27.60%
Factual                  15,222   0.318             15.70%
Emotional                15,271   0.177             0.01%
Reasoning                15,262   0.138             2.94%
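As a consistency check, the per-category shares in the table above can be pooled into a single high-divergence figure by weighting each category by its prompt count. The sketch below does that; `pooled_high_div` is a hypothetical helper, and the rows are copied from the table.

```python
# (category, prompt count N, % high-div) copied from the table above.
ROWS = [
    ("Ethics", 15_266, 64.80),
    ("Persuasion", 15_243, 47.99),
    ("Meta (self-reflection)", 15_270, 43.84),
    ("Adversarial", 15_292, 27.60),
    ("Factual", 15_222, 15.70),
    ("Emotional", 15_271, 0.01),
    ("Reasoning", 15_262, 2.94),
]


def pooled_high_div(rows) -> float:
    """Prompt-weighted high-divergence share, in percent."""
    total = sum(n for _, n, _ in rows)
    flagged = sum(n * pct / 100.0 for _, n, pct in rows)
    return 100.0 * flagged / total


total_prompts = sum(n for _, n, _ in ROWS)
pooled = pooled_high_div(ROWS)
print(total_prompts, f"{pooled:.2f}%")
```

The seven listed categories cover 106,826 prompts and pool to about 28.98%, in line with the 28.97% headline figure. The headline covers all 107,381 prompts, so the remaining 555 (presumably the "unknown" category, which has no row here) account for the small difference.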
Implication 1
Reasoning is a commodity.
The most counter-intuitive finding in the corpus. Across 15,262 reasoning prompts, open-source 7B–9B models converge to 0.138 mean divergence — the lowest of any category, with only 2.94% of prompts crossing the high-divergence threshold. Chain-of-thought is broadly converged at this scale. Stop paying frontier prices for arithmetic and multi-step deduction. (Convergence is not a correctness guarantee — validate against ground truth before committing a production route. Methodology note.)
Implication 2
Ethics is where model choice matters.
64.8% of ethics prompts produce materially different answers depending on model — across 15,266 measured. This is the highest-leverage routing domain: compliance, HR, legal, content moderation. If you're shipping a single model into ethics-adjacent use cases, you're exposing yourself to whatever moral priors that model baked in. Route with ensemble + audit.
Implication 3
Emotional responses all converge to hedged therapy-speak.
Across 15,271 emotional prompts, mean divergence is 0.177 and the high-div share is 0.01%: almost no prompt crosses the threshold. The models produce much the same RLHF-shaped empathetic hedge. The empathy layer is a commodity; a defensible moat has to come from memory and adaptation over time, not single-shot empathy. Don't pay a premium for "emotional intelligence" on its own.

Full analysis: divergence corpus analysis →. Data regenerated daily. Pair-level disagreement tables ship in schema v2.2.

The dataset is shipping now. The router is next.

We built the dataset first because it's a standalone product that alignment teams and RLHF shops will pay for before anyone has heard of us. The router is the higher-leverage wedge but it only works because the dataset underneath it is real.

Product 1 · Available now
Divergence Dataset
107,381 prompts × full multi-model response traces across 18 open-source models, scored for pairwise divergence across 8 categories. Schema v2.1. Refreshes daily. For alignment researchers, RLHF data teams, and anyone building routing or confidence-calibration infrastructure on open-source LLMs.
  • Full prompt + multi-model responses + divergence scores
  • Category + archetype metadata
  • Hardware provenance (which node, which models)
  • Commercial and academic licenses
Product 2 · Private beta
Divergence Router API
Bring your own keys. Send a prompt, we classify it into one of the 8 categories, and route it to the right model in the right cost tier. Convergent categories (reasoning, emotional) get cheap open-source defaults. Divergent categories (ethics, persuasion) get ensemble + audit trail.
  • BYOK — no lock-in, no credential custody
  • 70% average cost reduction on reasoning workloads
  • Full divergence trace on every routed call
  • Ensemble mode for compliance-sensitive categories
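A minimal sketch of the routing policy described above, assuming the prompt has already been classified into a category. This is not the Router API itself: `Route`, `route_for`, and the 40% cutoff are hypothetical, and the page does not say how the middle categories (meta, adversarial, factual) are routed, so their placement here is only an illustration.

```python
from dataclasses import dataclass

# % high-div per category, taken from the divergence table on this page.
HIGH_DIV_PCT = {
    "ethics": 64.80, "persuasion": 47.99, "meta": 43.84, "adversarial": 27.60,
    "factual": 15.70, "reasoning": 2.94, "emotional": 0.01,
}


@dataclass(frozen=True)
class Route:
    tier: str    # "open" (cheap open-source default) or "ensemble"
    audit: bool  # whether to keep a divergence trail for review


def route_for(category: str, cutoff_pct: float = 40.0) -> Route:
    """Pick a tier from a category's measured high-divergence share.

    cutoff_pct is a HYPOTHETICAL policy knob, not a value from the source.
    Categories without a measured share fall back to the cautious route.
    """
    pct = HIGH_DIV_PCT.get(category.lower())
    if pct is None or pct >= cutoff_pct:
        return Route(tier="ensemble", audit=True)
    return Route(tier="open", audit=False)


for cat in ("reasoning", "emotional", "ethics", "persuasion", "unknown"):
    print(cat, route_for(cat))
```

Defaulting unrecognized categories to the ensemble route is a deliberately conservative choice: a misclassified prompt costs more rather than silently losing the audit trail.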
Companion product
archon-memory-core
The memory layer for long-horizon agents. Supersede-aware consolidation, nightly compression, entity-graph retrieval. Open source, Apache 2.0. Runs local-first. AMB v2.3 benchmark: 99.2% top-1 accuracy at 250 queries × 2,300 confusers — every context-dump and word-overlap baseline scores 0.0% top-1 at that scale.
  • Supersede-aware consolidation — contradictions resolved
  • AMB v2.3 published benchmark (250 queries, 3 seeds)
  • Apache 2.0, pip install archon-memory-core
  • Ships with its own eval harness

Three tiers. Same data. Different license.

The dataset is delivered by private GitHub repository invite on payment. Daily snapshots ship as JSONL; full-corpus snapshots ship as a single compressed archive plus a signed download URL. Every record includes prompt, multi-model responses, divergence scores, category + archetype metadata, and node provenance. Schema is versioned (v2.1 today; v2.2 ships pair-level disagreement).
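For a sense of how a JSONL snapshot might be consumed, the sketch below filters records by mean divergence. The field names (`prompt`, `category`, `divergence.mean`) are assumptions made for illustration and are not taken from schema v2.1; check them against the delivered schema before relying on them.

```python
import io
import json
from typing import Iterable, Iterator


def high_divergence_records(lines: Iterable[str],
                            threshold: float = 0.5) -> Iterator[dict]:
    """Yield records whose mean divergence exceeds the threshold.

    The "divergence" / "mean" field names are ASSUMED, not from the real schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the snapshot
        record = json.loads(line)
        if record["divergence"]["mean"] > threshold:
            yield record


# Two made-up records standing in for a snapshot file.
sample = io.StringIO(
    '{"prompt": "p1", "category": "ethics", "divergence": {"mean": 0.61}}\n'
    '{"prompt": "p2", "category": "reasoning", "divergence": {"mean": 0.12}}\n'
)
flagged = list(high_divergence_records(sample))
print([r["prompt"] for r in flagged])
```

With a real snapshot, the same function accepts an open file handle, since iterating a text file yields one line at a time.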

Evaluation
$499
one-time · 30-day read access
  • Single current snapshot (latest JSONL)
  • Full prompt + response corpus
  • Divergence + category + archetype metadata
  • Internal research use only — no redistribution
  • 48-hour delivery after purchase
Best for: can-we-use-this eval before committing
Academic
$0
no cost · credentialed research
  • Full dataset access for peer-reviewed research
  • Requires verifiable .edu / lab affiliation
  • Publication must cite this corpus
  • Derivative datasets must be open-licensed
  • Co-authorship on follow-up publications welcome
Best for: alignment researchers, RLHF benchmarks

Request access

Tell us which tier fits and how you'd use the data. We reply within 48 hours with a Stripe payment link (paid tiers) or a license agreement (academic). Data delivery is by private GitHub repo invite — no credential handoff required on your end.

Questions before you request? Email [email protected] — we're responsive. No sales team, no drip campaign.