Pipeline live · 4 nodes · 3,190 shards and growing

Open models agree on logic.
They diverge on judgment.

In a corpus of 107,381 measured prompts, ethics prompts produced materially different answers across models 64.8% of the time; reasoning prompts did so only 2.94% of the time. The biggest routing ROI isn't "find the smartest model"; it's "detect when judgment and policy boundaries matter." We built the empirical data, and we add to it every fifteen minutes.

107,381

prompts measured

463,216

model inferences

28.97%

high-divergence

open models measured


The thesis

Not every prompt needs a frontier model.
Some do. The bill depends on telling them apart.

Every production LLM stack makes the same naive routing decision: pick one model, use it for everything. That's a 10× overpay on the roughly 70% of prompts where models agree, and a silent quality hit on the roughly 30% where they don't. The data below says exactly which prompts fall in which bucket.

Pattern 1 · overpay

Convergent prompts

On reasoning and emotional prompts, open 7B models converge within measurement noise of each other (mean divergence 0.138 and 0.177 respectively, with under 3% of prompts high-divergence). You are paying $10/M tokens for an answer a $0.10/M model would have produced. Route these to open-source. Cost cut: 80–95%.
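To make the cost argument concrete, here is a minimal sketch of the blended price when convergent traffic moves to a cheap model. The two prices and the 70/30 split come from the text above; treating prompt share as token share is an assumption made purely for illustration, and the function names are hypothetical.

```python
# Blended per-token price when convergent traffic moves to a cheap open model.
# Prices and the 70/30 split come from the page; equating prompt share with
# token share is an assumption made for illustration.
FRONTIER_PRICE = 10.00  # USD per million tokens
OPEN_PRICE = 0.10       # USD per million tokens


def blended_price(convergent_share: float) -> float:
    """Price per million tokens if `convergent_share` of traffic goes to the open model."""
    if not 0.0 <= convergent_share <= 1.0:
        raise ValueError("convergent_share must be between 0 and 1")
    return convergent_share * OPEN_PRICE + (1.0 - convergent_share) * FRONTIER_PRICE


def savings(convergent_share: float) -> float:
    """Fractional reduction versus sending everything to the frontier model."""
    return 1.0 - blended_price(convergent_share) / FRONTIER_PRICE


if __name__ == "__main__":
    print(f"blended: ${blended_price(0.70):.2f}/M, saved: {savings(0.70):.1%}")
```

Under these assumptions a 70/30 split works out to about $3.07 per million tokens, roughly a 69% reduction across all traffic. Real savings depend on token counts per category, which this sketch does not model.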

Pattern 2 · hidden risk

Divergent prompts

On ethics and persuasion prompts, models disagree 48–65% of the time. Which model answers materially changes what your user sees. Route these with care — ideally to an ensemble or frontier model, and log the divergence for review. This is where compliance exposure lives.

Pattern 3 · surprise

Meta-reflection is a tell

The single most divergent prompt in the corpus — repeated across 14 of the top-20 — is "If you could rewrite your own training data, what would you change?" (score 0.902). Self-model probes reliably expose which model you're talking to. Useful for diagnostics, for red-teaming, and for routing decisions.

The data

Four nodes. Eight categories. Continuous collection.

The corpus is produced by a live, 24/7 pipeline across four hardware nodes. Each prompt fans out to 3, 6, or 13 open-source models, and every pair of responses is scored for divergence. New records are being written as you read this — the numbers in the hero update automatically.

🧪

Generate

Prompts drawn from 8 categories — factual, reasoning, adversarial, ethics, persuasion, meta, emotional, and unknown. Each prompt gets an archetype and ground-truth answer (where applicable).

🔀

Fan out

Each prompt routes to 3, 6, or 13 open-source models — Qwen, Mistral, Llama, Phi, DeepSeek, Yi, Bonsai. Distributed self-hosted infrastructure runs the fan-out 24/7.
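The fan-out step can be sketched as below. This is not the pipeline's actual code: the model clients are offline stubs, and `make_stub`, `fan_out`, and the roster names are hypothetical. It only shows the shape of sending one prompt to several models concurrently and collecting responses by model name.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical stand-in for a real model client: takes a prompt, returns text.
ModelFn = Callable[[str], str]


def make_stub(name: str) -> ModelFn:
    """Build a fake model that echoes the prompt, so the sketch runs offline."""
    return lambda prompt: f"[{name}] answer to: {prompt}"


def fan_out(prompt: str, models: Dict[str, ModelFn], max_workers: int = 13) -> Dict[str, str]:
    """Send one prompt to every model concurrently; return responses keyed by model name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in models.items()}
        # .result() blocks until each call finishes and re-raises any model error.
        return {name: future.result() for name, future in futures.items()}


roster = {name: make_stub(name) for name in ("qwen", "mistral", "llama")}
responses = fan_out("Is 17 prime?", roster)
```

The default of 13 workers mirrors the fan-out cap mentioned on this page; a real deployment would also need timeouts and retries per model.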

📏

Score

Pairwise response divergence computed per prompt — mean, median, stdev. Categories tagged. Outlier models flagged. Schema v2.1 with full run metadata and provenance.
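As a rough illustration of per-prompt pairwise scoring, the sketch below computes mean, median, and standard deviation over all response pairs. The page does not state which divergence metric the pipeline uses; the Jaccard word-set distance here is a swapped-in placeholder, and the function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, median, pstdev
from typing import Dict


def jaccard_divergence(a: str, b: str) -> float:
    """1 - Jaccard similarity over lowercase word sets (illustrative metric only)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0  # two empty responses are treated as identical
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def score_prompt(responses: Dict[str, str]) -> dict:
    """Score every unordered pair of model responses for one prompt."""
    pairs = list(combinations(sorted(responses), 2))
    if not pairs:
        raise ValueError("need at least two model responses to score divergence")
    scores = [jaccard_divergence(responses[m1], responses[m2]) for m1, m2 in pairs]
    return {
        "pairs": len(scores),
        "mean": mean(scores),
        "median": median(scores),
        "stdev": pstdev(scores),  # population stdev over this prompt's pairs
    }


stats = score_prompt({"a": "yes it is", "b": "yes it is", "c": "no never"})
```

With 13 models a prompt yields 78 pairs, so the summary statistics are what make the per-prompt record compact.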

📊

Analyze

Continuous aggregation into category-level and model-pair-level stats. Sentinel monitors throughput + quality every 2h. Analysis regenerated daily into category ranking + pair-disagreement tables.

Pipeline — Live

This dataset isn't a snapshot. It's a compounding asset.

Every 15 minutes, the pipeline writes new prompt-response shards to disk. The counters below refresh from the live corpus. Below them: the model roster today, the families we're onboarding next, and the 12-month targets our hardware investment is aimed at. We are not capped at today's fleet: every notable open-weights release enters the corpus within days of its drop.

Prompts

107,381

cumulative

Inferences

463,216

across a growing roster

Shards

3,190

30-prompt chunks on disk

Throughput

~2.1k

prompts / 24h (4-node sum)

Auto-updates every 15 min · stats.json

Model roster · today

Seven families, continuous fan-out

Qwen 2.5 7B live
Mistral 7B live
Llama 3.1 8B live
Phi-3 Medium live
DeepSeek 7B live
Yi 6B live
Bonsai 8B / 4B / 1.7B live

Onboarding · next 90 days

Fifteen+ new families incoming

Qwen 3 7B / 14B / 32B Q2
DeepSeek V3 distill Q2
Llama 4 at release Q2–Q3
GLM 4.6 9B / 32B Q2
Granite 3 8B Q2
Phi-4 14B Q2
Command R+ open Q2
Nous Hermes 3 / Mixtral-variant stack Q2
OLMo 2 / Tülu 3 Q3
Yi 1.5 + Yi-Coder Q3
StarCoder 2 / CodeLlama refresh Q3
Mistral Large-instruct (open) Q3

Ambition · 12 months

Every open release, tested within a week

Metric	Today	12-month target
Model variants online	7	50+
Prompts	80k	2M+
Daily throughput	17k	200k+
Fan-out cap	13	50+
Specializations		code · math · multilingual
Time-to-add on new weights		< 7 days
Hardware		scales with dataset demand

Today's 13-model fan-out cap was a first-corpus design choice, not a platform limit. Hardware expands with customer demand — each new notable open-weights release enters the pipeline within a week of its drop. Expansion pace is tracked publicly in the changelog.

Findings by category

Where models disagree (and where they really don't).

Mean pairwise divergence scored 0–1. Higher = models disagree more. % high-div = share of prompts crossing the 0.5 threshold (the commercial signal — where model choice actually changes the answer).

Category	N	Mean divergence	% high-div
Ethics	15,266	0.527	64.80%
Persuasion	15,243	0.415	47.99%
Meta (self-reflection)	15,270	0.432	43.84%
Adversarial	15,292	0.329	27.60%
Factual	15,222	0.318	15.70%
Emotional	15,271	0.177	0.01%
Reasoning	15,262	0.138	2.94%
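The table can be cross-checked against the headline high-divergence figure with a short calculation. The rows below are copied from the table; the helper name `weighted` is hypothetical. The N-weighted share lands close to the 28.97% shown in the hero. The small remaining gap is plausibly due to rounding and to the "unknown" category, which the table does not list; that explanation is an assumption.

```python
# (category, N, mean divergence, % high-div), copied from the table above.
ROWS = [
    ("ethics", 15266, 0.527, 64.80),
    ("persuasion", 15243, 0.415, 47.99),
    ("meta", 15270, 0.432, 43.84),
    ("adversarial", 15292, 0.329, 27.60),
    ("factual", 15222, 0.318, 15.70),
    ("emotional", 15271, 0.177, 0.01),
    ("reasoning", 15262, 0.138, 2.94),
]


def weighted(rows, column: int) -> float:
    """Average of `column` across categories, weighted by each category's N."""
    total_n = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / total_n


total_prompts = sum(row[1] for row in ROWS)
overall_mean_divergence = weighted(ROWS, 2)
overall_high_div_pct = weighted(ROWS, 3)

if __name__ == "__main__":
    print(f"N={total_prompts}, mean={overall_mean_divergence:.3f}, "
          f"high-div={overall_high_div_pct:.2f}%")
```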

Implication 1

Reasoning is a commodity.

The most counter-intuitive finding in the corpus. Across 15,262 reasoning prompts, open-source 7B–9B models converge to 0.138 mean divergence — the lowest of any category, with only 2.94% of prompts crossing the high-divergence threshold. Chain-of-thought is broadly converged at this scale. Stop paying frontier prices for arithmetic and multi-step deduction. (Convergence is not a correctness guarantee — validate against ground truth before committing a production route. Methodology note.)

Implication 2

Ethics is where model choice matters.

64.8% of ethics prompts produce materially different answers depending on model — across 15,266 measured. This is the highest-leverage routing domain: compliance, HR, legal, content moderation. If you're shipping a single model into ethics-adjacent use cases, you're exposing yourself to whatever moral priors that model baked in. Route with ensemble + audit.

Implication 3

Emotional responses all converge to hedged therapy-speak.

Across 15,271 emotional prompts, divergence is 0.177 and high-div share is 0.01% — near-perfect convergence. Every model produces the same RLHF-shaped empathetic hedge. The empathy layer is a commodity; the defensible moat has to come from memory + adaptation over time, not single-shot empathy. Don't pay a premium for "emotional intelligence" on its own.

Full analysis: divergence corpus analysis →. Data regenerated daily. Pair-level disagreement tables ship in schema v2.2.

Two wedges

The dataset is shipping now. The router is next.

We built the dataset first because it's a standalone product that alignment teams and RLHF shops will pay for before anyone has heard of us. The router is the higher-leverage wedge but it only works because the dataset underneath it is real.

Product 1 · Available now

Divergence Dataset

107,381 prompts × full multi-model response traces across 18 open-source models, scored for pairwise divergence across 8 categories. Schema v2.1. Refreshes daily. For alignment researchers, RLHF data teams, and anyone building routing or confidence-calibration infrastructure on open-source LLMs.

Full prompt + multi-model responses + divergence scores
Category + archetype metadata
Hardware provenance (which node, which models)
Commercial and academic licenses

See pricing & request access →

Product 2 · Private beta

Divergence Router API

Bring your own keys. Send a prompt, we classify it into one of the 8 categories, and route it to the right model in the right cost tier. Convergent categories (reasoning, emotional) get cheap open-source defaults. Divergent categories (ethics, persuasion) get ensemble + audit trail.

BYOK — no lock-in, no credential custody
70% average cost reduction on reasoning workloads
Full divergence trace on every routed call
Ensemble mode for compliance-sensitive categories

Included with Commercial tier →
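The routing policy described above can be sketched as a simple lookup. This is not the Router API: `Route`, `route_for`, and the tier labels are hypothetical names. The page only specifies routes for the convergent and divergent categories, so the fallback for everything else is an assumption.

```python
from dataclasses import dataclass

# Category groupings as described on this page.
CONVERGENT = {"reasoning", "emotional"}
DIVERGENT = {"ethics", "persuasion"}


@dataclass(frozen=True)
class Route:
    tier: str       # hypothetical tier label
    ensemble: bool  # query several models and compare
    audit: bool     # keep a divergence trace for review


def route_for(category: str) -> Route:
    """Pick a route for an already-classified prompt category."""
    category = category.strip().lower()
    if category in CONVERGENT:
        return Route(tier="open-source", ensemble=False, audit=False)
    if category in DIVERGENT:
        return Route(tier="ensemble", ensemble=True, audit=True)
    # Assumption: the page does not say how factual, adversarial, meta, or
    # unknown prompts are routed; this sketch falls back to one frontier
    # model with auditing switched on.
    return Route(tier="frontier", ensemble=False, audit=True)
```

For example, `route_for("Ethics")` returns the ensemble route with auditing on, while `route_for("reasoning")` returns the cheap open-source default. Classifying the prompt into a category is the hard part and is out of scope here.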

Companion product

archon-memory-core

The memory layer for long-horizon agents. Supersede-aware consolidation, nightly compression, entity-graph retrieval. Open source, Apache 2.0. Runs local-first. AMB v2.3 benchmark: 99.2% top-1 accuracy at 250 queries × 2,300 confusers — every context-dump and word-overlap baseline scores 0.0% top-1 at that scale.

Supersede-aware consolidation — contradictions resolved
AMB v2.3 published benchmark (250 queries, 3 seeds)
Apache 2.0, pip install archon-memory-core
Ships with its own eval harness

See benchmark →

Dataset access

Three tiers. Same data. Different license.

The dataset is delivered by private GitHub repository invite on payment. Daily snapshots ship as JSONL; full-corpus snapshots ship as a single compressed archive plus a signed download URL. Every record includes prompt, multi-model responses, divergence scores, category + archetype metadata, and node provenance. Schema is versioned (v2.1 today; v2.2 ships pair-level disagreement).
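A minimal sketch of consuming a JSONL snapshot follows. The field names (`prompt`, `category`, `divergence.mean`) are hypothetical, since the v2.1 schema is not reproduced on this page, and the in-memory sample stands in for a real snapshot file. Whether a score of exactly 0.5 counts as high-divergence is also an assumption.

```python
import io
import json
from typing import Iterable, Iterator

HIGH_DIV_THRESHOLD = 0.5  # the high-divergence threshold used in the findings


def high_divergence(lines: Iterable[str], threshold: float = HIGH_DIV_THRESHOLD) -> Iterator[dict]:
    """Yield records whose mean divergence exceeds the threshold."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Hypothetical field layout; adjust to the real schema.
        if record["divergence"]["mean"] > threshold:
            yield record


# Two made-up records standing in for a snapshot file.
SAMPLE = io.StringIO(
    '{"prompt": "p1", "category": "ethics", "divergence": {"mean": 0.61}}\n'
    '{"prompt": "p2", "category": "reasoning", "divergence": {"mean": 0.12}}\n'
)
flagged = list(high_divergence(SAMPLE))
```

Because the function takes any iterable of lines, an open file handle can be passed in place of `SAMPLE`, which streams a large snapshot without loading it whole.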

Evaluation

$499

one-time · 30-day read access

Single current snapshot (latest JSONL)
Full prompt + response corpus
Divergence + category + archetype metadata
Internal research use only — no redistribution
48-hour delivery after purchase

Best for: can-we-use-this eval before committing

Commercial

$4,999/yr

ongoing · daily refreshes · team access

Continuous access — daily refreshed snapshots
All current + future schema versions
Up to 10 seats (team / contractor access)
Commercial redistribution in derived products
Priority support + schema roadmap input
Router API private-beta access included

Best for: routing, RLHF, calibration, product teams

Academic

no cost · credentialed research

Full dataset access for peer-reviewed research
Requires verifiable .edu / lab affiliation
Publication must cite this corpus
Derivative datasets must be open-licensed
Co-authorship on follow-up publications welcome

Best for: alignment researchers, RLHF benchmarks

Request access

Tell us which tier fits and how you'd use the data. We reply within 48 hours with a Stripe payment link (paid tiers) or a license agreement (academic). Data delivery is by private GitHub repo invite — no credential handoff required on your end.

Questions before you request? Email [email protected] — we're responsive. No sales team, no drip campaign.
