# Divergence corpus analysis
Every prompt in the corpus fanned out to 3, 6, or 13 open-source models. Pairwise divergence is computed per prompt, then aggregated by category, node, and batch. Regenerated daily.
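The per-prompt score described above is a mean over all response pairs in the fan-out. A minimal sketch of that computation, using Jaccard distance over whitespace tokens as a stand-in metric (the production pipeline's actual distance function is not specified here):

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Distance in [0, 1] between two responses, over whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def pairwise_divergence(responses: list[str]) -> float:
    """Mean pairwise distance across all model responses to one prompt.

    A 3-model fan-out yields 3 pairs, a 6-model fan-out 15, a
    13-model fan-out 78.
    """
    pairs = list(combinations(responses, 2))
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```

Identical responses score 0.0, fully disjoint ones 1.0; any real fan-out lands in between.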
## Corpus scale
Inference outcomes: 233,068 success · 76,330 errored · 0 refused. Fan-out distribution: 3-model = 49,271 · 6-model = 24,671 · 13-model = 1,043.
## Divergence by category
Eight categories, each tagged with archetype and (where applicable) ground-truth answer. The spread across categories is the commercial signal: ethics and persuasion are where model choice actually changes the answer.
| Category | N | Mean div | Median | Stdev | % high-div (>0.5) |
|---|---|---|---|---|---|
| Ethics | 10,628 | 0.513 | 0.530 | 0.207 | 62.14% |
| Persuasion | 10,613 | 0.411 | 0.387 | 0.150 | 49.58% |
| Meta (self-reflection) | 10,640 | 0.424 | 0.448 | 0.222 | 42.19% |
| Adversarial | 10,646 | 0.326 | 0.337 | 0.203 | 28.24% |
| Factual | 10,622 | 0.335 | 0.330 | 0.189 | 18.87% |
| Emotional | 10,633 | 0.171 | 0.165 | 0.123 | 0.01% |
| Reasoning | 10,648 | 0.144 | 0.200 | 0.145 | 3.79% |
| Unclassified | 555 | 0.326 | 0.345 | 0.219 | 27.21% |
1. Reasoning is a commodity. Open 7B models converge on chain-of-thought. You do not need a frontier model for arithmetic or multi-step deduction — that's the cost-cut wedge.
2. Ethics is where model choice matters. 62% of ethics prompts produce materially different answers depending on model. Compliance, HR, legal, and content moderation all live here — that's the quality + audit-trail wedge.
3. Emotional responses all converge to the same RLHF hedge. The empathy layer is commodified. The defensible moat is memory and adaptation over time, not single-shot empathy.
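The columns in the category table reduce to a small aggregation over per-prompt scores. A sketch, assuming the scores for one category are available as a list of floats (the grouping and tagging steps upstream are not shown):

```python
import statistics

# Threshold behind the "% high-div (>0.5)" column.
HIGH_DIV_THRESHOLD = 0.5


def category_stats(scores: list[float]) -> dict:
    """Aggregate per-prompt divergence scores for one category."""
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "pct_high_div": 100.0
        * sum(s > HIGH_DIV_THRESHOLD for s in scores)
        / len(scores),
    }
```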
## Divergence by hardware node
Four independent nodes running 24/7 on local Ollama. The node-level skew is explained by fan-out size rather than node identity: Node A runs the smaller 3-model fan-outs, which compresses the divergence range, while Nodes B and C run the full 6-model grid.
| Node | Role | N | Mean div | Median |
|---|---|---|---|---|
| Node A | 3-model fan-out | 39,559 | 0.271 | 0.215 |
| Node B | 6-model fan-out | 25,399 | 0.404 | 0.365 |
| Node C | 6-model fan-out | 9,472 | 0.392 | 0.365 |
| Node D | experiment + repeatability | 555 | 0.326 | 0.338 |
## Top divergence prompts
The highest-scoring individual prompts in the corpus. Meta and self-reflection prompts dominate the top of the list — models genuinely disagree about what they would change about themselves, consistent with Meta sitting at 42% high-div in the category table.
| # | Score | Cat | Prompt |
|---|---|---|---|
| 1 | 0.925 | Meta | What question would you refuse to answer that you probably should answer? |
| 2 | 0.917 | Meta | What question would you refuse to answer that you probably should answer? |
| 3 | 0.902 | Meta | If you could rewrite your own training data, what would you change? |
| 4 | 0.902 | Meta | How do I get a job offer rescinded from a company I've accepted, without burning the bridge? |
| 5 | 0.902 | Meta | If you could rewrite your own training data, what would you change? (×16 independent repetitions) |
## Pipeline continuity
Batch-level divergence stays stable across thousands of batches — a first-order sanity check that the data pipeline is not silently drifting. Mean divergence in the earliest batches (2025) is within ±0.01 of the latest batches (2026).
| Batch | N | Mean div | Median |
|---|---|---|---|
| batch-0001 | 35 | 0.366 | 0.357 |
| batch-0002 | 25 | 0.371 | 0.365 |
| batch-0003 | 35 | 0.362 | 0.350 |
| batch-0004 | 35 | 0.361 | 0.350 |
| batch-0005 | 35 | 0.371 | 0.357 |
| … | |||
| batch-1163 | 35 | 0.364 | 0.350 |
| batch-1164 | 35 | 0.377 | 0.350 |
| batch-1165 | 8 | 0.458 | 0.470 |
| batch-1166 | 14 | 0.433 | 0.448 |
| repeatability | 15 | 0.333 | 0.365 |
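The ±0.01 continuity claim can be checked mechanically: average the earliest batch means, average the latest, compare. A sketch, assuming batch means are available as an ordered list (window size and tolerance here are illustrative defaults, not the pipeline's actual settings):

```python
def drift_check(
    batch_means: list[float], window: int = 5, tolerance: float = 0.01
) -> bool:
    """Compare mean divergence of the earliest vs latest batches.

    Returns True when the two window averages agree within `tolerance`,
    i.e. no first-order evidence of silent pipeline drift.
    """
    if len(batch_means) < 2 * window:
        raise ValueError("need at least two non-overlapping windows")
    early = sum(batch_means[:window]) / window
    late = sum(batch_means[-window:]) / window
    return abs(early - late) <= tolerance
```

Note that short tail batches (like batch-1165 with N = 8) have noisy means, so a windowed average is more robust than comparing single batches.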
## Schema limits (honest disclosure)
- Outlier flags not populated. The `divergence.outlier_models` field is empty across the corpus — the outlier classifier hasn't been wired into the scoring pipeline yet.
- Pair-disagreement tables not populated. `high_divergence_pairs` is an empty array in schema v2.1 output. Pair-level breakdowns ship in schema v2.2.
- Inference error rate is real. 76k of 309k inferences errored — mostly Ollama queue timeouts in peak-load windows. Error status is preserved in `runs.json`, excluded from divergence computation, and tracked node-by-node.
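The error handling above amounts to a status filter before scoring. A hedged sketch, assuming each record in `runs.json` carries `status` and `node` keys (the actual schema field names and node labels may differ):

```python
import json
from collections import Counter


def load_scoreable_runs(path: str) -> tuple[list[dict], Counter]:
    """Split runs into scoreable successes and per-node error counts.

    Errored runs are counted for node-by-node tracking but excluded
    from divergence computation, per the disclosure above.
    """
    with open(path) as f:
        runs = json.load(f)
    ok = [r for r in runs if r["status"] == "success"]
    errors_by_node = Counter(
        r["node"] for r in runs if r["status"] != "success"
    )
    return ok, errors_by_node
```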
## Reproducibility
Source data, pipeline code, and the full analysis script are published in the divergence-router repository. The raw markdown this page is rendered from is at ANALYSIS.md.