What this is
The Coherence External Diagnostic Pipeline runs entirely on local hardware — no cloud APIs. Every model choice, prompt change, and scoring adjustment has measurable consequences. This is where we publish what we learn.
Gemma 4 vs Qwen3 on a Global Restaurant Chain
Two 30B-class models ran the full pipeline against the same 709-document corpus on parallel DGX Sparks. Speed, extraction quality, scoring fidelity, and diagnostic depth are compared head-to-head.
709 documents
Gemma 4 0.498
Qwen3 0.432
Winner: Gemma 4 31B
Autoresearch: 73 Experiments to 0.000 Standard Deviation
Systematic pipeline hardening across 4 research tracks on distributed DGX Sparks. 30 hours of compute. Prompt tuning, scoring calibration, finding thresholds, and adversarial debate until the pipeline converged.
73 experiments
4 research tracks
30h compute
0.000 std dev
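As a rough illustration of what a 0.000 standard deviation means in practice: repeated runs over the same corpus produce scores whose spread rounds to zero at three decimal places. The sketch below is not the pipeline's actual convergence check. The function name `has_converged`, the 0.0005 tolerance, and the score lists are all assumptions made up for the example.

```python
from statistics import pstdev


def has_converged(scores, tolerance=0.0005):
    """Return True when repeated-run scores agree to within `tolerance`.

    A population standard deviation below 0.0005 rounds to 0.000 at
    three decimal places. The tolerance is an assumption, not a value
    taken from the pipeline.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to measure spread")
    return pstdev(scores) < tolerance


# Hypothetical scores from repeated runs on one corpus (not real results).
stable_runs = [0.610, 0.610, 0.610, 0.610]
noisy_runs = [0.610, 0.583, 0.624, 0.567]

print(has_converged(stable_runs))
print(has_converged(noisy_runs))
```

Using the population standard deviation treats the listed runs as the full set being judged. A sample standard deviation (`statistics.stdev`) would be the usual alternative if the runs are meant to estimate spread over future runs.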