Greenbaum Labs
Lab

Build notes from the diagnostic engine.

Experiments, benchmarks, and field notes from building a structural diagnostic pipeline on local inference infrastructure. Real models, real data, real conditions.

What this is
The Coherence External Diagnostic Pipeline runs entirely on local hardware — no cloud APIs. Every model choice, prompt change, and scoring adjustment has measurable consequences. This is where we publish what we learn.
Calibration | April 25, 2026
When the Skeptic Was the Variable
Twenty-five experiments across two phases testing gemma4:31b against its own calibration. Prompt edits that lift sustain also break reproducibility. The larger model didn't win. The split is permanent.
25 experiments | 11h26m compute | 0 winners | Decision: Lock split
Model Race | April 7, 2026
Gemma 4 vs Qwen3 on a Global Restaurant Chain
Two 30B-class models ran the full pipeline against the same 709-document corpus on parallel DGX Sparks. Speed, extraction quality, scoring fidelity, and diagnostic depth compared head-to-head.
709 documents | Gemma 4: 0.498 | Qwen3: 0.432 | Winner: Gemma 4 31B
Pipeline Hardening | March 10, 2026
Autoresearch: 73 Experiments to 0.000 Standard Deviation
Systematic pipeline hardening across 4 research tracks on distributed DGX Sparks. 30 hours of compute. Prompt tuning, scoring calibration, finding thresholds, and adversarial debate until the pipeline converged.
73 experiments | 4 research tracks | 30h compute | 0.000 std dev