Greenbaum Labs

Lab

Build notes from the diagnostic engine.

Experiments, benchmarks, and field notes from building a structural diagnostic pipeline on local inference infrastructure. Real models, real data, real conditions.

What this is

The Coherence External Diagnostic Pipeline runs entirely on local hardware, with no cloud APIs. Every model choice, prompt change, and scoring adjustment has measurable consequences. This is where we publish what we learn.

Calibration · April 25, 2026

When the Skeptic Was the Variable

Twenty-five experiments across two phases testing gemma4:31b against its own calibration. Prompt edits that lift sustain also break reproducibility. The larger model didn't win. The split is permanent.

25 experiments · 11h26m compute · 0 winners · Decision: Lock split

Model Race · April 7, 2026

Gemma 4 vs Qwen3 on a Global Restaurant Chain

Two 30B-class models ran the full pipeline against the same 709-document corpus on parallel DGX Sparks. Speed, extraction quality, scoring fidelity, and diagnostic depth compared head-to-head.

709 documents · Gemma 4 0.498 · Qwen3 0.432 · Winner: Gemma 4 31B

Pipeline Hardening · March 10, 2026

Autoresearch: 73 Experiments to 0.000 Standard Deviation

Systematic pipeline hardening across 4 research tracks on distributed DGX Sparks. 30 hours of compute. Prompt tuning, scoring calibration, finding thresholds, and adversarial debate until the pipeline converged.

73 experiments · 4 research tracks · 30h compute · 0.000 std dev