Lab
Calibration · Phase 1 + Phase 2

When the Skeptic Was the Variable

Twenty-five experiments testing gemma4:31b against its own calibration. What the gates caught.

The Moment
On 2026-04-16, a sandbox fleet ran end-to-end on gemma4:31b. FM-01 detection across 16 entities collapsed from the qwen3 baseline of 14 of 15 to 2 of 16. Same data. Same prompts. Different model. The skeptic, now also running on gemma4, was systematically rejecting findings that gemma4 had written.

At a Glance

Experiments
25
across 2 phases
Compute
11h26m
two Sparks
Variants
9
prompt edits
Variants Won
0
none passed all gates
Baseline Drift
0.000
harness stable

The Setup

The original autoresearch (cr_001 through cr_041, March 2026) hardened the diagnostic pipeline against qwen3:32b. Output was deterministic. Standard deviation across identical reruns: 0.000. The system was diagnostic, not stochastic.

On 2026-03-27, Gemma 4 31B replaced qwen3:32b as the production extraction standard after a head-to-head race study. Scoring was supposed to remain locked to qwen3:32b. The intent was preserved in writing. It was not preserved in code.

Investigation traced the cause to a single point: the skeptic was tuned for qwen3 finding-text style. Gemma 4 phrases inferences differently. The skeptic read those differences as speculative. Round 2 killed them.

Methodology

The loop is the same one that hardened qwen3. Adapted for the new target.

1. Hypothesis    → one focused change, recorded
2. Variant       → prompt edit, applied to a non-production scoring_prompts.py
3. Three trials  → identical inputs, three runs, measure variance
4. Gates         → 5 correctness gates evaluated on the aggregated result
5. Decision      → keep or revert; log to experiments.jsonl
6. Stop rules    → metric shadow, regression streak, plateau, budget, manual

Reference run

Boeing fleet_101_20260220 (sandbox, 2026-04-16). Three rejected findings on the original gemma4 run, all carrying correct dimension tags (compression, diffusion). A correctly calibrated skeptic should sustain at least the compression and diffusion findings.

Gates

Every variant had to pass all five:

1. Reproducibility stdev      ≤ 0.02 across 3 trials
2. Evidence verification      ≥ baseline (catches inflation)
3. Mean weighted strength     ≥ baseline (catches dimension collapse)
4. FM-01 detected             in ≥ 2 of 3 trials
5. most_stressed_dimension    ≠ misalignment (compliance gate)

Plus a metric-shadowing tripwire: if sustain rises +30% while evidence drops −20%, halt. The original autoresearch's FM-04 Moment showed this is necessary.

Compute

Both Sparks for Phase 1, round-robin. Spark 2 only for Phase 2 (qwen2.5:72b loaded only there). Six hours forty minutes wall clock for Phase 1's 19 experiments. Four hours forty-six for Phase 2's 6 experiments. Mean per-experiment: 21 minutes on gemma4:31b, 47 minutes on qwen2.5:72b.

Phase 1 Results — Gemma 4:31B End-to-End

Variant Sustain Stdev Correctness FM-01 Decision
v00 baseline (current production)0.2000.0000.8003/3keep
v01 inference-tolerant skeptic0.2000.0000.8003/3keep
v02 evidence-allowlist skeptic0.2000.0000.8003/3keep
v03 R2-softened skeptic0.4670.2310.0003/3gate_fail
v04 sustain-floor skeptic0.2670.1160.3523/3gate_fail
v05 quote-the-flaw skeptic0.2670.1160.3523/3gate_fail
v06 alternative-bar skeptic0.2000.0000.8003/3keep
v07 authority-verbatim quotation0.0670.1160.0470/3gate_fail
v08 combined v01+v070.2000.0000.8003/3keep

Three patterns emerged.

The variants that did nothing

v01, v02, v06, and the v08 combined version all returned 0.200 sustain, 0.000 stdev, 0.800 correctness. Identical numbers. Three different prompt edits, zero behavioral change. The skeptic ignored them. Either the changes landed in inactive parts of the prompt, or the skeptic's decision logic on this dataset is dominated by signals these edits did not touch.

The variant that broke reproducibility

v03 (R2-softened) lifted sustain from 0.200 to 0.467. Cleanly reproducible across both runs. And introduced 0.231 stdev within trial sets. Cleanly reproducible on that too. The same variant produced the same trial-distribution every time we tested it. The destabilization is not noise. It is what gemma4 does when the skeptic prompt allows more latitude.

The reproducibility gate killed it. Correctness 0.000.

Key Finding
The 0.02 stdev gate that defined the previous autoresearch is not achievable on gemma4 under any prompt that meaningfully softens skeptic decisions. Gemma 4 at temperature 0.0 is not deterministic the way qwen3:32b was.

The variant that regressed

v07 forced the Authority agent to include verbatim quoted text from cited evidence in its reasoning. This is the cr_017 pattern from the original autoresearch, where verbatim word-copying produced the largest single improvement (4x evidence verification). Authority does not respond the same way.

Sustain crashed from 0.200 to 0.067. FM-01 detection dropped to 0 of 3. The structural inferences that authority findings are made of cannot be expressed as direct quotes from any single source. They emerge from combining evidence types (job postings + employee reviews + customer reviews) into a structural read. Forcing verbatim quoting destroyed the inference pattern.

Negative result. Clean evidence. Authority is not Truth.

What the Gates Caught

Two of Phase 1's gates were load-bearing:

Reproducibility (stdev ≤ 0.02). Killed v03, v04, v05. All three lifted sustain. None did so deterministically. The gate prevented promotion of variants that would have produced different scores on different days, defeating the entire purpose of the diagnostic.

FM-01 detection (≥ 2 of 3 trials). Killed v07. Sustain numbers and evidence scores looked superficially fine on aggregate. But the actual FM-01 detection rate was zero. The gate refused to call a variant a winner just because its arithmetic looked acceptable.

The metric-shadow tripwire never fired. The variants that lifted sustain did so without inflating evidence scores. Gemma 4 is honest in that one respect.

Reference Stability

Three baseline runs across the queue (positions 1, 10, 20) all produced exactly 0.200 sustain, 0.000 stdev, 0.800 correctness, FM-01 detected 3 of 3. Six hours and forty minutes apart. Same model, same prompts, same data. The harness has zero drift over a full day.

This matters. It means the variance we saw in v03, v04, v05, v07 is a property of the variants, not of the harness. The autoresearch instrument is sound. What it surfaced is real.

Phase 2 Results — Dual-Model

If gemma4 cannot be calibrated to deterministic high-sustain through prompt changes alone, the alternative is a different model for the skeptic. Phase 2 tested dual-model: extraction stays on gemma4:31b; scoring and skeptic move to qwen2.5:72b — the largest model on the fleet, and the same lineage the original calibration was tuned against.

Six experiments. Spark 2 only. Four hours forty-six minutes wall clock.

Variant on qwen2.5:72b Sustain Stdev Correctness FM-01 Decision
v00 baseline (run 1)0.4000.0000.5003/3keep
v00 baseline (rerun)0.4000.0000.5003/3keep
v03 R2-softened0.4000.0000.5003/3keep
v03 R2-softened (rerun)0.4000.0000.5003/3keep
v04 sustain-floor0.4000.0000.5003/3keep
v00 final drift check0.4000.0000.5003/3keep

Six experiments, six identical results. qwen2.5:72b on Boeing fleet_101 produces sustain 0.400 deterministically, regardless of prompt variant. This is a stronger constraint than Phase 1 surfaced. Gemma 4 had a low-sustain deterministic mode AND a higher-sustain unstable mode. Qwen 2.5:72b has only the deterministic mode, anchored at 0.400.

Cross-phase comparison

Model configuration Sustain Stdev Notes
gemma4:31b end-to-end0.2000.000Deterministic floor
gemma4:31b + R2-softened0.4670.231Lifts sustain, breaks stability
qwen2.5:72b skeptic + gemma4 extraction0.4000.000Deterministic, prompt-invariant
qwen3:32b skeptic + gemma4 extraction (production)0.580lowLedger avg n=113

The dual-model alternative is better than gemma4-only on every gate. It is worse than qwen3:32b on the primary metric. Larger model did not win.

Counter-Intuitive Finding
qwen2.5:72b has 2.25x the parameters of qwen3:32b. On aggregate across the production fleet (n=8 across multiple entities), qwen2.5:72b sustains at 0.62 — slightly higher than qwen3's 0.58. On Boeing specifically, qwen2.5:72b sits at 0.40. The model size advantage does not transfer. Calibration to the specific corpus matters more than parameter count.

The Decision

The qwen3:32b split is the production answer. Not provisional. Not a bridge to anything.

extraction:        gemma4:31b
scoring + skeptic: qwen3:32b

run_pipeline.sh has been carrying this configuration since 2026-04-23 as a temporary measure pending the autoresearch outcome. The autoresearch confirms it. The configuration is now permanent.

Two ways this decision could shift in the future, neither active:

Until either condition triggers, the split is fixed.

What This Cost. What It Returned.

25 experiments. 11 hours 26 minutes of compute. Two Sparks. One Boeing reference run. One harness, one orchestrator, nine prompt variants. Three stable baseline reruns in Phase 1, three in Phase 2.

Returned: a clean negative result. The hypothesis that the skeptic could be re-tuned to recover deterministic high-sustain on gemma4 through prompt changes alone is wrong. The data is unambiguous. The gates worked.

Also returned: a structural understanding the original autoresearch did not need to develop. Qwen 3 was deterministic at temperature zero. Gemma 4 is not. The 0.000 stdev that defined the original hardening was a property of the model, not a property of the methodology. The methodology applied to gemma4 surfaces real variance that the previous methodology hid.

The split (extraction gemma4, scoring qwen3:32b) was restored on 2026-04-23. The autoresearch confirms it on 2026-04-25.

Investigation Trail
  • 2026-04-16 · Sandbox fleet on gemma4:31b drops FM-01 detection from 14/15 to 2/16
  • 2026-04-23 · Investigation traces cause to skeptic calibration mismatch. Split restored.
  • 2026-04-23 · Phase 1 autoresearch scaffolded.
  • 2026-04-24 · Phase 1 runs. 19 experiments, 6h40m.
  • 2026-04-24 to 25 · Phase 2 runs. 6 experiments, 4h46m.
  • 2026-04-25 · Split locked as permanent production configuration.