Lab
Pipeline Hardening

Autoresearch:
73 Experiments to 0.000

Systematic pipeline hardening through adversarial self-research. 4 research tracks, 30 hours of distributed compute across 3 nodes, and an emergent result nobody planned for: perfect reproducibility.

Experiments: 73 · Tracks: 4 · Stdev: 0.000 · Scoring Gain: 11.5x · Extraction Gain: 1.6x · Compute: ~30h

When the System Found Itself

One of the failure modes our framework describes is Metric Shadowing: when a measurement becomes the target, it ceases to be useful. The pipeline diagnosed this condition on itself during Track 1.

What Happened

The v1 scoring metric rewarded throughput. The system optimized for the highest score, but the configurations that scored highest had weakened the adversarial agent so much that it was barely challenging anything. The metric went up. The quality went down.

Why This Matters

The pipeline diagnosed metric shadowing on itself. The same framework built to detect this condition in organizations caught it in its own scoring logic.

The fix was structural: replace the metric entirely. The v2 correctness metric uses reproducibility as a gate and verifies that evidence actually supports each conclusion. You cannot game evidence verification by weakening the adversarial agent.

The Autoresearch Loop

Step 1: Hypothesis. Agent proposes a specific change to the scoring or extraction logic. One variable at a time.
Step 2: Modification. Configuration modified programmatically. Diff recorded.
Step 3: Triple-Blind Trial. 3 independent trials against the same entity corpus. Same evidence, different scoring runs.
Step 4: Measurement. Mean, stdev, and component metrics computed and logged to JSONL.
Step 5: Decision. Score improves → keep. Regresses → revert. Propose next hypothesis. A sketch of the full loop follows below.
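
A minimal sketch of one iteration, with the pipeline and scorer passed in as callables; `run_pipeline` and `score_run` are hypothetical stand-ins rather than the pipeline's real interfaces, and the JSONL fields are illustrative.

```python
import json
import statistics

def autoresearch_step(experiment_id, config, change, corpus,
                      run_pipeline, score_run, best_score,
                      log_path="experiments.jsonl"):
    """One loop iteration: apply a single change, run three trials against the
    same corpus, log the measurements to JSONL, then keep or revert."""
    candidate = dict(config, **change)      # Step 2: one variable changed at a time
    scores = [score_run(run_pipeline(candidate, corpus))
              for _ in range(3)]            # Step 3: three trials, same evidence

    record = {                              # Step 4: measurement
        "id": experiment_id,
        "change": change,
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
        "trials": scores,
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")

    if record["mean"] > best_score:         # Step 5: keep only on improvement
        return candidate, record["mean"]
    return config, best_score               # regression or no change: revert
```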
The Correctness Metric (v2)

The v2 metric is a composite that gates on reproducibility first, then evaluates evidence quality and diagnostic significance. It rewards fewer, higher-quality findings over a larger volume of weaker ones.
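
A structural sketch only: the production components and weights are not published here, and the field names (`evidence_score`, `diagnostic_weight`) are assumptions.

```python
def correctness_v2(findings, reproducible):
    """Composite correctness, mirroring the structure described above:
    reproducibility is a hard gate, then each finding contributes its
    evidence-verification score weighted by diagnostic significance.
    Averaging rather than summing rewards fewer, stronger findings."""
    if not reproducible or not findings:
        return 0.0
    per_finding = [f["evidence_score"] * f["diagnostic_weight"] for f in findings]
    return sum(per_finding) / len(per_finding)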

Compute Distribution

Experiment Distribution Across Tracks & Devices
  • Spark 1 (DGX): Scoring v1 (13) + Scoring v2 (41) = 54 experiments
  • Spark 2 (DGX): Extraction (19 experiments)
  • M2 Studio: Cross-entity validation (9 trials)
  • Total compute: ~30 hours across 3 nodes, all local qwen3:32b inference

Track 1: The Throughput Metric

Scoring V1 · 13 Experiments · Spark 1
Scoring V1 Trajectory
Gold line = composite score. Blue bars = skeptic throughput. Red markers = experiments where the skeptic was too lenient. Most changes had zero measurable effect. The metric was abandoned after ar_013.

Track 2: The Correctness Metric

Scoring V2 · 41 Experiments · Spark 1
Correctness Score Across 41 Experiments
Phases: Foundation (cr_001–012), Breakthroughs (cr_013–025), Maximum Extraction (cr_026–041). Dashed line = running best. 0.058 baseline → 0.667 ceiling = 11.5x improvement.

Four Phases of Discovery

Phase 1: Foundation. Baseline 0.058. Explored evidence-first processes, citation verification, and dimension-specific matching. Incremental gains to 0.156 (2.7x).

Phase 2: Evidence Breakthrough. A single change to how the model grounds its reasoning in source evidence (cr_017) lifted evidence verification scores from ~0.17 to 0.50. Score: 0.233, a 4x improvement over baseline.

Phase 3: Structural Concentration. Concentrating the scoring on fewer diagnostic dimensions (cr_024) pushed the score to 0.400. Fewer findings of higher quality beat more findings of mixed quality.

Phase 4: Maximum Extraction. Five different experiments reached the 0.667 ceiling through different mechanisms but with the same outcome: one perfect finding with maximum diagnostic weight and verified evidence.

The Verification Bottleneck

Evidence Score vs. Correctness
Evidence verification is the primary driver of correctness. The bottleneck is not reasoning—it is whether evidence actually supports the conclusion.
Sustained Findings vs. Diagnostic Weight
As quality concentrated into fewer findings, diagnostic weight rose. The 0.667 cluster sustains 1 finding at maximum weight.

Track 3: Extraction

19 Experiments · Spark 2
Extraction Score vs. Total Items Extracted
Inverse relationship. Baseline: 57 items, 0.283 score. Optimal (arx_017): 45 items, 0.450 score. Star = best result with calibrated extraction limits.

98% of extracted items are never cited in any finding. Broad extraction ensures coverage; aggressive filtering ensures quality.

Some document types are scarce but high-signal. Setting asymmetric extraction limits per source type preserved those signals while reducing noise from higher-volume sources.
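
A sketch of how asymmetric limits might be expressed; the source types and caps below are invented for illustration and are not the tuned values.

```python
# Hypothetical per-source-type caps; the real categories and limits are tuned per corpus.
EXTRACTION_LIMITS = {
    "filing": 20,          # scarce but high-signal: generous cap
    "press_release": 5,    # high-volume, low-signal: aggressive cap
    "news_article": 8,
    "default": 10,
}

def cap_extractions(items_by_source):
    """Keep the highest-confidence items per source type, up to that type's cap."""
    kept = []
    for source_type, items in items_by_source.items():
        limit = EXTRACTION_LIMITS.get(source_type, EXTRACTION_LIMITS["default"])
        ranked = sorted(items, key=lambda item: item["confidence"], reverse=True)
        kept.extend(ranked[:limit])
    return kept
```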

Extraction Score Trajectory
Red markers = downstream penalties from overly rigid extractions. Pattern: upstream rigidity cascades into downstream scoring failures. The extraction and scoring systems are coupled.

The Bullwhip Effect in Diagnostics

In supply chains, small fluctuations in retail demand become large order swings at the factory. The same pattern appears in the pipeline: upstream extraction instability cascades into downstream scoring failures.

Score Variance by Pipeline Stage Across Experiments
Each stage shows the range (min–max) and interquartile spread of scores across all experiments. Variance amplifies at each downstream stage. The adversarial agent acts as a variance dampener, collapsing the spread at the output boundary.
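
A sketch of the per-stage spread calculation behind this chart, assuming each JSONL record carries a `stage_scores` map keyed by stage name (an assumption about the log schema, as are the stage names).

```python
import json
from statistics import quantiles

def stage_spread(log_path="experiments.jsonl",
                 stages=("extraction", "scoring", "adversarial")):
    """Range (min-max) and interquartile spread of per-stage scores
    across all logged experiments."""
    by_stage = {stage: [] for stage in stages}
    with open(log_path) as log:
        for line in log:
            record = json.loads(line)
            for stage, score in record.get("stage_scores", {}).items():
                if stage in by_stage:
                    by_stage[stage].append(score)

    spread = {}
    for stage, scores in by_stage.items():
        if len(scores) < 2:
            continue
        q1, _, q3 = quantiles(scores, n=4)
        spread[stage] = {"min": min(scores), "max": max(scores), "iqr": q3 - q1}
    return spread
```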
The Bullwhip Pattern

A 1.6x extraction improvement produced an 11.5x scoring improvement. Small upstream changes create large downstream effects.

Cascade Evidence

4 of 19 extraction experiments triggered downstream penalties. Rigid or verbose extractions didn't just score poorly — they poisoned the scoring stage.

It's Not the Model — It's the System

Each infrastructure layer compounds performance. The same model (qwen3:32b) with different system layers produces dramatically different results.

Correctness by Setup Layer (Same Model: qwen3:32b)
Each bar = best result at that layer from 73 experiments. Single model throughout. Zero API cost. The 11.5x improvement comes entirely from system-level engineering rather than the model itself, and it is portable across models.

Track 4: Cross-Entity Validation

9 Trials · M2 Studio
Cross-Entity Correctness Scores
Three entities, three sectors, three evidence profiles. Reproducibility is perfect across all. Score differences reflect evidence quality, not overfitting.
Sustained Count Stdev: 0.0 · Mean Reproducibility: 0.997 · Entities Confirmed Stable: 3/3

Perfect Reproducibility

Standard Deviation Across All 73 Experiments = 0.000
Every experiment produced identical results across all trials. The system is fully deterministic given the same inputs. This was not a design goal—it is an emergent property of careful system engineering and fixed evidence corpora.
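
A minimal check against the experiment log, reusing the hypothetical `trials` field from the loop sketch above; it recomputes per-experiment stdev and flags anything nonzero.

```python
import json
from statistics import pstdev

def confirm_reproducibility(log_path="experiments.jsonl"):
    """Recompute per-experiment stdev from logged trial scores;
    an empty return means 0.000 stdev everywhere."""
    unstable = []
    with open(log_path) as log:
        for line in log:
            record = json.loads(line)
            if pstdev(record["trials"]) > 0.0:
                unstable.append(record["id"])
    return unstable
```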

What the System Taught Us

  • 1. Evidence verification is the bottleneck, not reasoning. A single change to how evidence is grounded produced the largest single improvement (4x). The model reasons well—it just doesn't naturally anchor reasoning to source material.
  • 2. Constraints hurt more than they help. Rigid formatting and structural constraints produced regressions in 11 experiments. Tell the model what to attend to, not how to format.
  • 3. Not all findings carry equal weight. Some categories of findings are diagnostically inert. Removing low-value categories entirely produced the highest scores.
  • 4. Extraction has diminishing returns. 1.6x extraction improvement vs. 11.5x scoring improvement. Once extraction quality crosses a threshold, further upstream optimization yields minimal downstream gain.
  • 5. The metric itself requires scrutiny. The v2 metric rewards one perfect finding over several good ones. Whether to optimize for depth or breadth is a research decision that should be made deliberately.

Where the Leverage Is

Scoring vs. Extraction Improvement
Scoring changes dominate. Resource allocation should reflect the leverage difference.
Scoring Improvement: 11.5x · Extraction Improvement: 1.6x

What Ships

Scoring Configuration: cr_021 era · 0.267 correctness
Extraction Configuration: arx_017 · 0.450 extraction score

Why cr_021 and not cr_032? The 0.667 experiments achieve their scores through extreme concentration: a single finding. The cr_021 config produces 3 sustained findings across multiple dimensions. It is the more diagnostically useful output.

  • Scoring: Rewritten to produce concise, evidence-grounded findings with calibrated quality thresholds.
  • Extraction: Asymmetric limits tuned per source type. Adversarial filtering. Calibrated confidence thresholds.
  • Metric: Production scoring standard locked in from this research.
  • Reproducibility: 0.000 stdev confirmed across all configs and entities.
AR-001
“Automation may observe, measure, and report. It may not decide, approve, or publish.”