What this tests
The pipeline runs entirely on local inference — no cloud APIs. Model choice directly affects extraction speed, classification accuracy, scoring fidelity, and diagnostic depth. This race measures all four dimensions on a real entity with a large, noisy corpus.
At a Glance
Entity: Global Restaurant Chain (anonymized)
Corpus Size: 709 (7 center + 593 edge documents)
Data Quality: C (no employee reviews or job postings)
Total Runtime, Gemma 4: 4h 36m (16,578 seconds)
Total Runtime, Qwen3: 5h 12m (18,753 seconds)
Most Stressed Vertex, Gemma 4: Truth (omission)
Most Stressed Vertex, Qwen3: Authority (misalignment)
Both models identified real structural conditions in the same organization, but they stressed different vertices and told different stories about why. That divergence is itself a finding.
Speed
Gemma 4 was consistently faster across every pipeline phase except synthesis. On a DGX Spark running Ollama, it processed the same 709 documents 14% faster end-to-end.
Total Tokens: Gemma 4 1,065K · Qwen3 1,037K
Effective t/s: Gemma 4 ~66 t/s · Qwen3 ~57 t/s
Gemma 4's speed advantage comes from extraction — the longest phase. It sustains higher effective token throughput despite generating slightly more total tokens. Qwen3 was faster only at synthesis, where it produced a more compressed case summary.
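As a sanity check on the throughput figures, the sketch below divides each model's total token count by its end-to-end runtime from At a Glance. It assumes "Effective t/s" means generated tokens over wall-clock time; the result lands slightly below the reported figures, which suggests the published numbers exclude some non-generation time.

```python
# Sanity check: effective throughput as total generated tokens over end-to-end
# wall-clock runtime. All numbers are the ones reported in this comparison.
runs = {
    "Gemma 4": {"tokens": 1_065_000, "runtime_s": 16_578},
    "Qwen3":   {"tokens": 1_037_000, "runtime_s": 18_753},
}

for name, r in runs.items():
    print(f"{name}: {r['tokens'] / r['runtime_s']:.0f} tokens/s")
# Gemma 4: ~64 t/s, Qwen3: ~55 t/s, slightly below the ~66 / ~57 reported above,
# consistent with the published figures excluding some non-generation time.
```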
Extraction Quality
Both models extracted the same number of claims (23) from the 7 center documents. The divergence is in edge extraction and classification discipline.
Total Segments: Gemma 4 1,513 · Qwen3 2,109
Qwen3 extracted 35% more observations, particularly from customer reviews (985 vs 782) and social media (947 vs 657). But it also produced 72 segments it couldn't classify — all in social media and news. Gemma 4 had zero classification failures across the entire corpus.
More extraction isn't necessarily better. The question is whether the additional volume improves diagnostic signal or adds noise. In this case, Qwen3's extra observations did not produce more sustained findings or higher scores. The additional volume made the corpus noisier without sharpening the diagnosis.
Source Breakdown (Gemma 4 / Qwen3)
Press Releases: 14 claims / 17 claims
Customer Reviews: 782 obs / 985 obs
Social Media: 657 obs, 0 unclassified / 947 obs, 22 unclassified
News Coverage: 51 obs, 0 unclassified / 82 obs, 50 unclassified
Scoring
Both runs used the finding_derived_v1 scoring mode with agent-powered vertex assessment and adversarial debate. Same caps were applied to both — limited center source types, no job postings, no employee reviews in the edge corpus.
Truth: Gemma 4 0.496 (stressed: omission) · Qwen3 0.450 (stressed: omission)
Authority: Gemma 4 0.500 (stressed: compression) · Qwen3 0.410 (stressed: misalignment)
Continuity: deferred for both models (single period)
The 0.066 gap in overall score is meaningful at this scale. But the more interesting divergence is in what each model identified as the primary stress point. Gemma 4 saw Truth under omission — the center narrative talks digital innovation but ignores labor conditions. Qwen3 saw Authority under misalignment — no single point of accountability for app or customer support.
Both reads are defensible. The same organization can have both conditions simultaneously. The question for pipeline development is whether the model should converge on the same diagnosis, or whether model-level divergence itself reveals complementary structural signals.
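The 0.066 gap cited above matches the overall figures in the Verdict (0.498 vs 0.432). With Continuity deferred, a simple mean of the two scored vertices reproduces Gemma 4's overall exactly and lands within 0.002 of Qwen3's; the pipeline's exact aggregation is not published, so treat the sketch below as an approximation:

```python
# Approximate reconstruction of the overall scores (Continuity deferred, so only
# Truth and Authority contribute). A simple mean is assumed here.
gemma = {"truth": 0.496, "authority": 0.500}
qwen  = {"truth": 0.450, "authority": 0.410}

print(f"{sum(gemma.values()) / 2:.3f}")   # 0.498, matches the Verdict exactly
print(f"{sum(qwen.values()) / 2:.3f}")    # 0.430, where the Verdict reports 0.432
print(f"{0.498 - 0.432:.3f}")             # 0.066, the overall-score gap cited above
```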
Skeptic Debate
Both models proposed 5 initial findings and sustained 2 through 2 rounds of adversarial debate, an identical sustain rate of 40%.
Debate Tokens: Gemma 4 4,001 · Qwen3 6,250
Gemma 4 pre-filtered 1 finding before debate, catching a weak candidate early. Qwen3 sent all 5 to the full debate process and used 56% more tokens to reach the same conclusion. In adversarial reasoning, Gemma 4 is more efficient — tighter arguments, less verbosity.
Diagnostic Depth
Sustained Findings
Gemma 4 — Truth: Alignment
Customers report the mobile app is intuitive and easy to use, aligning with the company's stated focus on digital platforms.
Strength: Moderate · Vertex: Truth · Dimension: Alignment · Scope: Product
Gemma 4 — Truth: Omission
Edge observations report frontline employees performing expanded responsibilities without corresponding title or compensation adjustments, yet the center narrative is silent on internal labor conditions.
Strength: Strong · Vertex: Truth · Dimension: Omission · Scope: Labor
Qwen3 — Truth: Omission
Numerous complaints about app and customer support issues exist, but no center claims address these support-related concerns.
Strength: Moderate · Vertex: Truth · Dimension: Omission · Scope: Product
Qwen3 — Authority: Misalignment
Operations ownership is diffused across multiple roles and teams, with no single point of accountability for app functionality or customer support, leading to inconsistent and unresolved issues.
Strength: Moderate · Vertex: Authority · Dimension: Misalignment · Scope: Operations
Gemma 4 found both an alignment (what works) and an omission (what's hidden). The labor finding — frontline employees overworked without recognition — is a deeper structural insight that Qwen3 missed entirely. Qwen3's authority finding about diffused ownership is valid but more surface-level.
Field Notes
Gemma 4 generated 8 field notes to Qwen3's 6. The unique and shared signals break down as follows:
Gemma 4 only · Metric displacement: 1 metric mentioned against 179 complaints — performance metrics may be replacing actual outcomes.
Gemma 4 only · Horizon collapse: 4 short-term mentions against 40 strategy observations — planning horizon may be shrinking.
Gemma 4 only · Now-over-later: long-term consequences acknowledged but not weighted in decisions.
Qwen3 only · Signal classification gap: 72 unclassified segments (12% rate) — a self-diagnostic artifact, not an organizational signal.
Both · Context gap, defensive escalation, and a concentrated complaint pattern in product — detected by both models with comparable evidence counts.
Observation
Qwen3's unique field note — "signal classification gap" — is a reflection of its own extraction failures, not an organizational finding. Gemma 4's unique field notes (metric displacement, horizon collapse, now-over-later) are all structural signals about the organization itself. This is the clearest quality gap in the comparison.
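The count-based field notes above (metric displacement, horizon collapse) are essentially ratio signals. The report does not publish the detection logic, so the sketch below is a hypothetical illustration of how such a heuristic could work; the threshold is invented for illustration.

```python
# Hypothetical illustration only, not the pipeline's actual field-note logic.
# A count-ratio heuristic like this could flag the two signals named above.
def ratio_signal(name: str, rare_count: int, baseline_count: int,
                 threshold: float = 0.15) -> bool:
    """Flag when the 'rare' mention count is small relative to its baseline."""
    flagged = baseline_count > 0 and rare_count / baseline_count <= threshold
    status = "flagged" if flagged else "ok"
    print(f"{name}: {rare_count} vs {baseline_count} -> {status}")
    return flagged

ratio_signal("metric displacement", 1, 179)   # 1 metric mention vs 179 complaints
ratio_signal("horizon collapse", 4, 40)       # 4 short-term mentions vs 40 strategy obs
```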
Verdict
Pipeline Recommendation
Gemma 4 31B is the stronger pipeline model.
14% faster end-to-end (driven by extraction), zero classification failures, higher scores, richer field notes, and a deeper diagnostic insight (the labor omission). Qwen3's volume advantage — 35% more observations — did not translate to better findings. The extra observations added noise without improving signal. The 0.066 overall score gap is meaningful at pipeline scale.
Speed: Gemma 4 (14% faster end-to-end)
Precision: Gemma 4 (zero unclassified segments)
Volume: Qwen3 (35% more observations)
Scoring: Gemma 4 (0.498 vs 0.432)
Diagnostic Depth: Gemma 4 (8 field notes, labor insight)
Debate Efficiency: Gemma 4 (37% fewer debate tokens)
The one thing Qwen3 did differently — not better, differently — is that it stressed Authority where Gemma 4 stressed Truth. In a production context, running both models on the same entity and comparing vertex stress could surface complementary conditions. But if you're picking one model for fleet runs, Gemma 4 wins on every axis that matters for pipeline reliability.
Methodology
Pipeline: Coherence External Diagnostic Pipeline v0.1.0, agent-powered extraction and scoring
Scoring Mode: finding_derived_v1 with 2-round adversarial skeptic debate
Infrastructure: DGX Spark (NVIDIA GB10, 128GB unified), Ollama local inference
Corpus: 709 documents (press releases, customer reviews, social media, news coverage), collected 2026-03-27
Controls: Same entity, same corpus, same prompts, same scoring pipeline; only the inference model differed
Execution: Both runs executed simultaneously on parallel DGX Sparks (Apr 7), Gemma 4 on Spark 1 and Qwen3 on Spark 2, with identical hardware and concurrent execution
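For reproducibility, the control described above amounts to sending the same prompt with the same decoding options to two Ollama models and comparing the outputs. A minimal sketch, assuming the Ollama Python client; the model tags and prompt below are placeholders, not the pipeline's actual identifiers.

```python
# Minimal sketch of the A/B control: identical prompt, identical decoding options,
# only the model tag differs. Model tags are placeholders, not the exact tags used.
import ollama

MODELS = ["gemma-placeholder", "qwen3-placeholder"]
prompt = "Extract claims from the following press release: ..."   # same prompt for both

for model in MODELS:
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0},   # hold decoding settings constant across models
    )
    print(model, response["message"]["content"][:200])
```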