What this tests
The pipeline runs entirely on local inference — no cloud APIs. Model choice directly affects extraction speed, classification accuracy, scoring fidelity, and diagnostic depth. This race measures all four dimensions on a real entity with a large, noisy corpus.
At a Glance
Entity: Global Restaurant Chain (anonymized)
Corpus Size: 709 (7 center + 593 edge documents)
Data Quality: C (no employee reviews or job postings)
Total Runtime, Gemma 4: 4h 36m (16,578 seconds)
Total Runtime, Qwen3: 5h 12m (18,753 seconds)
Most Stressed Vertex, Gemma 4: Truth (omission)
Most Stressed Vertex, Qwen3: Authority (misalignment)
Both models identified real structural conditions in the same organization, but they stressed different vertices and told different stories about why. That divergence is itself a finding.
Speed
Gemma 4 was consistently faster across every pipeline phase except synthesis. On a DGX Spark running Ollama, it processed the same 709 documents 14% faster end-to-end.
Total Tokens: Gemma 4 1,065K · Qwen3 1,037K
Effective t/s: Gemma 4 ~66 t/s · Qwen3 ~57 t/s
Gemma 4's speed advantage comes from extraction — the longest phase. It sustains higher effective token throughput despite generating slightly more total tokens. Qwen3 was faster only at synthesis, where it produced a more compressed case summary.
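As a sanity check on the throughput figures, the sketch below divides each model's total token count by its end-to-end runtime from At a Glance. It assumes "Effective t/s" means generated tokens over wall-clock time; the result lands slightly below the reported figures, which suggests the published numbers exclude some non-generation time.

```python
# Sanity check: effective throughput as total generated tokens over end-to-end
# wall-clock runtime. All numbers are the ones reported in this comparison.
runs = {
    "Gemma 4": {"tokens": 1_065_000, "runtime_s": 16_578},
    "Qwen3":   {"tokens": 1_037_000, "runtime_s": 18_753},
}

for name, r in runs.items():
    print(f"{name}: {r['tokens'] / r['runtime_s']:.0f} tokens/s")
# Gemma 4: ~64 t/s, Qwen3: ~55 t/s, slightly below the ~66 / ~57 reported above,
# consistent with the published figures excluding some non-generation time.
```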
Extraction Quality
Both models extracted the same number of claims (23) from the 7 center documents. The divergence is in edge extraction and classification discipline.
Total Segments: Gemma 4 1,513 · Qwen3 2,109
Qwen3 extracted 35% more observations, particularly from customer reviews (985 vs 782) and social media (947 vs 657). But it also produced 72 segments it couldn't classify — all in social media and news. Gemma 4 had zero classification failures across the entire corpus.
More extraction isn't necessarily better. The question is whether the additional volume improves diagnostic signal or adds noise. In this case, Qwen3's extra observations did not produce more sustained findings or higher scores. The additional volume made the corpus noisier without sharpening the diagnosis.
Source Breakdown (Gemma 4 / Qwen3)
Press Releases: 14 claims / 17 claims
Customer Reviews: 782 obs / 985 obs
Social Media: 657 obs, 0 unclassified / 947 obs, 22 unclassified
News Coverage: 51 obs, 0 unclassified / 82 obs, 50 unclassified
Scoring
Both runs used the finding_derived_v1 scoring mode with agent-powered vertex assessment and adversarial debate. Same caps were applied to both — limited center source types, no job postings, no employee reviews in the edge corpus.
Truth: Gemma 4 0.496 (stressed: omission) · Qwen3 0.450 (stressed: omission)
Authority: Gemma 4 0.500 (stressed: compression) · Qwen3 0.410 (stressed: misalignment)
Continuity: deferred for both models (single period)
The 0.066 gap in overall score is meaningful at this scale. But the more interesting divergence is in what each model identified as the primary stress point. Gemma 4 saw Truth under omission — the center narrative talks digital innovation but ignores labor conditions. Qwen3 saw Authority under misalignment — no single point of accountability for app or customer support.
Both reads are defensible. The same organization can have both conditions simultaneously. The question for pipeline development is whether the model should converge on the same diagnosis, or whether model-level divergence itself reveals complementary structural signals.
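The 0.066 gap cited above matches the overall figures in the Verdict (0.498 vs 0.432). With Continuity deferred, a simple mean of the two scored vertices reproduces Gemma 4's overall exactly and lands within 0.002 of Qwen3's; the pipeline's exact aggregation is not published, so treat the sketch below as an approximation:

```python
# Approximate reconstruction of the overall scores (Continuity deferred, so only
# Truth and Authority contribute). A simple mean is assumed here.
gemma = {"truth": 0.496, "authority": 0.500}
qwen  = {"truth": 0.450, "authority": 0.410}

print(f"{sum(gemma.values()) / 2:.3f}")   # 0.498, matches the Verdict exactly
print(f"{sum(qwen.values()) / 2:.3f}")    # 0.430, where the Verdict reports 0.432
print(f"{0.498 - 0.432:.3f}")             # 0.066, the overall-score gap cited above
```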
Skeptic Debate
Both models proposed 5 initial findings and sustained 2 through 2 rounds of adversarial debate, an identical sustain rate of 40%.
Debate Tokens: Gemma 4 4,001 · Qwen3 6,250
Gemma 4 pre-filtered 1 finding before debate, catching a weak candidate early. Qwen3 sent all 5 to the full debate process and used 56% more tokens to reach the same conclusion. In adversarial reasoning, Gemma 4 is more efficient — tighter arguments, less verbosity.
Diagnostic Depth
Sustained Findings
Gemma 4 — Truth: Alignment
Customers report the mobile app is intuitive and easy to use, aligning with the company's stated focus on digital platforms.
Strength: Moderate · Vertex: Truth · Dimension: Alignment · Scope: Product
Gemma 4 — Truth: Omission
Edge observations report frontline employees performing expanded responsibilities without corresponding title or compensation adjustments, yet the center narrative is silent on internal labor conditions.
Strength: Strong · Vertex: Truth · Dimension: Omission · Scope: Labor
Qwen3 — Truth: Omission
Numerous complaints about app and customer support issues exist, but no center claims address these support-related concerns.
Strength: Moderate · Vertex: Truth · Dimension: Omission · Scope: Product
Qwen3 — Authority: Misalignment
Operations ownership is diffused across multiple roles and teams, with no single point of accountability for app functionality or customer support, leading to inconsistent and unresolved issues.
Strength: Moderate · Vertex: Authority · Dimension: Misalignment · Scope: Operations
Gemma 4 found both an alignment (what works) and an omission (what's hidden). The labor finding — frontline employees overworked without recognition — is a deeper structural insight that Qwen3 missed entirely. Qwen3's authority finding about diffused ownership is valid but more surface-level.
Field Notes
Gemma 4 generated 8 field notes to Qwen3's 6. The unique and shared signals break down as follows:
Gemma 4 only · Metric displacement: 1 metric mentioned against 179 complaints — performance metrics may be replacing actual outcomes.
Gemma 4 only · Horizon collapse: 4 short-term mentions against 40 strategy observations — planning horizon may be shrinking.
Gemma 4 only · Now-over-later: long-term consequences acknowledged but not weighted in decisions.
Qwen3 only · Signal classification gap: 72 unclassified segments (12% rate) — a self-diagnostic artifact, not an organizational signal.
Both · Context gap, defensive escalation, and a concentrated complaint pattern in product — detected by both models with comparable evidence counts.
Observation
Qwen3's unique field note — "signal classification gap" — is a reflection of its own extraction failures, not an organizational finding. Gemma 4's unique field notes (metric displacement, horizon collapse, now-over-later) are all structural signals about the organization itself. This is the clearest quality gap in the comparison.
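The count-based field notes above (metric displacement, horizon collapse) are essentially ratio signals. The report does not publish the detection logic, so the sketch below is a hypothetical illustration of how such a heuristic could work; the threshold is invented for illustration.

```python
# Hypothetical illustration only, not the pipeline's actual field-note logic.
# A count-ratio heuristic like this could flag the two signals named above.
def ratio_signal(name: str, rare_count: int, baseline_count: int,
                 threshold: float = 0.15) -> bool:
    """Flag when the 'rare' mention count is small relative to its baseline."""
    flagged = baseline_count > 0 and rare_count / baseline_count <= threshold
    status = "flagged" if flagged else "ok"
    print(f"{name}: {rare_count} vs {baseline_count} -> {status}")
    return flagged

ratio_signal("metric displacement", 1, 179)   # 1 metric mention vs 179 complaints
ratio_signal("horizon collapse", 4, 40)       # 4 short-term mentions vs 40 strategy obs
```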
Verdict
Pipeline Recommendation
Gemma 4 31B is the stronger pipeline model.
14% faster end-to-end (driven by extraction), zero classification failures, higher scores, richer field notes, and a deeper diagnostic insight (the labor omission). Qwen3's volume advantage — 35% more observations — did not translate to better findings. The extra observations added noise without improving signal. The 0.066 overall score gap is meaningful at pipeline scale.
Speed: Gemma 4 (14% faster end-to-end)
Precision: Gemma 4 (zero unclassified segments)
Volume: Qwen3 (35% more observations)
Scoring: Gemma 4 (0.498 vs 0.432)
Diagnostic Depth: Gemma 4 (8 field notes, labor insight)
Debate Efficiency: Gemma 4 (37% fewer debate tokens)
The one thing Qwen3 did differently — not better, differently — is that it stressed Authority where Gemma 4 stressed Truth. In a production context, running both models on the same entity and comparing vertex stress could surface complementary conditions. But if you're picking one model for fleet runs, Gemma 4 wins on every axis that matters for pipeline reliability.
Methodology
Pipeline: Coherence External Diagnostic Pipeline v0.1.0, agent-powered extraction and scoring
Scoring Mode: finding_derived_v1 with 2-round adversarial skeptic debate
Infrastructure: DGX Spark (NVIDIA GB10, 128GB unified), Ollama local inference
Corpus: 709 documents (press releases, customer reviews, social media, news coverage), collected 2026-03-27
Controls: Same entity, same corpus, same prompts, same scoring pipeline; only the inference model differed
Execution: Both runs executed simultaneously on parallel DGX Sparks (Apr 7), Gemma 4 on Spark 1 and Qwen3 on Spark 2, with identical hardware and concurrent execution
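For reproducibility, the control described above amounts to sending the same prompt with the same decoding options to two Ollama models and comparing the outputs. A minimal sketch, assuming the Ollama Python client; the model tags and prompt below are placeholders, not the pipeline's actual identifiers.

```python
# Minimal sketch of the A/B control: identical prompt, identical decoding options,
# only the model tag differs. Model tags are placeholders, not the exact tags used.
import ollama

MODELS = ["gemma-placeholder", "qwen3-placeholder"]
prompt = "Extract claims from the following press release: ..."   # same prompt for both

for model in MODELS:
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0},   # hold decoding settings constant across models
    )
    print(model, response["message"]["content"][:200])
```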