Multi-Agent Architecture

The Coherence Diagnostic Pipeline

Six autonomous agents. Adversarial debate. Deterministic scoring. All local inference. No cloud dependency for pipeline execution.

6 Autonomous Agents
4 Pipeline Stages
0.000 Score Stdev
133 Production Runs
15 Entities
01 Collect → JSONL → 02 Extract → Claims + Observations → 03 Score → 04 Synthesize → Case File
Agent 01
The Collector
Gathers raw center + edge documents from 7 public source types per entity.
CENTER press releases, job postings, earnings transcripts
EDGE customer reviews, social media, news, employee reviews
OUTPUT ~1,800 docs/entity → JSONL
DEDUP content-level, ~170 lines
TOOLS Perplexity Computer + custom collectors
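The content-level dedup stage could look like the following sketch. The function name, record fields, and normalization (lowercase, collapsed whitespace) are assumptions; the source states only that dedup is content-level and spans roughly 170 lines:

```python
import hashlib

def dedup_by_content(records):
    """Content-level dedup sketch: drop records whose normalized text
    hashes identically. The exact normalization used by the pipeline
    is not documented; this is one plausible choice."""
    seen, unique = set(), []
    for rec in records:
        # Collapse whitespace and case so trivially reformatted copies collide.
        normalized = " ".join(rec["text"].lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Hashing normalized content rather than raw bytes means the same press release scraped from two mirrors with different whitespace still collapses to one record.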
Agent 02
The Extractor
Reads each document individually. Produces structured claims (center) and observations (edge).
INPUT raw JSONL, batch_size=1
CENTER → structured claims
EDGE → structured observations
OUTPUT ~6,100 claims + observations
MODULE agent_extract.py
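A minimal sketch of the batch_size=1 routing, with hypothetical field names (`source_class`, `entity`, `kind`); the actual prompts and schemas live in agent_extract.py and are not reproduced here:

```python
def extract_document(doc):
    """Per-document extraction contract: center sources yield claims,
    edge sources yield observations. Field names are assumptions."""
    kind = "claim" if doc["source_class"] == "center" else "observation"
    return {"kind": kind, "entity": doc["entity"], "text": doc["text"]}

def extract_all(docs):
    # One call per document keeps each extraction independent (batch_size=1).
    return [extract_document(d) for d in docs]
```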
Agents 03–05 · The Adversarial Core
Agent 03a
Truth Diagnostician
Scores center-edge alignment. Three finding types: alignment, omission, contradiction.
WEIGHT 55% of overall
CAP 3 findings (hard)
Agent 03b
Authority Diagnostician
Scores voice/power coherence. Three finding types: compression, diffusion, misalignment.
WEIGHT 45% of overall
CAP 3 findings (hard)
Agent 04 · The Skeptic
Finding submitted with evidence citations
Skeptic challenges — evidence quality, circular reasoning, unsupported confidence
Diagnostician rebuts — defends with specific evidence
Verdict rendered for each of 6 findings
Sustained → Evidence Ledger
Rejected → Rejection Log
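The verdict routing above reduces to plain control flow. In this sketch the skeptic, which is really an LLM debate over evidence quality, is stubbed as any boolean-returning callable:

```python
def adjudicate(findings, skeptic):
    """Route each finding by the skeptic's verdict:
    sustained → evidence ledger, rejected → rejection log.
    `skeptic` is any callable returning True (sustained) or False (rejected)."""
    ledger, rejection_log = [], []
    for finding in findings:
        (ledger if skeptic(finding) else rejection_log).append(finding)
    return ledger, rejection_log
```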
truth = alignment(+0.45) + omission(-0.30) + contradiction(-0.50)
authority = compression(-0.25) + diffusion(-0.18) + misalignment(-0.22)
overall = truth × 0.55 + authority × 0.45
strength: strong(0.40) · moderate(0.25) · weak(0.10)
0.448 · σ = 0.000 (deterministic)
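The published weights translate directly into a deterministic scoring function. One assumption in this sketch: each finding's direction weight is scaled multiplicatively by its strength weight, which the two weight tables suggest but the source does not state outright:

```python
# Direction weights per finding type, from the published formulas.
TRUTH_W = {"alignment": +0.45, "omission": -0.30, "contradiction": -0.50}
AUTH_W = {"compression": -0.25, "diffusion": -0.18, "misalignment": -0.22}
# Evidence-strength multipliers, from the published strength table.
STRENGTH = {"strong": 0.40, "moderate": 0.25, "weak": 0.10}

def overall_score(findings):
    """Deterministic score over sustained findings (at most 3 per
    diagnostician): overall = truth * 0.55 + authority * 0.45."""
    truth = sum(TRUTH_W[f["type"]] * STRENGTH[f["strength"]]
                for f in findings if f["type"] in TRUTH_W)
    authority = sum(AUTH_W[f["type"]] * STRENGTH[f["strength"]]
                    for f in findings if f["type"] in AUTH_W)
    return truth * 0.55 + authority * 0.45
```

Because every term is a fixed table lookup, repeated runs over the same sustained findings produce identical scores, which is what the σ = 0.000 figure reflects.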
Agent 06
The Recorder
Logs every run to the performance ledger. Maintains evidence chains. Writes case files and summaries.
CASE FILE case_file.json
SUMMARY case_summary.md
LEDGER ledger.json (133 entries)
TOKENS aggregate_tokens.py
HYGIENE post_run_hygiene.sh
NOTIFY operator_inbox.jsonl
Agent 00 · Human
The Architect
Designs pipeline, writes prompts, debugs failures, makes all decisions. The only entity with decision authority. Agents observe and suggest. The Architect decides.
RULE AR-001: Automation may observe, summarize, and suggest. Automation may not decide.
SCOPE prompt design, architecture, failure triage, all external decisions
Agent 05
The Validator
Entity resolution, schema validation, provenance hash checks. Runs inline between extraction and scoring.
CHECK schema conformance
CHECK provenance hashes
CHECK entity resolution
GATE blocks malformed data
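The provenance gate could be as simple as recomputing a content hash and comparing it to the hash recorded at collection time. Field names (`text`, `provenance_hash`) and the hash algorithm are assumptions; the source states only that provenance hashes are checked:

```python
import hashlib

def check_provenance(doc):
    """Gate check sketch: recompute the content hash and compare it to
    the value recorded at collection time. A mismatch means the document
    was altered after collection and must be blocked."""
    digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
    return digest == doc["provenance_hash"]
```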
Compute Infrastructure — All Local, No Cloud Dependency
Spark 1 · NVIDIA DGX · Primary inference · Ops Center host
Spark 2 · NVIDIA DGX · Parallel fleet · Scoring
M2 Studio · Apple M2 Ultra · Embeddings · Agent host
M3 Ultra · Apple M3 Ultra · 72B vision · Image generation
NAS · Synology + NVMe · Shared filesystem · Source of truth
MacBook Pro · Orchestrator · SSH control plane · Dispatches, doesn't infer
Autoresearch — Continuous Prompt Optimization
Track 1 — Scoring
Scoring Prompt Optimization
Optimizes Truth, Authority, Skeptic, and Rebuttal prompts against a fixed proxy entity.
PROXY Nike run_079 (22 docs)
EXPERIMENTS 41+
BASELINE 0.058
BEST 0.091 (1.57×)
TIME/EXPERIMENT ~45 min (3 trials)
METRIC repro × stability × evidence × diagnostic
Track 2 — Extraction
Extraction Prompt Optimization
Optimizes source-specific extraction prompts. Key insight: fewer, sharper items beat volume.
PROXY Nike stratified (15 docs)
EXPERIMENTS 19+
METRIC sustain_rate × mean_strength
KEY FINDING 98% of extracted items were never cited
SOLUTION asymmetric limits; fewer, sharper items
TIME/EXPERIMENT ~20 min
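The Track 2 objective, sustain_rate × mean_strength, can be sketched directly. The per-item fields (`sustained`, numeric `strength`) are assumptions about how debate outcomes are fed back into the optimizer:

```python
def extraction_metric(items):
    """Track 2 objective: sustain_rate × mean_strength.
    Rewards prompts whose extracted items survive the Skeptic's debate
    with strong evidence, not prompts that maximize raw item volume."""
    if not items:
        return 0.0
    sustained = [i["strength"] for i in items if i["sustained"]]
    sustain_rate = len(sustained) / len(items)
    mean_strength = sum(sustained) / len(sustained) if sustained else 0.0
    return sustain_rate * mean_strength
```

A multiplicative objective like this makes the "fewer, sharper items" finding fall out naturally: padding the output with items that are never sustained drags sustain_rate toward zero.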
AR-001
"Automation may observe, summarize, and suggest. Automation may not decide."