Greenbaum Labs
Methodology

How the Instrument Works

The Coherence Scan is not a survey, a sentiment analysis, or a consulting framework. It is a measurement instrument with a defined evidence chain, deterministic scoring, and built-in constraints that prevent it from claiming more than it can prove.

Collect → Extract → Score → Debate → Synthesize

Evidence only. Public data, traceable to source.
Adversarial filtering. Skeptic rejects what can't be defended.
Deterministic scoring. Locked formula. Zero drift.
Confidence + caps. The instrument limits its own claims.

Design Philosophy

Core Principle
The lens is sophisticated because it needs to be. The math is simple because it should be.

Most diagnostic systems fail in one of two ways: they make interpretation simple and scoring complex, or they make both complex. The Coherence Scan separates these concerns deliberately.

Interpretation is hard. Extracting meaningful signal from SEC filings, employee reviews, customer complaints, and job postings requires a sophisticated lens. AI agents do this work, and an adversarial process keeps them honest.

Scoring is simple. Once findings survive debate, the math that converts them into scores is deterministic, locked, and auditable. No interpretation happens at the scoring layer. No weights shift between runs. The formula was calibrated against 74 scored runs and frozen.

This separation is the instrument's integrity mechanism. The agents can hallucinate or miss nuance — that's why the skeptic exists. But once a finding is sustained, the score it produces is the same score it will always produce. No drift. No judgment calls. No "it depends."

The Pipeline

Every scan follows the same five-stage process. No stage is optional. No stage can be skipped.

Collect
Automated collection from public sources: SEC filings, earnings transcripts, job postings, employee reviews, customer reviews, social media, news coverage, and regulatory signals. No proprietary data. No inside sources. No surveys.
Extract
AI agents read every document and extract two types of signal: claims (what the organization says about itself) and observations (what the world says about the organization). Each extraction is tagged with source, scope, polarity, and a traceable hash.
Score
Specialized agents assess Truth and Authority by comparing claims against observations. They produce structured findings with dimension classification, strength rating, and supporting evidence references. Scoring is deterministic from findings — the formula is locked.
Debate
An adversarial skeptic agent challenges every proposed finding across two rounds. It identifies logical flaws, evidence gaps, and unsupported conclusions. Only findings that survive both rounds enter the final assessment.
Synthesize
A synthesizer agent writes the final assessment from sustained findings only. It cannot reference anything the skeptic rejected.

Each stage produces artifacts that feed the next. Nothing is inferred. Nothing is assumed.
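As a minimal sketch of that contract, assuming hypothetical function names and stubbed-out stage bodies (the production interfaces are not published), the flow looks like this:

```python
# Minimal sketch of the five-stage contract. All names and stub bodies are
# hypothetical; the point is that each stage consumes only the artifacts of
# the stage before it.

def collect(entity: str) -> list[str]:
    """Stage 1: gather public documents (stubbed)."""
    return [f"10-K filing for {entity}", f"employee review of {entity}"]

def extract(documents: list[str]) -> list[dict]:
    """Stage 2: pull claims and observations out of each document."""
    return [{"doc": doc, "kind": "claim", "text": "..."} for doc in documents]

def propose_findings(extractions: list[dict]) -> list[dict]:
    """Stage 3: compare claims against observations, propose findings."""
    return [{"dimension": "truth", "strength": "moderate", "evidence": extractions}]

def debate(findings: list[dict]) -> list[dict]:
    """Stage 4: the skeptic sustains or rejects each finding (stubbed)."""
    return [f for f in findings if f["evidence"]]

def synthesize(sustained: list[dict]) -> str:
    """Stage 5: write the assessment from sustained findings only."""
    return f"assessment built from {len(sustained)} sustained finding(s)"

def run_scan(entity: str) -> str:
    return synthesize(debate(propose_findings(extract(collect(entity)))))

print(run_scan("ExampleCo"))
```

The design point is in the signatures: synthesize only ever receives what debate returned, so rejected material cannot resurface downstream.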

Evidence Lineage

Every number in a Coherence Scan traces back to a specific source document. The chain is fully auditable.

Source Document
A public document — an SEC filing, a customer review, an employee review, a job posting. Identified by source type, retrieval date, and content hash.
Extraction
AI agent reads the document and extracts claims or observations. Each is assigned an ID, classification, and excerpt hash that ties it to the specific passage in the source.
Finding
A scoring agent compares claims and observations, identifies a pattern (alignment, contradiction, omission, compression, diffusion, misalignment), and proposes a finding with references to the specific claims and observations that support it.
Debate
The skeptic challenges the finding. If sustained, the finding enters the evidence ledger with the skeptic's reasoning preserved. If rejected, the rejection reason and evidence gap are recorded.
Score
The sustained finding's dimension and strength feed the deterministic formula. The contribution to the final score is computed mechanically. No interpretation at this layer.

If a buyer asks "why is the Authority score 0.38?" — the answer traces from score, to sustained finding, to specific claims and observations, to the exact passage in the source document. The chain does not break.
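A sketch of how that chain can be represented, assuming hypothetical type names and a truncated SHA-256 standing in for whatever hash the real system uses (the actual schema is not published):

```python
import hashlib
from dataclasses import dataclass

def sha(text: str) -> str:
    """Truncated SHA-256; a stand-in for the real content hash."""
    return hashlib.sha256(text.encode()).hexdigest()[:12]

@dataclass
class SourceDocument:
    source_type: str          # e.g. "sec_filing", "customer_review"
    retrieved: str            # retrieval date
    content: str

    @property
    def content_hash(self) -> str:
        return sha(self.content)

@dataclass
class Extraction:
    extraction_id: str
    kind: str                 # "claim" or "observation"
    excerpt: str              # the specific passage in the source
    document: SourceDocument

    @property
    def excerpt_hash(self) -> str:
        return sha(self.excerpt)

@dataclass
class Finding:
    pattern: str              # alignment, contradiction, omission, ...
    evidence: list[Extraction]

def trace(finding: Finding) -> None:
    """Walk a finding back to the exact passages and documents behind it."""
    for e in finding.evidence:
        print(f"{finding.pattern} <- {e.extraction_id} [{e.excerpt_hash}]"
              f" <- {e.document.source_type} [{e.document.content_hash}]")

doc = SourceDocument("sec_filing", "2025-01-15", "full text of the filing ...")
claim = Extraction("c-001", "claim", "We lead the market in ...", doc)
trace(Finding("contradiction", [claim]))
```

Answering the "why 0.38?" question is then a mechanical walk down this chain.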

The Adversarial Skeptic

The skeptic is not a quality check. It is a structural feature of the instrument. It exists because the agents that generate findings can be wrong.

How it works

Round 1: Challenge. The skeptic reviews each proposed finding and issues a verdict: sustained (evidence holds), weakened (evidence is thin but not indefensible), or rejected (evidence does not support the claim).

Round 2: Rebuttal. Weakened findings get a second chance. The skeptic reviews them again with the original challenge in view. Final verdict: sustained or rejected. No appeals.
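The control flow is simple enough to sketch. Here the challenge callable stands in for the skeptic agent itself, and toy_challenge is purely illustrative:

```python
# Two-round verdict logic. The verdict labels and the challenge callable
# are stand-ins for the real skeptic agent.
SUSTAINED, WEAKENED, REJECTED = "sustained", "weakened", "rejected"

def debate(findings, challenge):
    ledger = []
    for finding in findings:
        verdict = challenge(finding, prior=None)          # round 1: challenge
        if verdict == WEAKENED:
            verdict = challenge(finding, prior=WEAKENED)  # round 2: rebuttal
            assert verdict in (SUSTAINED, REJECTED)       # no appeals
        if verdict == SUSTAINED:
            ledger.append(finding)    # sustained findings enter the ledger
    return ledger

# Toy skeptic: sustain well-evidenced findings, reject thin ones on rebuttal.
def toy_challenge(finding, prior):
    if len(finding["evidence"]) >= 3:
        return SUSTAINED
    return REJECTED if prior == WEAKENED else WEAKENED

print(debate([{"evidence": [1, 2, 3]}, {"evidence": [1]}], toy_challenge))
# only the first finding survives
```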

What the skeptic catches

Insufficient breadth
"The finding cites only 5 of 47 complaints. Are these representative or cherry-picked?"
Logical inversion
"The finding treats quality control issues as evidence of improvement. These observations contradict the claim."
Missing context
"Account freezes could be temporary or situational, not systemic. Frequency data is needed."
Aspirational conflation
"The mission may be aspirational. Operational constraints may exist. The misalignment is plausible but not proven."

In a recent scan, the skeptic rejected 4 of 6 proposed findings. That is the instrument working correctly. The rejection rate is published in every scan because what the instrument chose not to assert is as important as what it did.

Evidence minimums

Findings must meet minimum evidence thresholds before the skeptic even sees them. The thresholds vary by finding type — contradictions require more supporting evidence than alignments, because the claim is stronger. Findings below the threshold are automatically excluded. The exact thresholds are part of the proprietary scoring model.
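A sketch of that gate with placeholder numbers only; since the real thresholds are proprietary, the values below exist just to show the asymmetry between finding types:

```python
# Placeholder minimums, not the real model: contradictions carry a stronger
# claim, so they demand more supporting evidence than alignments.
MIN_EVIDENCE = {"alignment": 2, "omission": 3, "contradiction": 4}

def meets_minimum(finding: dict) -> bool:
    required = MIN_EVIDENCE.get(finding["pattern"], 3)
    return len(finding["evidence"]) >= required

proposed = [
    {"pattern": "contradiction", "evidence": ["e1", "e2"]},  # excluded
    {"pattern": "alignment", "evidence": ["e1", "e2"]},      # passes
]
for_debate = [f for f in proposed if meets_minimum(f)]  # only the alignment
```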

Deterministic Scoring

Once findings survive debate, the math is mechanical.

Each sustained finding has two properties that determine its contribution to the score:

Dimension
Truth: alignment (positive), omission (negative), contradiction (negative). Authority: compression (negative), diffusion (negative), misalignment (negative). Each dimension has a fixed base value.
Strength
Strong, moderate, or weak — determined by the scoring agent based on evidence density and specificity. Each strength level has a fixed weight.

The score computation is: contribution = base value × strength weight, summed across all sustained findings, normalized, and bounded. A sparse-finding dampener prevents single findings from dominating. No findings in a dimension = neutral score (0.50), not a good score.

The overall coherence score is a weighted average of Truth, Authority, and Continuity. The vertex weights and dimension base values were calibrated against 74 agent-scored runs and have been frozen since calibration. They do not change between scans, between entities, or between operators.

What this means
Give two operators the same sustained findings, and they will produce the same score. The scoring layer has zero degrees of freedom.
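That zero-degrees-of-freedom property can be made concrete with a sketch. Every constant below (base values, strength weights, the dampener, the vertex weights) is an illustrative placeholder, not the calibrated model:

```python
# Hypothetical rendering of the locked formula. All constants are
# placeholders; the calibrated values are proprietary and frozen.
BASE = {                                   # dimension base values (signs per the text)
    "alignment": +1.0, "omission": -1.0, "contradiction": -1.0,    # Truth
    "compression": -1.0, "diffusion": -1.0, "misalignment": -1.0,  # Authority
}
STRENGTH = {"strong": 1.0, "moderate": 0.6, "weak": 0.3}  # placeholder weights
NEUTRAL = 0.50                             # no findings = neutral, not good

def vertex_score(findings: list[tuple[str, str]]) -> float:
    """findings: (dimension, strength) pairs sustained for one vertex."""
    if not findings:
        return NEUTRAL
    total = sum(BASE[dim] * STRENGTH[strength] for dim, strength in findings)
    dampener = len(findings) / (len(findings) + 1)  # placeholder sparse-finding dampener
    raw = NEUTRAL + 0.5 * dampener * total / len(findings)
    return min(1.0, max(0.0, raw))                  # bounded to [0, 1]

def overall(truth: float, authority: float, continuity: float) -> float:
    w_t, w_a, w_c = 0.4, 0.4, 0.2          # placeholder vertex weights
    return w_t * truth + w_a * authority + w_c * continuity

truth = vertex_score([("alignment", "strong"), ("omission", "weak")])
authority = vertex_score([("compression", "moderate")])
print(round(overall(truth, authority, NEUTRAL), 2))
```

Run it twice, or hand it to a second operator: identical findings produce identical numbers, because nothing in the function asks for judgment.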

Confidence: What the Instrument Trusts

Every score comes with a confidence rating. This is not a subjective assessment — it is computed from five measurable properties of the evidence base.

The confidence score is the minimum across all assessed vertices. If Truth confidence is 0.85 but Authority confidence is 0.60, the overall confidence is 0.60. The instrument reports its weakest link, not its strongest.

A confidence of 0.78 means: "The evidence base is strong across multiple source types with high traceability and low internal contradiction." A confidence of 0.50 means: "I'm showing you what I see, but I need more data to be certain."
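The weakest-link rule itself is one line of code; only the computation of the per-vertex components is proprietary:

```python
# Overall confidence is the minimum across assessed vertices: the
# instrument reports its weakest link, not its strongest.
def overall_confidence(vertex_confidence: dict[str, float]) -> float:
    return min(vertex_confidence.values())

print(overall_confidence({"truth": 0.85, "authority": 0.60}))  # -> 0.6
```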

Caps and Guards

The instrument limits its own claims. When data quality is insufficient, scores are automatically capped regardless of what the agents found.

These constraints exist because the most dangerous thing a diagnostic instrument can do is claim precision it hasn't earned. The caps are not a limitation. They are a design choice.
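A sketch of the mechanism, assuming a hypothetical grade-to-cap table (the real triggers are proprietary):

```python
# Placeholder caps keyed on data-quality grade. The shape is the point:
# the cap applies regardless of what the agents found.
SCORE_CAP_BY_GRADE = {"A": 1.00, "B": 0.85, "C": 0.70, "D": 0.55}

def apply_cap(score: float, data_grade: str) -> float:
    return min(score, SCORE_CAP_BY_GRADE.get(data_grade, 0.55))

print(apply_cap(0.92, "C"))  # strong score, weak data -> capped at 0.7
```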

Reproducibility

A diagnostic instrument must produce consistent results. We tested this by scanning the same organization seven times across four collection periods as data quality improved from Grade C to Grade A.

Period   Truth   Authority   Overall   Grade   Primary Signal
1        0.45    0.38        0.42      C       Authority compression detected
2        0.33    0.55        0.43      B       Multiple failure modes cascade
3        0.58    0.38        0.49      A       Authority crisis — lowest score
4        0.44    0.50        0.47      A       Stress migrates to Truth

Scores moved. The structural diagnosis persisted. Authority compression was detected in the majority of runs. When data quality improved from Grade C to Grade A, scoring precision increased but the underlying story remained consistent. That is how a real diagnostic instrument behaves: readings sharpen as measurement quality improves, but the condition being detected does not change just because the measurement did.

Across a 15-entity fleet diagnostic, failure modes showed sector-level patterns: the same structural condition appeared in all three fintech entities analyzed, while aerospace and defense entities showed a different structural profile. The instrument detects structure, not noise.

Failure Mode Detection

The 17 failure modes in the DRI™ taxonomy are the structural conditions the instrument detects. Each is defined by an activation threshold, a set of cascade relationships to other failure modes, and a mapping from precursor signals.

The activation thresholds, cascade definitions, and precursor mappings themselves are part of the proprietary scoring model.
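As a structural sketch only, a taxonomy entry might carry fields like these (the names and values below are illustrative, not the published taxonomy or its real parameters):

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    activation_threshold: float   # when the condition counts as active (placeholder)
    cascades_to: list[str]        # downstream failure modes it can trigger
    precursors: list[str]         # finding patterns that tend to precede it

# Illustrative entry only; real thresholds and mappings are proprietary.
example = FailureMode(
    name="authority_compression",
    activation_threshold=0.40,
    cascades_to=["decision_bottleneck"],        # hypothetical mode name
    precursors=["compression", "misalignment"],
)
```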

What We Don't Share

The architecture of this instrument is transparent. The parameters are proprietary.

Published
The pipeline stages, the evidence chain structure, the adversarial debate process, the confidence model components, the scoring philosophy, the cap and guard system, the reproducibility evidence, and the full 17-failure-mode taxonomy.
Proprietary
The dimension base values, strength weights, vertex weights, sparse-finding dampener formula, confidence component weights, failure mode activation thresholds, evidence minimum thresholds, score cap triggers, and calibration dataset.

This is the same boundary that defines every serious measurement system. FICO publishes that credit scores use payment history, utilization, length of history, new credit, and credit mix. It does not publish the exact formula. The architecture builds trust. The parameters are the instrument.

See it in action
The method produces the measurement. The measurement produces the weather map.
View Sample Scan