Our Data Strategy, Published

How we chose prompting over fine-tuning, why collection is the ballgame, and what finding-derived scoring changed about determinism.

Our bucket: Prompting
Score stdev: 0.000
Embedded photos: 150K+
Pipeline runs: 133

“Companies should publish their data strategy.”

Most won't. Data strategy lives in internal decks that nobody reads twice, or worse, in the heads of three engineers who haven't written anything down. Publishing forces clarity. If you can't explain your data strategy to someone outside your organization, you probably don't have one — you have a collection of habits.

So here's ours.

Every AI system makes a fundamental choice

Every AI system chooses among three buckets — training, fine-tuning, or prompting. We chose prompting. Deliberately.

Training
Not our bucket
Build a model from scratch. Massive data, massive compute, a team that debugs gradient descent at 3 AM. Unless you're a frontier lab, this isn't your bucket.
Fine-tuning
Evaluated, rejected
Specialize a foundation model on your data. Domain expertise baked into weights — but also brittleness. Fine-tuned models drift and hallucinate confidently just outside their training distribution.
Prompting
Our bucket
Keep the model general. Make your data strategy about what goes into the context window. The model is a reasoning engine; your job is to give it the right evidence at the right time.

The thing we're building — a diagnostic system that examines how organizations make decisions and where those decisions structurally break down — requires the model to reason about novel situations with fresh evidence. We don't want a model that has "learned" what organizational dysfunction looks like and pattern-matches against its training. We want a model that looks at this company's public evidence this week and draws conclusions from what it actually finds.

Prompting keeps the reasoning honest. The model can't fall back on memorized patterns. It has to show its work with the evidence we give it. That choice cascades into everything else.

If prompting is the bucket, collection is the ballgame

The quality of what goes into the context window determines the quality of what comes out. We collect from multiple public source types per entity: regulatory filings, press releases, job postings, customer reviews, employee reviews, social media, news coverage. Each source type has a different collector with its own ingestion logic and deduplication.

A company's job postings tell you something its press releases never will. Customer complaints reveal patterns that earnings calls are designed to obscure.

Principle 01

Multi-source diversity

No single source type gets to dominate the evidence base. We cap each source during scoring so that an entity with 1,200 customer reviews and 10 press releases doesn't become a customer-review-only analysis.
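In code, the cap might look like this minimal sketch. The field names, the cap value, and the newest-first tiebreak are all illustrative assumptions, not the published implementation:

```python
from collections import defaultdict

MAX_PER_SOURCE = 50  # illustrative cap, not the real value

def cap_by_source(documents, cap=MAX_PER_SOURCE):
    """Keep at most `cap` documents per source type, newest first,
    so 1,200 customer reviews can't drown out 10 press releases."""
    by_source = defaultdict(list)
    for doc in sorted(documents, key=lambda d: d["date"], reverse=True):
        bucket = by_source[doc["source_type"]]
        if len(bucket) < cap:
            bucket.append(doc)
    return [doc for bucket in by_source.values() for doc in bucket]
```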

Principle 02

Deduplication at ingest

The same content appears across sources — a press release quoted in a news article, a Reddit post screenshotted on Twitter. Content dedup catches this before it enters the pipeline. Duplicate evidence doesn't make a finding stronger; it makes the system overconfident.
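A minimal version of content dedup is a hash over normalized text. This sketch catches only exact duplicates after whitespace and case normalization; a real pipeline would likely add near-duplicate detection (shingling, MinHash), which is not shown:

```python
import hashlib
import re

def content_key(text: str) -> str:
    """Normalize whitespace and case, then hash, so the same passage
    quoted across sources collapses to one key."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedupe(documents):
    """Drop duplicate content before it enters the pipeline, keeping
    the first occurrence."""
    seen, unique = set(), []
    for doc in documents:
        key = content_key(doc["text"])
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```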

Principle 03

Freshness matters

We re-collect before every analysis run. Stale data is a structural risk. An organization's public signals shift week to week — a job posting that appeared Friday and disappeared Monday is a signal. A regulatory filing from six months ago is context, not evidence.

Finding-derived, adversarial, deterministic

Here's where most people's data strategies stop: collect data, throw it at a model, get a number. We spent months learning why that doesn't work.

Early versions of our system asked an AI agent to read the evidence and produce a score — a float between 0 and 1. The problem: that float was non-deterministic. Same evidence, same model, temperature set to zero, and a different number every time. Unacceptable for anything you'd want to stand behind.

The solution was to stop trusting the model to produce numbers and start trusting it to produce findings — structured observations about specific patterns in the evidence. The model is good at reading a document and saying "this company's job postings contradict its public messaging about AI investment." The model is bad at deciding whether that's a 0.37 or a 0.42.

Finding-Derived Scoring

Agents produce findings. A skeptic agent challenges each one. Findings that survive adversarial debate are "sustained." The score is computed deterministically from sustained findings using calibrated weights. Three independent runs of the same evidence now produce identical scores. Stdev: 0.000.
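The derivation step reduces to pure arithmetic over sustained findings. In this sketch the weights and field names are invented for illustration; the point is that the same findings always yield the same float:

```python
# Assumed calibration weights, not the real ones.
SEVERITY_WEIGHTS = {"low": 0.1, "medium": 0.25, "high": 0.5}

def score_dimension(findings):
    """Compute a score deterministically from findings that survived
    the skeptic. No model call: identical inputs, identical output."""
    sustained = [f for f in findings if f["status"] == "sustained"]
    total = sum(SEVERITY_WEIGHTS[f["severity"]] for f in sustained)
    return round(min(total, 1.0), 4)
```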

Score stdev across runs: 0.000
Findings per dimension (hard cap): 3
Avg skeptic throughput: 50%
Design Choice 01

Hard finding caps

Each analysis produces exactly three findings per dimension — no more, no less. Without this, the model would produce anywhere from two to six findings per run, creating a different debate surface each time. Fixed count means fixed surface means reproducible outcomes.

Design Choice 02

Seeded sampling

When the evidence base is too large for the context window, we sample. That sample is seeded from the entity and source metadata, so the same input always selects the same documents.
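Seeding from entity and collection metadata can be sketched like this (the seed derivation and field names are assumptions; the property that matters is that the same input always selects the same documents):

```python
import hashlib
import random

def seeded_sample(documents, entity_id, collection_date, k):
    """Deterministically sample k documents: the seed is derived from
    entity and collection metadata, and documents are sorted first so
    input ordering can't change the result."""
    seed_material = f"{entity_id}:{collection_date}".encode("utf-8")
    seed = int.from_bytes(hashlib.sha256(seed_material).digest()[:8], "big")
    rng = random.Random(seed)
    return rng.sample(sorted(documents, key=lambda d: d["id"]), k)
```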

Design Choice 03

Skeptic throughput as a quality metric

We track what percentage of findings survive the adversarial debate. A run where everything survives isn't rigorous; a run where nothing survives points to bad extraction. The sweet spot is in the middle.
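As a quality gate, that looks like a survival rate plus a band check. The 30–70% band here is an assumed threshold around the ~50% average the post reports, not a published parameter:

```python
def skeptic_throughput(findings):
    """Share of findings that survived the adversarial debate."""
    if not findings:
        return 0.0
    sustained = sum(1 for f in findings if f["status"] == "sustained")
    return sustained / len(findings)

def throughput_healthy(findings, low=0.3, high=0.7):
    """Flag runs outside the band: ~100% suggests a rubber-stamp
    skeptic, ~0% suggests bad extraction."""
    return low <= skeptic_throughput(findings) <= high
```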

Visual and textual, separate concerns

We maintain two embedding strategies for two different problems.

For our visual corpus (150,000+ photographs), we use CLIP embeddings — 768-dimensional vectors that let us search images by concept rather than metadata. "Show me photos that feel like isolation" returns something. The visual index exists to connect a diagnostic concept to an image that illustrates it, without anyone having to manually tag 150,000 files.
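The retrieval side of concept search is just cosine similarity over embedding vectors. In this sketch the query vector would come from CLIP's text encoder and the image vectors from its image encoder (neither is shown); only the ranking step is implemented:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_by_concept(query_vec, image_vecs, k=5):
    """Return indices of the k images closest to the concept embedding,
    e.g. the text embedding of 'isolation'."""
    ranked = sorted(range(len(image_vecs)),
                    key=lambda i: cosine(query_vec, image_vecs[i]),
                    reverse=True)
    return ranked[:k]
```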

For the diagnostic pipeline, we use ChromaDB for retrieval over extracted evidence. The key decision was keeping the embedding index per entity, per collection date. No cross-contamination between entities. No stale embeddings from last month's collection mixed with this week's evidence.
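Per-entity, per-date isolation can be enforced at the collection-naming layer. The naming scheme below is illustrative, not the real one; the commented ChromaDB calls show roughly where it would plug in:

```python
def collection_name(entity_id: str, collection_date: str) -> str:
    """One embedding index per entity, per collection date: no
    cross-entity contamination, no stale vectors from an earlier run."""
    return f"evidence__{entity_id}__{collection_date}"

# With ChromaDB this would plug in roughly like (not executed here):
#   client = chromadb.Client()
#   col = client.get_or_create_collection(
#       collection_name("acme", "2024-06-03"))
#   col.add(ids=..., documents=..., embeddings=...)
```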

The boring part that matters most

Data strategy without governance is just data hoarding.

Discipline 01

Experiment discipline

Every change to the pipeline — code or prompt — is tracked separately. We learned this the hard way when unlabeled changes across both layers silently invalidated 41 prompt-optimization experiments. The baseline shifted and nobody noticed until scores collapsed. Now: code changes are tested independently, prompt changes live on the experiment surface, and the two never mix in the same commit.

Discipline 02

Automated prompt optimization

We run a continuous evaluation harness that tests prompt variations against a fixed proxy dataset. The metric isn't "does the score go up" — it's a composite of reproducibility, stability, evidence quality, and diagnostic value. A prompt that produces high scores but unreproducible findings is worse than one that produces moderate scores reliably.
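A composite like that might be a weighted sum over the four axes. The weights here are invented; the design point is that reproducibility carries enough weight that a flashy-but-irreproducible prompt loses to a reliable one:

```python
# Assumed weights, chosen for illustration only.
WEIGHTS = {"reproducibility": 0.35, "stability": 0.25,
           "evidence_quality": 0.2, "diagnostic_value": 0.2}

def composite(metrics):
    """Weighted sum of per-axis scores, each in [0, 1]."""
    return sum(WEIGHTS[axis] * metrics[axis] for axis in WEIGHTS)
```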

Discipline 03

Post-run hygiene

Every completed run triggers automated ingest into a performance ledger, token aggregation, and scorecard generation. The system of record updates itself. That operational history is the governance.

What we don't publish

Transparency has limits.

The strategy is the architecture of decisions: why prompting over fine-tuning, why adversarial debate over single-pass scoring, why deterministic derivation over model-generated floats. Those decisions generalize. The specific implementation is ours.

Write it down. Publish it.

If you're building an AI system and you haven't written down your data strategy, start with the three buckets. Pick one. Then follow the implications all the way through: if you pick prompting, your collection architecture becomes your competitive advantage. If you pick fine-tuning, your data curation pipeline is everything. If you pick training, you're probably not reading this.

Whatever you pick, write it down. Publish it. The act of making it legible to outsiders is the act of making it legible to yourself.

And if your data strategy is “we put everything in a vector database and hope for the best” — that's not a strategy. That's a prayer.

See this strategy in practice

The Coherence Diagnostic Engine applies this data strategy to measure the structural gap between what organizations say and what they do.

Read the Case Study