Getting the boring stuff right with your data

Most of what makes real-world data valuable happens before any model runs: whether you can trace it, link it, and ship it the same way every month. A model built on data that fails those tests still returns confident answers. They're just wrong, and they carry the data's flaws forward. So I pay attention to the order things get built in. Below is the order I'd want, and what I'd want signed off at each step before moving to the next. Early notes, meant to start a conversation.

The order I'd build in
01 Governance & provenance

Before anything is built, the data has to be allowed to exist and be traceable to where it came from. That means de-identification, lineage, version control, and a data dictionary someone can actually read. None of this shows up in a demo, and everything above it depends on it.

  • A current HIPAA de-identification determination
  • End-to-end lineage and version control
  • A published data dictionary and known-limitations doc
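To make "traceable" concrete, here is a minimal sketch of a lineage entry and a data-dictionary coverage check. Everything named here (LineageRecord, record_lineage, undocumented_columns, the feed label) is hypothetical and only for illustration; a real setup would likely keep these records in a catalog or database rather than in memory.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageRecord:
    """One row of a lineage log: which bytes, built by which code."""
    source_name: str         # hypothetical label for an upstream feed
    content_sha256: str      # fingerprint of the exact bytes received
    code_version: str        # e.g. the commit hash of the build code
    dictionary_version: str  # version of the published data dictionary


def fingerprint(raw: bytes) -> str:
    """Return a SHA-256 hex digest so identical inputs are recognisable."""
    return hashlib.sha256(raw).hexdigest()


def record_lineage(source_name: str, raw: bytes,
                   code_version: str, dictionary_version: str) -> LineageRecord:
    """Tie a received file to the code and dictionary versions used on it."""
    return LineageRecord(source_name, fingerprint(raw),
                         code_version, dictionary_version)


def undocumented_columns(columns: list[str],
                         dictionary: dict[str, str]) -> list[str]:
    """List columns with no non-empty description in the data dictionary."""
    return sorted(c for c in columns if not dictionary.get(c, "").strip())
```

A check like undocumented_columns could run in the build so that a column without a readable description blocks the release instead of surfacing later as a customer question.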
02 Structure & linkage

Then the basics of quality: does the data conform, is it complete, is it plausible? And the part everything else leans on is linkage, resolving the same patient across sources and over time. If linkage is wrong, every patient-level analysis is wrong with it. So I'd measure the match rate and the false-match rate, not assume them.

  • Quality dimensions (conformance, completeness, plausibility) each measured, none sitting at zero
  • A stable de-identified patient token
  • Match and false-match rates estimated, not asserted
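One way to estimate rather than assert: manually review a sample of record pairs and compute the rates from that sample, with an interval around each. This is a sketch under assumptions. It assumes "match rate" means the share of true same-patient pairs the algorithm linked, and "false-match rate" means the share of links that were wrong. It also assumes the reviewed sample is a simple random sample; a stratified sample would need weighting. The function names are hypothetical.

```python
from __future__ import annotations

import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion."""
    if n == 0:
        raise ValueError("no reviewed pairs in this category")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def linkage_rates(reviewed: list[tuple[bool, bool]]) -> dict[str, float]:
    """Each item is (linked_by_algorithm, truly_same_patient) from manual review."""
    tp = sum(1 for linked, same in reviewed if linked and same)
    fp = sum(1 for linked, same in reviewed if linked and not same)
    fn = sum(1 for linked, same in reviewed if not linked and same)
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("sample needs at least one true match and one link")
    return {
        "match_rate": tp / (tp + fn),        # true matches that were linked
        "false_match_rate": fp / (tp + fp),  # links that were wrong
    }
```

Reporting the interval next to the point estimate keeps a small review sample from reading as more certain than it is.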
03 Reliable delivery

A dataset people can build on is one that refreshes on schedule without anyone staying late. The cycle is automated, it lands on its committed date, and it doesn't depend on one person being at their desk. If delivery only works because a few people perform heroics every month, that isn't a system yet.

  • An automated build more than one person can run
  • A committed calendar with a measured on-time record
  • A QC gate before every release, with sign-off
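The QC gate can be as simple as a function the automated build calls before it publishes anything. A minimal sketch, assuming checks arrive as named pass/fail results and that sign-off is recorded as a reviewer name; CheckResult and release_decision are hypothetical names, not part of any existing pipeline.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str     # hypothetical check label, e.g. "row_count"
    passed: bool


def release_decision(checks: list[CheckResult],
                     signed_off_by: str | None) -> tuple[bool, str]:
    """Allow a release only if checks ran, all passed, and someone signed off."""
    if not checks:
        return False, "blocked: no QC checks were run"
    failed = [c.name for c in checks if not c.passed]
    if failed:
        return False, "blocked: failed checks: " + ", ".join(failed)
    if not signed_off_by:
        return False, "blocked: checks passed but nobody has signed off"
    return True, f"released: signed off by {signed_off_by}"
```

Treating "no checks ran" as a block is a deliberate choice in this sketch: an empty result set should never count as a pass.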
04 Monitoring & continuity

The last piece is what keeps the data correct after it ships: checks that run on every refresh, alerts when something moves, a comparison against last month, and the ability to reproduce any past delivery exactly. Quality here is something that runs continuously, not a one-time check.

  • Automated checks and alerting on every refresh
  • Refresh-over-refresh reconciliation
  • Any past delivery reproducible from versioned code
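Refresh-over-refresh reconciliation can start small: compare a handful of summary metrics against the previous delivery and flag anything that moved more than a tolerance. The sketch below assumes metrics are plain name-to-number mappings and uses a 5% relative tolerance picked purely for illustration; reconcile and the metric names are hypothetical.

```python
from __future__ import annotations


def reconcile(previous: dict[str, float], current: dict[str, float],
              tolerance: float = 0.05) -> list[str]:
    """Return one alert per metric that vanished, appeared, or moved too far."""
    alerts = []
    for name in sorted(set(previous) | set(current)):
        if name not in current:
            alerts.append(f"{name}: present last refresh, missing now")
        elif name not in previous:
            alerts.append(f"{name}: new this refresh, no baseline")
        else:
            old, new = previous[name], current[name]
            if old == 0:
                moved = new != 0  # any change from zero is worth a look
            else:
                moved = abs(new - old) / abs(old) > tolerance
            if moved:
                alerts.append(f"{name}: {old} -> {new}")
    return alerts


# Example: a 1% rise in patients is quiet, a 20% drop in claims is flagged.
print(reconcile({"patients": 1000, "claims": 5000},
                {"patients": 1010, "claims": 4000}))
```

In practice the tolerance would probably differ per metric, since some counts are expected to grow every month and others should barely move.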
Then the AI layer

With those four in place, the AI work has something solid under it, and it can be a real advantage instead of a faster way to produce wrong answers. A few things worth building once the data supports them:

Patient journeys

Diagnosis pathways, treatment sequences, lines of therapy. Only as trustworthy as the linkage they stand on.

rests on · structure & linkage
Predictive agents

Patient-finding, adherence and discontinuation risk. Trustworthy only if the data they learn from is.

rests on · the whole stack
Governed data access

Exposing the asset to a customer's own agents, with role-based access and audit logging. Safe to do only once the data is governed.

rests on · governance
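For governed data access, the core idea is that every request is checked against a role and written to an audit log whether or not it is allowed. A minimal sketch with hypothetical roles and actions; a real deployment would sit behind an identity provider and write to durable, append-only storage rather than a Python list.

```python
from __future__ import annotations

from datetime import datetime, timezone

# Hypothetical roles and actions, for illustration only.
ROLE_PERMISSIONS: dict[str, set[str]] = {
    "analyst": {"read_aggregates"},
    "data_steward": {"read_aggregates", "read_patient_level"},
}


def authorize(user: str, role: str, action: str, audit_log: list[dict]) -> bool:
    """Check the role's permissions and record the attempt either way."""
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "action": action,
        "allowed": allowed,
    })
    return allowed
```

Denied requests are logged as well as granted ones, since a pattern of refusals is often the first sign that an agent is asking for more than its role allows.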