Most of what makes real-world data valuable is settled before any model runs: whether you can trace it, link it, and ship it the same way every month. A model built on data that fails those tests still returns confident answers. They're just wrong, and they carry the data's flaws forward. So I pay attention to the order things get built in. Below is the order I'd want, and what I'd want signed off at each step before moving to the next. Early notes, meant to start a conversation.
Before anything is built, the data has to be permitted for use and traceable to where it came from. That means de-identification, lineage, version control, and a data dictionary someone can actually read. None of this shows up in a demo, and everything built on top of it depends on it.
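To make "traceable" a little more concrete, here is a minimal sketch of a lineage record: a content fingerprint plus a pointer to the versions a dataset was derived from. The function names, field names, and the dataset identifiers in the example are all hypothetical, and a real pipeline would hash files or table snapshots rather than in-memory rows.

```python
import hashlib
import json
from typing import Dict, Iterable, List


def fingerprint(rows: List[Dict]) -> str:
    """Order-independent SHA-256 over a list of JSON-serializable rows."""
    # Serialize each row with sorted keys, then sort the rows themselves,
    # so the same content always produces the same hash.
    encoded = sorted(json.dumps(r, sort_keys=True, separators=(",", ":")) for r in rows)
    digest = hashlib.sha256()
    for line in encoded:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def lineage_record(dataset: str, version: str, rows: List[Dict],
                   parents: Iterable[str]) -> Dict:
    """Hypothetical lineage entry: what this is, and what it came from."""
    return {
        "dataset": dataset,
        "version": version,
        "sha256": fingerprint(rows),
        "derived_from": list(parents),
    }


# Illustrative data only.
rows = [{"patient_key": "p1", "dx_code": "E11"}, {"patient_key": "p2", "dx_code": "I10"}]
record = lineage_record("claims_linked", "2024-05", rows, ["claims_raw@2024-05"])
```

The point of the fingerprint is that two deliveries with the same hash are the same delivery, whatever order the rows arrived in.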
Then the basics of quality — does the data conform, is it complete, is it plausible — and the part everything else leans on: linkage, resolving the same patient across sources and over time. If linkage is wrong, every patient-level analysis is wrong with it. So I'd measure the match rate, not assume it.
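One way to measure linkage rather than assume it is to score the links the pipeline produced against a hand-reviewed reference sample. The sketch below is only an illustration under that assumption: the record IDs are invented, and `linkage_quality` is a hypothetical helper, not part of any particular linkage tool.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def _normalize(pairs: Iterable[Pair]) -> Set[Pair]:
    """Treat (a, b) and (b, a) as the same link."""
    return {tuple(sorted(p)) for p in pairs}


def linkage_quality(predicted: Iterable[Pair], reference: Iterable[Pair]) -> Dict[str, float]:
    """Precision and recall of predicted links against a reviewed reference set."""
    pred = _normalize(predicted)
    gold = _normalize(reference)
    correct = pred & gold
    return {
        # Of the links we made, how many are right?
        "precision": len(correct) / len(pred) if pred else 0.0,
        # Of the links that should exist, how many did we find?
        "recall": len(correct) / len(gold) if gold else 0.0,
    }


# Illustrative IDs only: "a*" from one source, "b*" from another.
predicted = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]
reference = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
scores = linkage_quality(predicted, reference)
```

Here two of the three predicted links are correct and two of the four true links were found, so precision is 2/3 and recall is 0.5. Reporting both matters: a single "match rate" can look high while false links quietly merge different patients.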
A dataset people can build on is one that refreshes on schedule without anyone staying late. The cycle is automated, it lands on its committed date, and it doesn't depend on one person being at their desk. If delivery only works because a few people perform heroics every month, that isn't a system yet.
The last piece is what keeps the data correct after it ships: checks that run on every refresh, alerts when something moves, a comparison against last month, and the ability to reproduce any past delivery exactly. Quality here is something that runs continuously, not a one-time check.
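A rough sketch of the "comparison against last month" check: profile each refresh into a handful of numbers, then flag anything that moved more than a tolerance. The metric names and thresholds here are assumptions chosen for illustration; real tolerances would need to be set per column from the data's history.

```python
from typing import Dict, List


def profile(rows: List[Dict], columns: List[str]) -> Dict[str, float]:
    """Summarize one refresh: row count plus null rate per column."""
    n = len(rows)
    metrics = {"row_count": float(n)}
    for col in columns:
        missing = sum(1 for r in rows if r.get(col) in (None, ""))
        metrics[f"null_rate:{col}"] = missing / n if n else 0.0
    return metrics


def drift_alerts(previous: Dict[str, float], current: Dict[str, float],
                 rel_tol: float = 0.10, abs_tol: float = 0.02) -> List[str]:
    """Return one message per metric that moved beyond tolerance."""
    alerts = []
    for name, now in current.items():
        before = previous.get(name)
        if before is None:
            alerts.append(f"{name}: no baseline from previous refresh")
            continue
        if name == "row_count":
            # Row counts are compared relatively (assumed 10% by default).
            moved = abs(now - before) > rel_tol * max(before, 1.0)
        else:
            # Rates are compared in absolute points (assumed 0.02 by default).
            moved = abs(now - before) > abs_tol
        if moved:
            alerts.append(f"{name}: {before:.3f} -> {now:.3f}")
    return alerts


# Illustrative refreshes: same size, but half the diagnosis codes went missing.
last_month = [{"id": i, "dx_code": "E11"} for i in range(4)]
this_month = [{"id": i, "dx_code": "E11" if i < 2 else None} for i in range(4)]
alerts = drift_alerts(profile(last_month, ["dx_code"]), profile(this_month, ["dx_code"]))
```

In this example the row count is unchanged, so the only alert is for the `dx_code` null rate. Run on every refresh, a check like this is what turns "quality" from a sign-off into something continuous.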
With those four in place, the AI work has something solid under it, and it can be a real advantage instead of a faster way to produce wrong answers. A few things worth building once the data supports them:
Diagnosis pathways, treatment sequences, lines of therapy. Only as trustworthy as the linkage they stand on.
rests on · structure & linkage
Patient-finding, adherence and discontinuation risk. Trustworthy only because the data they learn from is.
rests on · the whole stack
Exposing the asset to a customer's own agents, with role-based access and audit logging. Safe to do only once the data is governed.
rests on · governance
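For the agent-access item, a minimal sketch of what "role-based access and audit logging" could look like at the gateway. The roles, resource names, and `AuditedGateway` class are hypothetical; the one design choice worth keeping is that every request is logged, including the ones that are denied.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set

# Hypothetical role-to-resource grants.
ROLE_GRANTS: Dict[str, Set[str]] = {
    "analyst": {"cohort_counts"},
    "data_steward": {"cohort_counts", "patient_timeline"},
}


@dataclass
class AuditedGateway:
    """Checks a caller's role and records every attempt, allowed or not."""
    audit_log: List[dict] = field(default_factory=list)

    def request(self, caller: str, role: str, resource: str) -> bool:
        allowed = resource in ROLE_GRANTS.get(role, set())
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "caller": caller,
            "role": role,
            "resource": resource,
            "allowed": allowed,
        })
        return allowed


gateway = AuditedGateway()
ok = gateway.request("agent-7", "analyst", "cohort_counts")         # granted
denied = gateway.request("agent-7", "analyst", "patient_timeline")  # not granted
```

Unknown roles fall through to an empty grant set, so the default is deny.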