01 Context
Medical coders read clinical notes and assign ICD-10 diagnosis codes and CPT procedure codes. The work is slow, expensive, and prone to undercoding or miscoding. A coding service that suggests accurate codes from clinical text — ranked by confidence, with safety guardrails — saves coder time and reduces revenue leakage.
I designed and built this service end-to-end as the sole engineer at a healthcare technology company.
02 Challenge
A single retrieval strategy isn't enough. BioBERT can hallucinate semantic similarity for codes that aren't clinically relevant. BM25 is brittle to phrasing — "MI" and "myocardial infarction" read as completely different inputs. Historical co-occurrence is stale for newer codes.
And the cost of a bad suggestion is high. These codes drive billing and downstream clinical decisions. A model that confidently suggests the wrong code is worse than no model at all.
03 Approach
I built a 3-layer hybrid retrieval system, each layer covering the others' blind spots.
Historical co-occurrence: codes that have been assigned to similar notes in our historical data.
BM25 lexical search: keyword search over ICD-10 and CPT descriptions via rank-bm25.
BioBERT semantic: notes encoded into 768-dimensional vectors and indexed in FAISS for nearest-neighbour search.
The layers run in parallel and merge via configurable weighted scoring. Weights live in YAML, not code, so changing the balance doesn't need a deploy.
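The merge step is a straightforward weighted sum over whatever candidates each layer surfaces. Here's a minimal sketch — the layer names, weights, and scores are hypothetical stand-ins, and a plain dict stands in for the YAML config:

```python
# Hypothetical weights; in production these would be loaded from YAML
# (e.g. yaml.safe_load) so rebalancing doesn't require a deploy.
WEIGHTS = {"cooccurrence": 0.4, "bm25": 0.3, "biobert": 0.3}

def merge_scores(layer_results, weights):
    """Merge per-layer candidate scores into one weighted ranking.

    layer_results maps layer name -> {code: normalized score in [0, 1]}.
    A code missing from a layer simply contributes 0 for that layer.
    """
    combined = {}
    for layer, results in layer_results.items():
        w = weights.get(layer, 0.0)
        for code, score in results.items():
            combined[code] = combined.get(code, 0.0) + w * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: each layer surfaces partially overlapping ICD-10 candidates.
ranked = merge_scores(
    {
        "cooccurrence": {"I21.9": 0.9, "I10": 0.6},
        "bm25": {"I21.9": 0.7},
        "biobert": {"I21.9": 0.8, "R07.9": 0.5},
    },
    WEIGHTS,
)
```

A code that every layer agrees on accumulates weight from all three, so consensus candidates naturally rise above any single layer's favourite.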
On top of retrieval, I added clinical safety guardrails — negation detection so "patient denies chest pain" doesn't promote chest-pain codes, temporal classification to distinguish acute from historical conditions, demographic exclusions, and MedCPT score damping to prevent ranking artifacts.
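The negation guardrail can be illustrated with a simple cue-plus-window check. This is a deliberate simplification — the cue list and window size here are illustrative assumptions, not the production logic, which also has to handle scope terminators like "but" and "however":

```python
import re

# Illustrative single-token negation cues (assumed, not the real list).
NEGATION_CUES = {"denies", "denied", "no", "without"}

def is_negated(note, finding, window=5):
    """Return True if `finding` appears within `window` tokens after a
    negation cue, e.g. the "chest pain" in "patient denies chest pain"."""
    tokens = re.findall(r"[a-z]+", note.lower())
    finding_tokens = re.findall(r"[a-z]+", finding.lower())
    n = len(finding_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == finding_tokens:
            preceding = tokens[max(0, i - window):i]
            if any(cue in preceding for cue in NEGATION_CUES):
                return True
    return False
```

A finding flagged as negated is simply excluded from code promotion rather than scored down, so "patient denies chest pain" can never surface a chest-pain code.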
The part I'm proudest of
Beyond unit tests, I set up accuracy-regression gates that run a 31-case and a 300-case clinical evaluation on every PR. They block merges that pass unit tests but regress clinical accuracy.
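At its core, the gate reduces to one comparison against a pinned baseline. A sketch with hypothetical case and predictor shapes (the real harness scores partial code-set matches, not exact equality):

```python
def run_accuracy_gate(cases, predict, baseline, tolerance=0.0):
    """Fail the build if accuracy on the clinical eval set drops below
    the pinned baseline (minus an optional tolerance).

    `cases` is a list of (note_text, expected_codes) pairs and `predict`
    returns the suggested code set for a note; both are hypothetical
    stand-ins for the real evaluation harness.
    """
    hits = sum(1 for note, expected in cases if predict(note) == expected)
    accuracy = hits / len(cases)
    if accuracy < baseline - tolerance:
        raise AssertionError(
            f"Accuracy regression: {accuracy:.3f} < baseline {baseline:.3f}"
        )
    return accuracy

# Toy eval set with a stub predictor that gets 2 of 3 cases right.
cases = [("note a", {"I10"}), ("note b", {"E11.9"}), ("note c", {"I21.9"})]
stub_predict = {"note a": {"I10"}, "note b": {"E11.9"}, "note c": {"R07.9"}}.get
acc = run_accuracy_gate(cases, stub_predict, baseline=0.6)
```

Wiring this into CI means a refactor that passes every unit test but quietly degrades clinical accuracy still can't merge — the failure mode unit tests structurally cannot see.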
04 Outcome
Shipped to production in under a year, solo. The accuracy gates have caught multiple regressions that traditional unit tests would have missed — refactors that were syntactically correct but broke the temporal classifier in subtle ways.
05 What I'd do differently
If I were starting over, I'd put more thought into pagination strategy upfront. We have a mix of offset and cursor pagination across endpoints, and that inconsistency was a footgun more than once.
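The difference between the two styles fits in a few lines — a hypothetical sketch, not our actual endpoint shapes:

```python
def offset_page(items, offset, limit):
    """Offset pagination: skip `offset` rows. Simple, but results shift
    if rows are inserted or deleted between requests."""
    return items[offset:offset + limit]

def cursor_page(items, after_id, limit):
    """Cursor pagination: resume after a stable id, so concurrent
    inserts and deletes can't make a page skip or repeat rows."""
    ids = [item["id"] for item in items]
    start = ids.index(after_id) + 1 if after_id in ids else 0
    return items[start:start + limit]

rows = [{"id": n} for n in (1, 2, 3, 4, 5)]
by_offset = offset_page(rows, 2, 2)          # third and fourth rows
by_cursor = cursor_page(rows, after_id=2, limit=2)  # rows after id 2
```

Picking one style (cursor, in hindsight) for every list endpoint from day one would have removed a whole class of client-side special-casing.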