01 Context
Medical coders read clinical notes and assign ICD-10 diagnosis codes and CPT procedure codes. The work is slow, expensive, and prone to undercoding or miscoding. A coding service that suggests accurate codes from clinical text — ranked by confidence, with safety guardrails — saves coder time and reduces revenue leakage.
I designed and built this service end-to-end as the sole engineer at a healthcare technology company.
02 Challenge
A single retrieval strategy isn't enough. BioBERT can hallucinate semantic similarity for codes that aren't clinically relevant. BM25 is brittle to phrasing — "MI" and "myocardial infarction" read as completely different inputs. Historical co-occurrence is stale for newer codes.
And the cost of a bad suggestion is high. These codes drive billing and downstream clinical decisions. A model that confidently suggests the wrong code is worse than no model at all.
03 Approach
I built a 3-layer hybrid retrieval system, each layer covering the others' blind spots.
Historical co-occurrence: codes that have been assigned to similar notes in our historical data.
BM25 lexical search: keyword search over ICD-10 and CPT descriptions via rank-bm25.
BioBERT semantic: notes encoded into 768-dimensional vectors and indexed in FAISS for nearest-neighbour search.
The layers run in parallel and merge via configurable weighted scoring. Weights live in YAML, not code, so changing the balance doesn't need a deploy.
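The merge step is a straightforward weighted sum over whatever candidates each layer surfaces. Here's a minimal sketch — the layer names, weights, and scores are hypothetical stand-ins, and a plain dict stands in for the YAML config:

```python
# Hypothetical weights; in production these would be loaded from YAML
# (e.g. yaml.safe_load) so rebalancing doesn't require a deploy.
WEIGHTS = {"cooccurrence": 0.4, "bm25": 0.3, "biobert": 0.3}

def merge_scores(layer_results, weights):
    """Merge per-layer candidate scores into one weighted ranking.

    layer_results maps layer name -> {code: normalized score in [0, 1]}.
    A code missing from a layer simply contributes 0 for that layer.
    """
    combined = {}
    for layer, results in layer_results.items():
        w = weights.get(layer, 0.0)
        for code, score in results.items():
            combined[code] = combined.get(code, 0.0) + w * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: each layer surfaces partially overlapping ICD-10 candidates.
ranked = merge_scores(
    {
        "cooccurrence": {"I21.9": 0.9, "I10": 0.6},
        "bm25": {"I21.9": 0.7},
        "biobert": {"I21.9": 0.8, "R07.9": 0.5},
    },
    WEIGHTS,
)
```

A code that every layer agrees on accumulates weight from all three, so consensus candidates naturally rise above any single layer's favourite.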
On top of retrieval, I added clinical safety guardrails — negation detection so "patient denies chest pain" doesn't promote chest-pain codes, temporal classification to distinguish acute from historical conditions, demographic exclusions, and MedCPT score damping to prevent ranking artifacts.
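The negation guardrail can be illustrated with a simple cue-plus-window check. This is a deliberate simplification — the cue list and window size here are illustrative assumptions, not the production logic, which also has to handle scope terminators like "but" and "however":

```python
import re

# Illustrative single-token negation cues (assumed, not the real list).
NEGATION_CUES = {"denies", "denied", "no", "without"}

def is_negated(note, finding, window=5):
    """Return True if `finding` appears within `window` tokens after a
    negation cue, e.g. the "chest pain" in "patient denies chest pain"."""
    tokens = re.findall(r"[a-z]+", note.lower())
    finding_tokens = re.findall(r"[a-z]+", finding.lower())
    n = len(finding_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == finding_tokens:
            preceding = tokens[max(0, i - window):i]
            if any(cue in preceding for cue in NEGATION_CUES):
                return True
    return False
```

A finding flagged as negated is simply excluded from code promotion rather than scored down, so "patient denies chest pain" can never surface a chest-pain code.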
The part I'm proudest of
Beyond unit tests, I set up accuracy-regression gates that run a 31-case and a 300-case clinical evaluation on every PR. They block merges that pass unit tests but regress clinical accuracy.
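At its core, the gate reduces to one comparison against a pinned baseline. A sketch with hypothetical case and predictor shapes (the real harness scores partial code-set matches, not exact equality):

```python
def run_accuracy_gate(cases, predict, baseline, tolerance=0.0):
    """Fail the build if accuracy on the clinical eval set drops below
    the pinned baseline (minus an optional tolerance).

    `cases` is a list of (note_text, expected_codes) pairs and `predict`
    returns the suggested code set for a note; both are hypothetical
    stand-ins for the real evaluation harness.
    """
    hits = sum(1 for note, expected in cases if predict(note) == expected)
    accuracy = hits / len(cases)
    if accuracy < baseline - tolerance:
        raise AssertionError(
            f"Accuracy regression: {accuracy:.3f} < baseline {baseline:.3f}"
        )
    return accuracy

# Toy eval set with a stub predictor that gets 2 of 3 cases right.
cases = [("note a", {"I10"}), ("note b", {"E11.9"}), ("note c", {"I21.9"})]
stub_predict = {"note a": {"I10"}, "note b": {"E11.9"}, "note c": {"R07.9"}}.get
acc = run_accuracy_gate(cases, stub_predict, baseline=0.6)
```

Wiring this into CI means a refactor that passes every unit test but quietly degrades clinical accuracy still can't merge — the failure mode unit tests structurally cannot see.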
04 Outcome
Shipped to production in under a year, solo. The accuracy gates have caught multiple regressions that traditional unit tests would have missed — refactors that were syntactically correct but broke the temporal classifier in subtle ways.
05 What I'd do differently
If I were starting over, I'd put more thought into pagination strategy upfront. We have a mix of offset and cursor pagination across endpoints, and that inconsistency was a footgun more than once.
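The difference between the two styles fits in a few lines — a hypothetical sketch, not our actual endpoint shapes:

```python
def offset_page(items, offset, limit):
    """Offset pagination: skip `offset` rows. Simple, but results shift
    if rows are inserted or deleted between requests."""
    return items[offset:offset + limit]

def cursor_page(items, after_id, limit):
    """Cursor pagination: resume after a stable id, so concurrent
    inserts and deletes can't make a page skip or repeat rows."""
    ids = [item["id"] for item in items]
    start = ids.index(after_id) + 1 if after_id in ids else 0
    return items[start:start + limit]

rows = [{"id": n} for n in (1, 2, 3, 4, 5)]
by_offset = offset_page(rows, 2, 2)          # third and fourth rows
by_cursor = cursor_page(rows, after_id=2, limit=2)  # rows after id 2
```

Picking one style (cursor, in hindsight) for every list endpoint from day one would have removed a whole class of client-side special-casing.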