2025 — Present · Sole engineer · Healthcare tech company

Pharmaceutical Reference Data Platform

Reverse-engineered the data model from 309 raw CSV files, designed a 714-model relational schema, and built a Django/DRF platform serving 730+ versioned REST endpoints on top of it.

python manage.py import_pharma --csv-dir=./data
analyzing 309 csvs ...
↳ inferring schema · types · null patterns
↳ detecting FK candidates by value overlap
building dependency graph
↳ topological sort · 714 nodes · 1,229 edges
↳ 4 cycles detected → 2-pass strategy
importing ...
✓ units.csv · 142 rows
✓ ingredients.csv · 8,407 rows
✓ drugs.csv · 24,318 rows
✓ interactions.csv · 412,884 rows
✓ 309/309 csvs · 0 orphans
elapsed: 4m 18s · idempotent
714 nodes · 1,229 edges · 13 sub-domains

01 Context

A clinical product needed a queryable source-of-truth for pharmaceutical reference data: drug interactions, dosing rules, REMS programs, allergies, prescriptions, and the workflows downstream of all of it.

The source data was delivered as roughly 300 MB across 309 interrelated CSV files — no schema document, no ER diagram, no formal description of what the data meant or how it related. Just files. The platform needed to ingest them reliably, expose them via REST APIs, and support full-text search across them.

02 Challenge

The first problem wasn't writing code — it was understanding the data. 309 CSVs with no formal schema. Some column names hinted at their meaning; many didn't. Some foreign-key relationships were explicit (a column literally named drug_id); others had to be inferred from value distributions and overlaps across files.
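The overlap heuristic reduces to a set intersection. A minimal sketch (the file and column names are illustrative, and a real detector would also need value sampling and type coercion):

```python
import csv

def column_values(path, column):
    """Distinct non-empty values of one column in a CSV file."""
    with open(path, newline="") as f:
        return {row[column] for row in csv.DictReader(f) if row[column]}

def fk_candidate_score(child_values, parent_values):
    """Fraction of the child column's values found in the candidate
    parent column. A score near 1.0 marks a likely foreign key."""
    if not child_values:
        return 0.0
    return len(child_values & parent_values) / len(child_values)
```

Scoring every (column, primary-key-like column) pair across files and flagging the high scorers turns "which column points where" from guesswork into a review queue.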

Once the schema was understood, the second problem was ordering. drug_interactions.csv references rows in drugs.csv, which references ingredients.csv, which references units.csv. Import in the wrong order and you either crash on missing foreign keys or silently create orphaned rows. A handful of dependencies were cyclic.
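The ordering problem is a textbook topological sort. A sketch using the standard library's graphlib (Python 3.9+), over a hypothetical four-table slice of the dependency map:

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical slice of the dependency map: table -> tables it references.
deps = {
    "drug_interactions": {"drugs"},
    "drugs": {"ingredients"},
    "ingredients": {"units"},
    "units": set(),
}

def import_order(deps):
    """Safe import order: every table appears after the tables it
    references. Raises CycleError when a cycle needs special handling."""
    return list(TopologicalSorter(deps).static_order())
```

For this slice the order comes out units, ingredients, drugs, drug_interactions; a CycleError is the signal to fall back to the two-pass strategy described below in the Approach section.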

And the dataset refreshes periodically. The whole pipeline had to be re-runnable — running on already-imported data should update, not duplicate.
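The re-runnable contract can be sketched with SQLite's upsert, keyed on a natural key (the platform itself uses Django's update_or_create; the table and column names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drug (code TEXT PRIMARY KEY, name TEXT)")

def upsert_drug(conn, code, name):
    """Keyed on the natural key, so a re-run updates the existing
    row instead of inserting a duplicate."""
    conn.execute(
        "INSERT INTO drug (code, name) VALUES (?, ?) "
        "ON CONFLICT(code) DO UPDATE SET name = excluded.name",
        (code, name),
    )

upsert_drug(conn, "D1", "Aspirin")
upsert_drug(conn, "D1", "Aspirin 81mg")  # second run: one row, new name
```

Run it twice and the table still holds a single row; that one property is what makes a periodic refresh safe.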

03 Approach

The work split into three phases.

Phase 1 — Discovery and schema design

I profiled every CSV: column names, data types inferred from value samples, distinct value counts, null patterns, and candidate foreign keys (detected by looking for high-overlap value sets between columns of different files). From that analysis I designed a normalized relational schema — 714 models, 1,229 FK and M2M relationships, organized into 13 bounded sub-domains. This phase produced no shipping code; it produced the foundation everything else stood on.
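A per-column profiler in this spirit fits in a few lines. A simplified sketch (it distinguishes only int, float, and string, and treats empty strings as nulls; the real profiler did more):

```python
def profile_column(values):
    """Summarize one CSV column: inferred type, null fraction,
    and distinct-value count."""
    non_null = [v for v in values if v != ""]

    def infer_type(vals):
        # Try the narrowest type first; fall through on any failure.
        for cast, name in ((int, "int"), (float, "float")):
            try:
                for v in vals:
                    cast(v)
                return name
            except ValueError:
                continue
        return "str"

    return {
        "type": infer_type(non_null) if non_null else "unknown",
        "null_frac": 1 - len(non_null) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
    }
```

Applied to every column of every file, summaries like these are what made 309 undocumented CSVs legible enough to design a schema against.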

Phase 2 — The importer

I built a dependency-graph ETL on top of the schema:

  • Parse each CSV's FK columns. Construct a directed graph of table dependencies. Topologically sort it to produce a safe import order.
  • Detect cyclic dependencies at graph-build time. Resolve them by splitting the import into two passes — first pass inserts rows with the cyclic FK nullable; second pass fills in the cyclic FK after both sides exist.
  • Make every insertion idempotent with get_or_create and update_or_create — re-runs update rather than duplicate.
  • Recover from migration conflicts in tiers: roll back partial state, re-migrate cleanly, or resume from the last successful checkpoint.
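A stripped-down version of the two-pass cycle fix, using a self-referencing SQLite table (the real cycles spanned separate tables, and the names here are illustrative, but the mechanics are the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE drug ("
    "  code TEXT PRIMARY KEY,"
    "  substitute_code TEXT REFERENCES drug(code))"  # nullable on purpose
)

rows = [("D1", "D2"), ("D2", "D1")]  # each drug references the other: a 2-cycle

# Pass 1: insert every row with the cyclic FK left NULL, so no insert
# can fail on a referenced row that doesn't exist yet.
conn.executemany("INSERT INTO drug (code) VALUES (?)",
                 [(code,) for code, _ in rows])

# Pass 2: both sides now exist, so the FK can be filled in safely.
conn.executemany("UPDATE drug SET substitute_code = ? WHERE code = ?",
                 [(sub, code) for code, sub in rows])
conn.commit()
```

With foreign-key enforcement on, inserting either row with its reference set in pass 1 would fail; deferring the cyclic column to pass 2 is the whole trick.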

Phase 3 — The API surface

On top of the importer, I layered versioned REST APIs (v1 and v1_with_marketed_changes), JWT auth, object-level permissions (django-guardian), rate limiting, field-level encryption, and OpenAPI 3.0 docs auto-generated by drf-spectacular. 730+ endpoints across 13 sub-domains, all following the same ViewSet patterns so consumers didn't have to learn endpoint-specific conventions.

For performance: a Redis caching layer with Celery-beat-driven background cache warming, plus Elasticsearch for full-text search across the corpus.

The trick that paid for itself

The importer doubles as the test fixture loader. Seeding 700+ models manually for tests would have been impossible; the same dependency-graph importer that loads production data builds the test database in seconds.

04 Outcome

The platform serves 714 models across 13 sub-domains, with 1,229 foreign-key and many-to-many relationships, and 730+ REST endpoints. The entire dataset can be refreshed reproducibly with a single command.

05 What I'd do differently

The dependency-graph importer is the kind of generic pattern I'd extract into a reusable package if I built another data platform. It's not pharmaceutical-specific.