01Context
A clinical product needed a queryable source-of-truth for pharmaceutical reference data: drug interactions, dosing rules, REMS programs, allergies, prescriptions, and the workflows downstream of all of it.
The source data was delivered as roughly 300 MB across 309 interrelated CSV files — no schema document, no ER diagram, no formal description of what the data meant or how it related. Just files. The platform needed to ingest them reliably, expose them via REST APIs, and support full-text search across them.
02Challenge
The first problem wasn't writing code — it was understanding the data. 309 CSVs with no formal schema. Some column names hinted at their meaning; many didn't. Some foreign-key relationships were explicit (a column literally named drug_id); others had to be inferred from value distributions and overlaps across files.
Once the schema was understood, the second problem was ordering. drug_interactions.csv references rows in drugs.csv, which references ingredients.csv, which references units.csv. Import in the wrong order and you either crash on missing foreign keys or silently create orphaned rows. A handful of dependencies were cyclic.
And the dataset refreshes periodically. The whole pipeline had to be re-runnable: running it against already-imported data should update existing rows, not duplicate them.
03Approach
The work split into three phases.
Phase 1 — Discovery and schema design
I profiled every CSV: column names, data types inferred from value samples, distinct value counts, null patterns, and candidate foreign keys (detected by looking for high-overlap value sets between columns of different files). From that analysis I designed a normalized relational schema — 714 models, 1,229 FK and M2M relationships, organized into 13 bounded sub-domains. This phase produced no shipping code; it produced the foundation everything else stood on.
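To make the FK detection concrete, here's a minimal sketch of that profiling pass, assuming pandas; the file names, the 0.95 overlap threshold, and the function names are illustrative, and the real profiler also tracked type inference and null patterns per column.

```python
# Sketch of the discovery pass: profile each CSV and flag candidate FKs by
# value-set overlap between columns of different files. Names, the 0.95
# threshold, and the brute-force column-pair scan are illustrative; this only
# had to run once per data drop.
from pathlib import Path

import pandas as pd

def profile(csv_dir: str) -> dict[str, pd.DataFrame]:
    """Load every CSV and print basic shape, distinct-count, and null stats."""
    frames = {p.stem: pd.read_csv(p, dtype=str, low_memory=False)
              for p in Path(csv_dir).glob("*.csv")}
    for name, df in frames.items():
        print(name, df.shape, df.nunique().to_dict(), df.isna().mean().round(2).to_dict())
    return frames

def candidate_fks(frames: dict[str, pd.DataFrame], threshold: float = 0.95) -> list[tuple]:
    """A column is a candidate FK if nearly all of its non-null values also
    appear in some column of a different file (high value-set overlap)."""
    hits = []
    for src_name, src in frames.items():
        for col in src.columns:
            values = set(src[col].dropna())
            if not values:
                continue
            for tgt_name, tgt in frames.items():
                if tgt_name == src_name:
                    continue
                for tgt_col in tgt.columns:
                    overlap = len(values & set(tgt[tgt_col].dropna())) / len(values)
                    if overlap >= threshold:
                        hits.append((src_name, col, tgt_name, tgt_col, round(overlap, 3)))
    return hits
```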
Phase 2 — The importer
I built a dependency-graph ETL on top of the schema:
- Parse each CSV's FK columns. Construct a directed graph of table dependencies. Topologically sort it to produce a safe import order.
- Detect cyclic dependencies at graph-build time. Resolve them by splitting the import into two passes — first pass inserts rows with the cyclic FK nullable; second pass fills in the cyclic FK after both sides exist.
- Make every insertion idempotent with get_or_create and update_or_create, so re-runs update rows instead of duplicating them (the ordering and load logic are sketched below).
- Multi-tier migration conflict recovery: partial-state rollback, clean re-migration, resume from the last successful checkpoint.
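A minimal sketch of that ordering and idempotent-load logic, using the standard-library graphlib for the topological sort; FK_MAP, the table names, the single-cycle handling, and load_table are illustrative stand-ins for the real schema metadata and row loader.

```python
# Sketch of the import-ordering and idempotent-load logic. FK_MAP and
# load_table() are illustrative; the real importer derives the graph from the
# schema and also handles multiple cycles, checkpoints, and rollback.
from graphlib import CycleError, TopologicalSorter

# table -> set of tables it references via FK (built by parsing FK columns)
FK_MAP = {
    "units": set(),
    "ingredients": {"units"},
    "drugs": {"ingredients"},
    "drug_interactions": {"drugs"},
}

def import_order(fk_map):
    """Topologically sort tables so every FK target is imported first.
    On a cycle, relax one edge and defer that FK to a second pass."""
    try:
        return list(TopologicalSorter(fk_map).static_order()), []
    except CycleError as exc:
        # graphlib reports the cycle as [a, b, ..., a], each node an immediate
        # predecessor of the next. Drop the a -> b dependency, import in two
        # passes, and backfill the deferred FK once both sides exist.
        cycle = exc.args[1]
        table, dep = cycle[1], cycle[0]
        relaxed = {t: deps - {dep} if t == table else deps
                   for t, deps in fk_map.items()}
        return list(TopologicalSorter(relaxed).static_order()), [(table, dep)]

def load_table(model, rows, natural_key):
    """Idempotent load: re-runs update existing rows instead of duplicating.
    `natural_key` lists the columns that uniquely identify a row."""
    for row in rows:
        lookup = {k: row[k] for k in natural_key}
        defaults = {k: v for k, v in row.items() if k not in natural_key}
        model.objects.update_or_create(defaults=defaults, **lookup)
```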
Phase 3 — The API surface
On top of the importer, I layered versioned REST APIs (v1 and v1_with_marketed_changes), JWT auth, object-level permissions (django-guardian), rate limiting, field-level encryption, and OpenAPI 3.0 docs auto-generated by drf-spectacular. 730+ endpoints across 13 sub-domains, all following the same ViewSet patterns so consumers didn't have to learn endpoint-specific conventions.
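For illustration, a minimal sketch of that shared ViewSet shape; Drug, DrugSerializer, the module paths, and the throttle scope are hypothetical, and JWTAuthentication assumes djangorestframework-simplejwt (the write-up only specifies JWT auth).

```python
# Sketch of the ViewSet pattern repeated across sub-domains. Drug,
# DrugSerializer, the module paths, and the throttle scope are hypothetical;
# JWTAuthentication assumes djangorestframework-simplejwt.
from rest_framework import permissions, viewsets
from rest_framework.routers import DefaultRouter
from rest_framework.throttling import ScopedRateThrottle
from rest_framework_simplejwt.authentication import JWTAuthentication

from pharma.drugs.models import Drug                  # hypothetical app path
from pharma.drugs.serializers import DrugSerializer   # hypothetical app path

class DrugViewSet(viewsets.ReadOnlyModelViewSet):
    """Read-only reference data: list/retrieve with the same auth,
    permission, and throttling conventions as every other endpoint."""
    queryset = Drug.objects.all().order_by("id")
    serializer_class = DrugSerializer
    authentication_classes = [JWTAuthentication]
    # Object-level checks are delegated to the django-guardian backend.
    permission_classes = [permissions.IsAuthenticated, permissions.DjangoObjectPermissions]
    throttle_classes = [ScopedRateThrottle]
    throttle_scope = "reference-data"

# urls.py: one router registration per resource, so drf-spectacular can
# generate the OpenAPI 3.0 schema straight from the ViewSets.
router = DefaultRouter()
router.register(r"drugs", DrugViewSet, basename="drug")
```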
For performance: a Redis caching layer with Celery-beat-driven background cache warming, plus Elasticsearch for full-text search across the corpus.
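A minimal sketch of the cache-warming side, assuming Django's cache framework backed by Redis and the standard Django/Celery integration; the task, model, cache key, and 15-minute cadence are illustrative.

```python
# Sketch of Celery-beat-driven cache warming against Redis (via Django's cache
# framework). The task, model, key, cadence, and module paths are illustrative.
from celery import shared_task
from celery.schedules import crontab
from django.core.cache import cache

from pharma.drugs.models import DrugInteraction                 # hypothetical
from pharma.drugs.serializers import DrugInteractionSerializer  # hypothetical

@shared_task
def warm_drug_interaction_cache():
    """Pre-serialize the hottest reference-data query so request-time
    reads are cache hits instead of large joins."""
    payload = DrugInteractionSerializer(
        DrugInteraction.objects.select_related("drug")[:1000], many=True
    ).data
    cache.set("warm:drug-interactions:first-page", payload, timeout=60 * 20)

# settings.py, with the standard Django/Celery integration:
CELERY_BEAT_SCHEDULE = {
    "warm-drug-interactions": {
        "task": "pharma.drugs.tasks.warm_drug_interaction_cache",
        "schedule": crontab(minute="*/15"),
    },
}
```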
The trick that paid for itself
The importer doubles as the test fixture loader. Seeding 700+ models manually for tests would have been impossible; the same dependency-graph importer that loads production data builds the test database in seconds.
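As a sketch of what that looks like under pytest-django (the fixture name, run_import entry point, and fixture directory are hypothetical):

```python
# Sketch of reusing the importer as the test fixture loader, assuming
# pytest-django; run_import() and the paths are hypothetical stand-ins.
import pytest

@pytest.fixture(scope="session")
def reference_data(django_db_setup, django_db_blocker):
    """Build the test database with the same dependency-graph import as
    production, pointed at a trimmed-down CSV fixture directory."""
    from pharma.importer import run_import   # hypothetical module path
    with django_db_blocker.unblock():
        run_import(csv_dir="tests/fixtures/csv")
```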
04Outcome
The platform serves 714 models across 13 sub-domains, with 1,229 foreign-key and many-to-many relationships, and 730+ REST endpoints. The entire dataset can be refreshed reproducibly with a single command.
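That single command is, in spirit, a thin management-command wrapper around the importer; a minimal sketch, with hypothetical names:

```python
# Sketch of the single refresh entry point as a Django management command.
# The command name, option, and run_import() are hypothetical.
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Re-import the pharmaceutical reference dataset (idempotent)."

    def add_arguments(self, parser):
        parser.add_argument("--csv-dir", required=True)

    def handle(self, *args, **options):
        from pharma.importer import run_import   # hypothetical module path
        run_import(csv_dir=options["csv_dir"])
        self.stdout.write(self.style.SUCCESS("Reference data refreshed."))
```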
05What I'd do differently
The dependency-graph importer isn't pharmaceutical-specific; it's the kind of generic pattern I'd extract into a reusable package if I built another data platform.