Normalization
Why Normalization?
Section titled “Why Normalization?”Medical documents use different names for the same concept depending on language, provider, and region:
- “Hemoglobin”, “Haemoglobin”, “Hämoglobin”, “Haemoglobin (Hb)” are all the same lab test
- “Cardiology”, “Kardiologie”, “Cardiologie” are the same specialty
- “Ibuprofen”, “Brufen”, “Advil” refer to the same medication
Normalization maps these variations to a single canonical form. That gives you:
- Trend tracking that actually works across labs that name the same test differently
- Consistent filtering by specialty or diagnosis
- Cross-language search and reporting
Normalization categories
Section titled “Normalization categories”Asclepius normalizes six categories of medical data:
| Category | Canonical Table | Alias Table | Standard Code |
|---|---|---|---|
| Lab Tests | norm_lab_tests | norm_lab_test_aliases | LOINC |
| Specialties | norm_specialties | norm_specialty_aliases | — |
| Diagnoses | norm_diagnoses | norm_diagnosis_aliases | ICD-10 |
| Medications | norm_medications | norm_medication_aliases | ATC |
| Doctors | doctors | doctor_aliases | — |
| Facilities | facilities | facility_aliases | — |
Each canonical entry has:
- Canonical code — a unique identifier (e.g.,
HEMOGLOBIN,CARDIOLOGY, or a slug likedr-hans-mueller) - Canonical display name — human-readable name in English
- Standard code (where applicable) — LOINC, ICD-10, or ATC code
- Aliases — multiple alternative names in different languages
Doctors and facilities
Section titled “Doctors and facilities”Doctors and facilities use the same normalization system as medical concepts. The existing doctors and facilities tables have been extended with canonical_code and canonical_display columns, and new alias tables (doctor_aliases, facility_aliases) enable the same alias management, merge, and review workflows.
This is useful because the same doctor may appear under different name variations across documents (e.g., “Dr. H. Mueller” vs. “Dr. Hans Mueller”). Merging these entries consolidates all document references to a single record.
How normalization works
Section titled “How normalization works”The LLM’s only job is to copy the test / medication / diagnosis / specialty name exactly as the document writes it (e.g., “Hämoglobin”). A Python resolver then maps that raw name to a canonical ID:
- Exact alias match, case-insensitive lookup against the alias table. Covered by the seed data for common tests, specialties, diagnoses, and medications.
- Fuzzy match, if no exact hit, run
rapidfuzz.process.extractOnewith a score threshold of 85. Catches OCR drift, typos, and minor language variants (“Hepatite C” vs “Hepatitis C”, “S-Creatinina” vs “Creatinina”). The matching original text is inserted as a new alias on the canonical entry so the next document with the same phrasing lands on step 1. - Auto-create, still nothing? The resolver inserts a new canonical row with
canonical_displayset to the original wording and the original text as anauto_mapped=1alias. The row appears in Settings → Normalization flagged for review.
The alias tables aren’t sent to the LLM as prompt context any more. Before this refactor the prompt carried every (canonical_code, alias) row inline, a moderately-used install pushed extraction prompts past 400 kB, which broke schema adherence on small models. Doing the match in Python is faster, deterministic, auditable, and scales with the database instead of the prompt.
Auto-mapped aliases
Section titled “Auto-mapped aliases”Any alias inserted by the resolver (steps 2 and 3 above) has auto_mapped = 1. Manually curated aliases have auto_mapped = 0. The row displays an auto badge in the Normalization UI and contributes to the entry’s “unreviewed” count. Confirm them in bulk from the row-level Confirm action.
A self-alias whose text matches the canonical display name exactly (case- and whitespace-insensitive) is auto-confirmed on insert because there’s no normalization decision to audit, it’s just the canonical form echoed back. A one-shot migration clears the auto_mapped flag on any existing self-alias of that kind.
Managing normalization data
Section titled “Managing normalization data”From Settings → Document Analysis → Normalization, you can:
View canonical entries
Section titled “View canonical entries”Browse all canonical entries for each category with their alias counts. The entry’s canonical_code and canonical_display are both shown; long values truncate with a tooltip so rows stay on one line.
Add aliases
Section titled “Add aliases”Add new aliases to map additional name variations to existing canonical entries. Each alias has:
- Alias text — the alternative name
- Language — optional language code (e.g., “it”, “de”, “en”)
Edit canonical entries
Section titled “Edit canonical entries”Update the canonical code or display name of an entry. For doctors and facilities the edit also syncs the name column (used by document lists, filter dropdowns, and the extractor’s slug matching) and pushes the new display into any denormalized documents.doctor_name / facility_name cells, so the rename is visible everywhere.
If the new code collides with another entry’s code you’ll get a 409 with the message “Another X already has code ’…’. Use Merge to unify them instead of renaming.” That’s almost always the right action: the two entries are duplicates, merge them.
View linked documents
Section titled “View linked documents”Click Documents on any row to list every document that references that entry, with patient, doc type, and date. If there are none, the modal offers a one-click Delete to clean up the orphan canonical.
Delete entries
Section titled “Delete entries”Click Delete on a row (or from the linked-documents modal) to permanently remove a canonical entry. References in every linked table (documents, encounters, imaging_studies, medications, lab_results, doctors.facility_id, …) are set to NULL; the linked documents themselves stay intact, they just lose this particular classification. Aliases are also removed.
Merge entries
Section titled “Merge entries”If two canonical entries represent the same concept, merge them. Three flows:
-
Per row. Expand a row and click Merge (the Confirm action lives inside the same expanded row). Pick a target from the fuzzy-searchable dropdown, start typing part of the canonical display or code to filter. The dropdown also offers + Create new entry…, selecting it reveals Name and Code inputs and the merge creates the new canonical row first, then folds the source into it.
-
Batch (multi-select). Tick the checkboxes on several rows. A subdued bar appears above the table: “N selected, Merge into: [target ▾], Merge, Clear”. The target dropdown is fuzzy-searchable and includes the same + Create new entry… option.
-
Auto-merge with AI. Click Auto-merge with AI in the filter row. The current entries are run through a two-stage pipeline:
- Knowledge-base resolution. For lab tests / medications / diagnoses each entry’s display + aliases are looked up against a bundled reference (LOINC / ATC / ICD-10). Entries that resolve to the same external code are grouped immediately as high-confidence proposals, no LLM call needed. The reason field reads e.g. “Same ATC code (J01CA04)”.
- LLM fallback. Any entries that didn’t resolve (typos, unusual brand names, doctors / facilities / specialties, which have no public reference) are sent to the configured General LLM, which proposes additional groups with a short reason. Those come back marked for review.
Proposals are rendered inline, you can change the target, uncheck individual sources, skip, or approve each group. Each carries a
source(knowledge_baseorllm) andconfidence(highorreview) so you can scan past the high-confidence ones quickly. Nothing is merged until you click Apply merge.
Every merge:
- Moves the source’s aliases onto the target
- Copies the source’s display name as a new alias on the target (so future extractions of the old name resolve to the merged target via the alias lookup in
_upsert_doctor/_upsert_facility) - Updates every FK reference on linked tables
- Refreshes any denormalized
doctor_name/facility_namecells to the target’s display - Logs a row in
extraction_correctionsfor every affected document so the few-shot retriever surfaces the mapping on future extractions, the same learning signal produced when you rename a doctor from the document view - Deletes the source row
Batch merges run every step inside a single transaction.
Reference data
Section titled “Reference data”Asclepius ships with two layers of bundled medical reference data:
Seed data
Section titled “Seed data”Loaded into the database on first initialization from config/seeds/:
- Common lab tests (complete blood count, metabolic panels, lipid panels, etc.)
- Medical specialties
- Common diagnoses
- Common medications
These rows live in the normalization tables themselves and behave like any other entry, you can rename, alias, merge, or delete them. The seed only fires on an empty database; subsequent boots leave your data alone.
Knowledge bases
Section titled “Knowledge bases”A separate read-only side index used by the auto-merge feature, located at bundled_config/knowledge/:
| File | Codes | Source |
|---|---|---|
medications.json | ATC | Wikidata (CC0), ~3.7k drugs with multilingual brand and generic names |
diagnoses.json | ICD-10 | Wikidata (CC0), ~600 chapter and 3-character codes with multilingual labels |
lab_tests.json | LOINC | ~480 lab tests; English display names are the official LOINC LONG_COMMON_NAME field, multilingual aliases come from the LOINC Linguistic Variants (it/fr/de/es). LOINC is © Regenstrief Institute, Inc., see NOTICE for the required attribution |
These files are not loaded into the database. They sit in memory and are consulted whenever auto-merge needs to know whether two name variants (e.g. Amoxicillin and Zimox) refer to the same underlying concept. Same code → deterministic merge; mismatch → the LLM never sees them in the same prompt and can’t accidentally collapse them.
How the knowledge base feeds the merge
Section titled “How the knowledge base feeds the merge”The knowledge files are never sent to any LLM, in part or in whole, they’re not in the system prompt, the user prompt, or any context passed to the model. The data path is entirely in-process Python:
- Lazy load. On the first auto-merge call for a given category,
knowledge_base.pyreads the relevant JSON file from disk once and builds a flatdict[str, str]mappingnormalized_alias → external_code(~30k keys for medications, ~6k for diagnoses, ~3k for lab tests). The dict is cached at module scope; later calls reuse it. Per-process memory: a few MB. - Normalization. Each alias key is casefolded, has trailing dosage tokens stripped (
"Augmentin 500mg"→"augmentin"), parenthetical content removed, and punctuation collapsed. The same normalization is applied to every entry name being looked up. - Lookup. For each canonical entry the user is reviewing, every alias and the canonical display are normalized and looked up in the dict (O(1) per name). The first hit wins; the entry is tagged with that external code. Entries that don’t hit anything are unresolved.
- Group. Resolved entries are bucketed by external code. Any bucket with two or more entries becomes a deterministic merge proposal (
source: "knowledge_base",confidence: "high") with a reason like “Same ATC code (J01CA04)”. No LLM call has happened at this point. - Residual to the LLM. Only the unresolved entries are placed into the prompt sent to the configured General LLM. The prompt does not include the knowledge dictionary, the resolved entries, the external codes, or the matches found in step 3. If everything resolved, the LLM call is skipped entirely, you’ll see no
gate.enterlog line for that auto-merge run. - Merge. Both proposal lists are returned to the UI. Nothing is written to the database until you approve.
So a lab_tests auto-merge call on a database where the knowledge base recognises every entry’s name is a pure-Python computation that costs a single 550 KB file read and a few hundred dict lookups, with zero token cost and zero outbound LLM traffic. The LLM is reserved for the long tail of locally-named entries the public references can’t decide.
The build scripts under scripts/build_knowledge/ regenerate the JSON files from public Wikidata SPARQL with no third-party dependencies. For lab tests the script also reads an optional official LOINC overlay: register at https://loinc.org, drop LoincTableCore.csv at scripts/build_knowledge/loinc.csv and the per-language LinguisticVariant CSVs at scripts/build_knowledge/loinc_{it,fr,de,es}.csv, then re-run build_lab_tests.py. The script enriches existing entries, it doesn’t expand the file with the long tail of ~109k LOINC codes that nobody references in real reports. Both inputs are gitignored so a registered LOINC distribution stays local to your machine.
A per-install override at $ASCLEPIUS_CONFIG_PATH/../knowledge/{medications,diagnoses,lab_tests}.json shadows the bundled copy, mirroring the seed-data precedence.
API endpoints
Section titled “API endpoints”| Method | Path | Description |
|---|---|---|
GET | /api/normalization/{type} | List canonical entries (type: lab_tests, specialties, diagnoses, medications, doctors, facilities) |
GET | /api/normalization/{type}/{id} | Get a canonical entry with all its aliases |
PATCH | /api/normalization/{type}/{id} | Update canonical code / display (409 on code collision) |
DELETE | /api/normalization/{type}/{id} | Delete a canonical entry (nulls FKs on referencing tables) |
GET | /api/normalization/{type}/{id}/documents | List documents that reference this entry |
POST | /api/normalization/{type}/{id}/aliases | Add an alias |
DELETE | /api/normalization/{type}/aliases/{alias_id} | Delete an alias |
POST | /api/normalization/{type}/{id}/confirm | Mark every auto-mapped alias on this entry as reviewed |
POST | /api/normalization/{type}/merge | Merge one source into a target |
POST | /api/normalization/{type}/merge-batch | Merge many sources into an existing target or a newly-created one (target_id or new_target: {canonical_code, canonical_display}) |
POST | /api/normalization/{type}/auto-merge | Ask the LLM for merge proposals; returns {proposals, entries} without executing anything |