Processing Pipeline
The pipeline is the ingestion engine. It watches the inbox folder, sends each file through OCR and LLM extraction, and files the result into the vault.
pipeline.default_flow decides which branch a new upload takes (ocr_llm or vision_llm). For existing documents, the Reprocess menu on the document page overrides the flow per-document (OCR+LLM, OCR only, LLM only, or Vision-LLM). Initial ingest and reprocess both run through the same run_extraction() strategy picker, so a 3-page blood test gets the same sectioning, chunking, or single-shot decision regardless of when it lands.
Worker Queue
The pipeline runs in a single-threaded worker thread with its own asyncio event loop. Every unit of work — fresh uploads from the inbox, manual reprocess clicks, retry-all-failed runs — is enqueued onto the same thread-safe PriorityQueue. The queue is the load-bearing primitive that enforces the “max one document at a time” invariant: there is one worker, it dequeues one job, awaits it to completion, and only then dequeues the next.
Each queue entry is a tagged tuple (priority, seq, kind, payload):
- `kind="upload"` — payload `{file_path, file_size}`. Calls `process_file()`.
- `kind="reprocess"` — payload `{doc_id, mode, llm_provider_id, ocr_provider_id, vision_provider_id}`. Calls `reprocess_document()`.
- `kind="translate"` — payload `{doc_id, llm_provider_id, resolved_providers}`. Runs the cached `documents.ocr_text` through the `translation_en` prompt and stores the result on `documents.ocr_text_en`. Side-job: deliberately does not flip `documents.status` to `"processing"`.
- `kind="translate_region"` — payload `{doc_id, region_row_id, page, bbox, ocr_provider_id, llm_provider_id, resolved_providers}`. Crops a normalized `[0,1]` rectangle on one PDF page with PyMuPDF, OCRs the crop, translates the OCR text, and updates the pre-allocated row in `region_translations` with text + thumbnail path.
priority is min-heap: smaller numbers run first. Inbox uploads use the file size (smaller files first, so a 100-page scan doesn’t head-of-line-block a one-pager). Reprocess clicks from the document detail page use priority 0, which jumps ahead of pending uploads — the user explicitly asked for it. Retry-all-failed uses priority 10 (ahead of uploads, behind interactive reprocess). seq is a monotonic counter so two jobs with the same priority compare on insertion order rather than on the payload dict.
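The ordering rules above can be sketched with the standard library; `enqueue_job` here is a stand-in for the project's helper, and the priority values are illustrative:

```python
import itertools
import queue

_seq = itertools.count()  # monotonic tie-breaker

def enqueue_job(q, priority, kind, payload):
    # seq breaks ties by insertion order, so the payload dict
    # (which is unorderable) is never compared by the heap.
    q.put((priority, next(_seq), kind, payload))

q = queue.PriorityQueue()
enqueue_job(q, 1_234_567, "upload", {"file_path": "100-page-scan.pdf"})
enqueue_job(q, 52_000, "upload", {"file_path": "one-pager.pdf"})
enqueue_job(q, 10, "reprocess", {"doc_id": 7})   # retry-all-failed
enqueue_job(q, 0, "reprocess", {"doc_id": 42})   # interactive reprocess

# interactive reprocess first, retry next, then uploads smallest-first
order = [q.get()[2:] for _ in range(4)]
```

The `(priority, seq, kind, payload)` shape matches the tuple described above; without `seq`, two same-priority entries would fall through to comparing their payload dicts and raise `TypeError`.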
Routing reprocess through the same queue as uploads — instead of spawning it as an asyncio.create_task on the FastAPI loop — fixes a class of concurrency bugs that used to surface as “the OCR chip switched mid-run.” Before the queue refactor, a reprocess started in the FastAPI loop and a fresh upload in the worker loop would each acquire a separate cap=1 slot from the credential gate (semaphores were keyed per-loop) and run in parallel against the same Ollama server. Both flows now share the worker loop, the queue, and the gate.
Worker per mode
In core mode the worker thread is launched by start_watcher() alongside the inbox file watcher and the scheduled-document sweeper.
In share mode the inbox watcher and scheduled-doc sweeper are not started — the public surface has no admin uploads to ingest, and a duplicate inbox watcher in a sibling container would race the core watcher on the same files. But the doctor’s translate endpoint still enqueues translate_region jobs onto the in-process queue, so a translate-only worker is launched by start_translate_worker(). Same _spawn_worker underneath, same liveness watchdog (5 s tick, respawn on death), same credential-keyed gate. Each container drains its own queue; jobs the doctor triggers never cross the process boundary.
When a file appears in vault/inbox/:
- The watcher does a short size-stability poll (≤ 2 s) so a still-being-written file doesn’t enter the queue half-formed.
- The watcher enqueues it via `enqueue_job(queue, "upload", ...)`. The same helper updates `pipeline_status.queue_depth` and `queued_jobs` so the dashboard sees a non-zero queue between processing ticks.
- The worker dequeues it, runs the appropriate function based on `kind`, and only then loops.
- After every successful tick the worker walks up from the just-handled file and `rmdir`s any empty parent directory inside the inbox tree, so per-upload sub-folders don’t linger.
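The size-stability poll can be sketched as a fixed-interval loop; the function name and exact cadence here are illustrative, not the project's actual implementation:

```python
import os
import time

def wait_until_stable(path, checks=4, interval=0.5):
    # Returns True once two consecutive polls see the same size,
    # i.e. the writer has (probably) finished; ~2 s worst case.
    last = -1
    for _ in range(checks):
        size = os.path.getsize(path)
        if size == last:
            return True
        last = size
        time.sleep(interval)
    return False
```

A file still growing between polls never reports stable, so it stays out of the queue until the next watcher tick.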
Configuration:
| Setting | Default | Description |
|---|---|---|
| `pipeline.watch_enabled` | `true` | Enable/disable the file watcher |
| `pipeline.poll_interval_seconds` | `5` | How often to check for new files |
| `pipeline.retry_interval_seconds` | `300` | Wait before retrying failed extractions |
| `pipeline.max_retries` | `3` | Maximum retry attempts |
DICOM Zip Handling
Imaging exams ship as a single `.zip` containing extension-less DICOM frames (e.g. `I1000000`), a `DICOMDIR` manifest, JPEG previews, and bookkeeping files. The upload route extracts the zip server-side before the watcher sees the contents:
- Detect: `application/zip` mime or `.zip` suffix.
- Validate: `zipfile.testzip()` rejects corrupt archives; the uncompressed-size budget caps zip-bomb expansion.
- Sanitise every member name through `safe_filename` / `safe_vault_join` (Zip Slip rejected).
- Peek bytes 128–131 of every member from the zip stream. If those four bytes equal `DICM`, the member is treated as a DICOM frame and written to disk under its final filename (`<stem>.dcm`); otherwise it is written as `<original-name>.bin` with a `.zip_member` sidecar carrying the original filename. Writing under the final name avoids a watcher-create race that the intermediate-rename approach used to hit.
- Sidecars: `.patient_hint`, `.event_hint`, and `.user_hint` are written next to every member so the pipeline links the file to the right patient / event / uploader regardless of inbox folder shape.
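The DICM sniff is cheap to reproduce: a DICOM Part 10 file carries a 128-byte preamble followed by the magic bytes `DICM`. A minimal sketch (the helper name is illustrative):

```python
def looks_like_dicom(head: bytes) -> bool:
    # Bytes 128-131 of a DICOM Part 10 file spell "DICM".
    return len(head) >= 132 and head[128:132] == b"DICM"
```

Only the first 132 bytes of each zip member need to be read to make the `.dcm`-vs-`.bin` decision.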
The pipeline worker then:
- Dispatch `.dcm` to `process_dicom` (DICOM ingest in `backend/asclepius/pipeline/dicom_ingest.py`). Frames are filed under `patients/{slug}/{year}/{study-folder}/series-N/` (no `imaging/` middle segment). The first frame creates a placeholder `imaging_report` document; subsequent frames find it via the deterministic study hash and bump series / image counters.
- Dispatch `.bin` to `process_zip_member`. The file is moved to `patients/{slug}/imaging-bundles/{zip_stem}/` under its restored original name. Bundle files do not get their own `documents` row; the imaging detail page surfaces them via `GET /api/imaging/{id}/bundle-files`.
- Auto-attach a report PDF if the file came from `POST /api/imaging/{id}/report`: an `.imaging_study_hint` sidecar tells the worker to repoint `imaging_studies.document_id` at the new (post-pipeline) document and delete the placeholder.
Patient Assignment
Documents can be pre-assigned to a patient in two ways:
- Upload via web UI — selecting a patient during upload writes a `.patient_hint` file alongside the document. When a patient is provided, the inbox sub-folder is the patient slug (e.g. `inbox/alex-smith/`); otherwise it falls back to `inbox/user-<id>/`. A `.user_hint` sidecar always carries the uploader id so the pipeline can stamp `uploaded_by_user_id` regardless of the folder name.
- Hint file — a file named `document.pdf.patient_hint` containing the patient ID (a single integer).
The pipeline reads and deletes the hint files during processing, then sets the patient_id on the document record.
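The read-then-delete handshake can be sketched as follows; the helper name is hypothetical, but the sidecar naming (`<filename>.patient_hint`, single integer) comes from the text above:

```python
from pathlib import Path

def consume_patient_hint(doc_path: Path):
    # Read the single-integer patient id from the sidecar, then
    # delete it so the hint is applied exactly once.
    hint = doc_path.with_name(doc_path.name + ".patient_hint")
    if not hint.exists():
        return None
    patient_id = int(hint.read_text().strip())
    hint.unlink()
    return patient_id
```

Deleting the sidecar on first read keeps reprocess runs from re-applying a stale assignment.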
OCR Phase
OCR providers are configured as an ordered list in Settings. The pipeline tries each enabled provider in priority order, falling back to the next if a provider returns empty text or fails. All engines return `(text, confidence, provider_name)`.
The provider_name stored in the database is the user-configured display name (e.g., “My Remote OCR”) rather than the technical engine type.
Provider Fallback Chain
- Try provider at priority 1
- If empty text or error → try priority 2
- Continue until text is extracted or all providers exhausted
- If all fail → mark document as `needs_review`
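The chain reduces to a loop. The provider-dict shape and helper name below are illustrative, but the `(text, confidence, provider_name)` contract matches the description above:

```python
def run_ocr_chain(providers, path):
    # Try each enabled provider in priority order; an exception or
    # empty text falls through to the next one.
    for prov in sorted(providers, key=lambda p: p["priority"]):
        if not prov.get("enabled", True):
            continue
        try:
            text, confidence, _engine = prov["run"](path)
        except Exception:
            continue
        if text and text.strip():
            # store the user-configured display name, not the engine type
            return text, confidence, prov["display_name"]
    return None  # caller marks the document needs_review
```

Returning the display name mirrors the note above about what gets stored in the database.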
Tesseract (Local)
- For PDFs: try embedded text first (from digital PDFs)
- If embedded text is insufficient (<50 chars): render pages at 300 DPI and OCR each page
- Calculate per-page confidence from Tesseract’s word-level confidence scores
- For large documents (>20 pages): progress tracking per page
LLM Vision
- Render each PDF page as a JPEG image (150 DPI, auto-downscale if >4.5MB)
- Send each page image to the LLM (Claude, OpenAI, or Ollama with vision model)
- LLM transcribes all visible text, preserving structure
- Transient failures (ReadTimeout, ConnectError, HTTP 429/5xx) retry with per-credential backoff (defaults to `[30, 60, 120]` seconds, configurable via `CredentialEntry.max_retries` / `retry_backoff_seconds`)
- Per-page calls are serialized through a process-global semaphore keyed by `credential_id`. All kinds (LLM, Vision-LLM, LLM-vision OCR) on the same credential share the same slots, so an Ollama server set to `max_concurrent=1` runs at most one request total — even when the FastAPI loop (chat, AI features) and the worker loop are both active. Caveat: the gate is in-memory, so split-mode deployments (a `core` container + a `share` container, see Doctor shares) each have their own copy. A credential reachable from both containers can run `2 × max_concurrent` requests in the absolute worst case. Halve the cap or use a dedicated credential per surface if your hardware can’t handle the doubled ceiling.
- Can use a separate provider/model/URL from the extraction LLM
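The credential gate can be sketched as a process-global registry of semaphores. This sketch assumes all callers share one event loop, which is exactly why reprocess jobs were moved onto the worker loop; the names are illustrative:

```python
import asyncio

_gates: dict[int, asyncio.Semaphore] = {}

def gate_for(credential_id: int, max_concurrent: int) -> asyncio.Semaphore:
    # One semaphore per credential_id, shared by chat, vision and
    # pipeline traffic alike. Only valid within a single event loop.
    if credential_id not in _gates:
        _gates[credential_id] = asyncio.Semaphore(max_concurrent)
    return _gates[credential_id]

async def call_llm(credential_id, counter):
    async with gate_for(credential_id, max_concurrent=1):
        counter["active"] += 1
        counter["peak"] = max(counter["peak"], counter["active"])
        await asyncio.sleep(0.01)  # stand-in for the HTTP request
        counter["active"] -= 1

async def main():
    counter = {"active": 0, "peak": 0}
    await asyncio.gather(*(call_llm(5, counter) for _ in range(4)))
    return counter["peak"]
```

With `max_concurrent=1`, four concurrent callers still observe a peak of one in-flight request — the behaviour described above for a capped Ollama server.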
Remote Tesseract
- Send the entire file to a remote Tesseract server via HTTP POST
- Server returns `{"text": "...", "confidence": 0.95}`
- Falls back to local Tesseract if the remote server fails
Google Cloud Vision
Uses the Google Cloud Vision API for OCR. Requires an API key.
Vision-LLM Flow (alternative to OCR + LLM)
When `pipeline.default_flow` is `vision_llm`, or the Reprocess menu is set to Vision-LLM, the pipeline takes a different path that skips the OCR and the LLM-classification steps entirely. Each page image is sent directly to a vision-capable LLM with a combined read-and-classify prompt. The model returns a single JSON document containing both `ocr_text` and all classification/universal fields (doc_type, dates, doctor, facility, summary).
- Iterate `vision.providers[]` in priority order; fall through to the next provider on failure.
- For each PDF page (or the single image), render to JPEG and send to the chosen provider (Ollama / Claude / OpenAI). Image dimensions are aligned to a 28-pixel patch grid and capped below the model’s `max_pixels` budget (e.g. `qwen2.5-vl`) so the server never silently rescales.
- Parse the JSON response; merge extractions across pages (first non-null value per key wins).
- Persist `ocr_text` + set `ocr_engine = vision_llm:<provider name>` on the document.
- Run Phase 2 type-specific extraction on the vision-produced OCR text using the same provider selected for vision. Lab results, medications, and diagnoses are populated even though classification came from the vision prompt.
- Call `extract_and_store` with the merged result as the override.
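The patch-grid alignment can be sketched as follows. The 28-pixel patch comes from the text above; the `max_pixels` default and function name here are illustrative assumptions:

```python
def fit_to_patch_grid(width, height, patch=28, max_pixels=1280 * 28 * 28):
    # Scale down (never up) so total pixels stay under the model's
    # budget, then snap both sides to the patch grid so the server
    # never silently rescales the image.
    scale = min(1.0, (max_pixels / (width * height)) ** 0.5)
    w = max(patch, int(width * scale) // patch * patch)
    h = max(patch, int(height * scale) // patch * patch)
    return w, h
```

Snapping down after scaling keeps the result both grid-aligned and under the pixel budget.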
Retries on transient failures are controlled per-credential (max_retries, retry_backoff_seconds). Per-page vision calls share the same credential-keyed semaphore as the LLM and the LLM-vision OCR — so vision traffic, OCR traffic, and extraction traffic on the same Ollama server all contend for the same max_concurrent slots, regardless of which event loop initiated the call.
Advantages: single model pull, no model swapping, and the model sees visual layout cues (bold headers, table grids, letterhead positioning, signatures) that OCR strips away.
Best for: Documents where OCR quality is poor, or when you’d rather not maintain separate OCR + text-LLM stacks.
Recommended local model: qwen2.5vl:7b (~6 GB VRAM) on Ollama. See LLM & OCR Configuration for the full size-vs-VRAM matrix.
Two-Phase Extraction
After OCR, the extracted text is sent to the LLM in two phases:
Retrieval-Augmented Extraction (Few-Shot Examples)
Before classification, the pipeline searches for similar previously-processed documents to use as few-shot examples in the prompt. This improves extraction quality, especially for smaller models like qwen2.5.
Example selection priority:
- Documents with user corrections from the same facility (highest quality, human-verified)
- Documents with user corrections from any facility
- Completed documents from the same facility
- FTS5 text similarity search (BM25 ranking on OCR text)
The system injects 1-2 compact examples (500-char OCR snippet + extraction result) into the classification prompt. If user corrections exist for an example document, the corrected values are used instead of the raw LLM output.
Facility detection happens heuristically by matching known facility names against the first 500 characters of OCR text (the letterhead area).
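The letterhead heuristic is a substring scan over the head of the OCR text; a minimal sketch with an illustrative function name:

```python
def detect_facility(ocr_text: str, known_facilities: list[str]):
    # Letterheads live at the top of the page, so only the first
    # 500 characters are searched.
    head = ocr_text[:500].lower()
    for name in known_facilities:
        if name.lower() in head:
            return name
    return None
```

A hit steers few-shot example selection toward same-facility documents, per the priority list above.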
Phase 1: Classification
A single prompt classifies the document and extracts basic metadata. The prompt is structured with the document content first, few-shot examples in the middle, and the JSON schema last (recency bias helps smaller models follow the schema).
- Document type (one of 10 canonical values: invoice, prescription, specialist_report, surgical_report, discharge, lab_test, vaccination, medical_certificate, imaging_report, other)
- Patient name (matched against existing patients)
- Doctor name (matched/created in the doctors table, with alias)
- Facility name (matched/created in the facilities table, with alias)
- Dates (`event_date` for the medical event itself, `issued_date` for when the document was produced)
- Specialty (normalized against the specialties table)
- Summary (English + source language)
When smaller LLMs return non-conforming JSON (e.g., using responsible instead of doctor), a salvage step attempts to map common alternative key names to the expected schema.
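The salvage step can be sketched as a key-alias map. The `responsible` → `doctor` mapping comes from the example above; the other aliases and names are illustrative:

```python
KEY_ALIASES = {
    "responsible": "doctor",       # from the example above
    "physician": "doctor",         # illustrative
    "document_type": "doc_type",   # illustrative
}

def salvage_keys(raw: dict, schema_keys: set[str]) -> dict:
    # Map common alternative key names onto the expected schema and
    # drop anything that still doesn't fit.
    out = {}
    for key, value in raw.items():
        key = KEY_ALIASES.get(key, key)
        if key in schema_keys:
            out[key] = value
    return out
```

Anything unrecognised after aliasing is dropped rather than guessed at, so a malformed response degrades to a partial extraction instead of a wrong one.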
The LLM provider name and model used for extraction are stored on the document (visible under “Processing details” in the document view).
Phase 2: Type-Specific Extraction
Based on the classified document type, a type-specific prompt extracts detailed structured data:
| Document Type | Extracted Data |
|---|---|
| `lab_test` | Lab results (test name, value, unit, reference range, abnormal flag) |
| `specialist_report` | Encounters (diagnosis, findings, follow-up), medications |
| `prescription` | Medications (name, dosage, form, frequency, duration) |
| `invoice` | Invoice line items (description, amount, tariff code, category) |
| `discharge` | Encounters, medications, diagnoses, follow-up instructions |
| `imaging_report` | Imaging findings, diagnoses |
| `vaccination` | Vaccination records (vaccine, manufacturer, lot, dose number) |
| `surgical_report` | Encounters with operative details |
| `medical_certificate` | Sick leave or fitness statements (period, restrictions) |
| `other` | No type-specific extraction; falls back to summary only |
Smart Page-Level Sectioning
For PDFs with more than 5 pages (`should_section()`), the pipeline classifies pages individually and extracts each group with its own prompt instead of sending the whole document to a single extraction call.
Page Classification Types
| Type | Description |
|---|---|
| `lab_results_page` | Laboratory test results |
| `clinical_notes` | Doctor’s clinical notes |
| `nursing_notes` | Nursing observations |
| `operative_notes` | Surgical operation details |
| `discharge_summary` | Discharge summary |
| `imaging_report` | Radiology/imaging report |
| `medication_chart` | Medication administration records |
| `vital_signs` | Vital signs monitoring |
| `consent_form` | Patient consent (skipped for extraction) |
| `cover_page` | Cover/title page (skipped for extraction) |
| `invoice_page` | Billing/invoice page |
| `correspondence` | Letters and correspondence |
| `other` | Unclassified content |
Sectioning Process
- Page classification — Pages are sent in batches of 10 to the LLM for classification
- Grouping — Consecutive pages of the same type are merged into sections
- Per-section extraction — Each section is extracted using the appropriate type-specific prompt
- Section summary — Each section gets a brief English summary
- Aggregation — All section extractions are merged, deduplicating lab results, medications, etc.
- Document-level classification — A classification prompt runs on the first ~5000 characters for overall document metadata
Sections are stored in the document_sections table and are visible in the document detail page.
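Step 2 of the process above — merging consecutive pages of the same type — is a classic `groupby`; a minimal sketch:

```python
from itertools import groupby

def group_sections(page_types: list[str]):
    # Consecutive pages sharing a classification become one section,
    # reported as an inclusive 1-based page range.
    sections, page = [], 1
    for ptype, run in groupby(page_types):
        n = len(list(run))
        sections.append({"type": ptype, "pages": (page, page + n - 1)})
        page += n
    return sections
```

Non-adjacent pages of the same type deliberately form separate sections, since they usually belong to different parts of the record.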
Chunked Extraction
For documents that are not large enough for sectioning, chunking is triggered whenever the cached OCR has more than one page or the concatenated OCR text exceeds 8,000 characters. This is deliberately aggressive: multi-page blood-test tables often fit well under the LLM’s input cap but overflow its output cap, so later-page rows are silently dropped if sent as a single prompt.
Two phases per document
Chunked extraction runs the same two phases as the non-chunked path, but with Phase 2 repeated per chunk and merged:
- Phase 1, classify on chunk 1. A short-schema call that captures universal fields (doc_type, patient, doctor, facility, dates, summary, language). Keeping it separate means small models only have to fit one concern in working memory at a time; that is why qwen2.5:14b now reliably returns these fields instead of zooming in on the loudest section (the lab table) and dropping everything else.
- Phase 2, type-specific extraction per chunk. Runs the prompt for the doc_type picked in Phase 1 (e.g. `lab_test` → lab-results-only schema). Each chunk produces its own extraction; `merge_extractions` dedupes overlap.
Page-aligned chunks
- Pages are loaded from the `ocr_page_cache` table (populated during OCR).
- Pages are greedily packed into chunks up to `_TARGET_CHUNK_CHARS` (~10k).
- The last page of each chunk is repeated as the first page of the next chunk, so any table spanning a page boundary is visible in full to at least one chunk.
- A preamble (`Chunk i of N, pages X-Y of Z, overlaps previous chunk`) is prepended so the LLM treats the text in context.
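The packing rule can be sketched like this; the 10k target mirrors `_TARGET_CHUNK_CHARS`, and the function name is illustrative:

```python
def pack_chunks(pages: list[str], target: int = 10_000):
    # Greedily pack pages up to the target character budget; when a
    # chunk closes, its last page is repeated at the start of the next
    # so a table spanning the boundary is whole in at least one chunk.
    chunks, current = [], []
    for page in pages:
        if current and sum(map(len, current)) + len(page) > target:
            chunks.append(current)
            current = [current[-1]]  # page overlap
        current.append(page)
    if current:
        chunks.append(current)
    return chunks
```

The overlap costs one duplicated page of tokens per boundary, which the dedup keys in `merge_extractions` absorb later.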
Truncation-aware retry
Each chunk is extracted in-memory; the merged result is stored exactly once at the end. If a chunk response is flagged `_truncated` or `_truncation_suspected` and contains more than one page, the chunk is bisected into two halves and each half is retried (depth-capped at 2). The bisection path keeps writes idempotent because nothing hits the DB until all chunks have succeeded.
Small models can also self-truncate mid-JSON well before the token cap is reached; the JSON parser detects the unclosed structure and flags the response as truncated, which feeds the same bisection loop. So “one chunk” on the first attempt often becomes “two single-page halves” on retry, and both finish cleanly.
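The bisection loop can be sketched with an injectable `extract` callable (the real code calls the LLM; the `_truncated` flag comes from the text above, the rest is illustrative):

```python
def extract_with_bisection(pages, extract, depth=0, max_depth=2):
    # Retry a truncated multi-page chunk as two halves; single pages
    # and depth-capped chunks keep whatever they managed to produce.
    result = extract(pages)
    if result.get("_truncated") and len(pages) > 1 and depth < max_depth:
        mid = len(pages) // 2
        left = extract_with_bisection(pages[:mid], extract, depth + 1, max_depth)
        right = extract_with_bisection(pages[mid:], extract, depth + 1, max_depth)
        return {"rows": left["rows"] + right["rows"]}
    return {"rows": result["rows"]}

def fake_llm(pages):
    # stand-in model that truncates any multi-page request
    if len(pages) > 1:
        return {"_truncated": True, "rows": pages[:1]}
    return {"_truncated": False, "rows": list(pages)}

merged = extract_with_bisection(["p1", "p2"], fake_llm)
```

Because nothing is persisted inside the recursion, a retry never produces duplicate DB writes — matching the idempotency note above.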
Merging & logging
`merge_extractions` deduplicates by:
- `test_name_original` for lab results
- `brand_name + active_ingredient_original` for medications
- `diagnosis_original` for diagnoses
- `vaccine_name + date_administered` for vaccinations
- `description + amount` for invoice line items
After merging, a page coverage line is logged: pages covered=N/total, number of lab results/medications/diagnoses produced, and a [TRUNCATION DETECTED] tag if any chunk (even after bisection) still hit the output cap. Missing pages show up explicitly instead of being lost silently.
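A sketch of the lab-result branch of the dedup; the other entity types follow the same pattern with their keys from the list above (the function name here is a simplified stand-in for `merge_extractions`):

```python
def merge_lab_results(chunks):
    # Dedupe across chunks by test_name_original, keeping the first
    # occurrence (earlier pages win); matching is case/space-insensitive
    # here as an illustrative choice.
    seen, merged = set(), []
    for chunk in chunks:
        for row in chunk.get("lab_results", []):
            key = row["test_name_original"].strip().lower()
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged
```

Rows duplicated by the chunk-overlap page collapse back to one entry, while genuinely new rows from later chunks survive.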
Cancellation
Document processing can be cancelled at any time from the web UI. The Cancel button triggers two mechanisms at once (belt-and-braces):
- Hard cancel: the pipeline keeps a registry of in-flight asyncio tasks keyed by `doc_id`. On cancel, the API calls `asyncio.Task.cancel()` on the registered task. This propagates `CancelledError` into whichever `await` the pipeline is parked on (typically the httpx POST to the LLM), aborts the HTTP connection, releases the credential gate slot via `async with` finalizers, and marks the document `cancelled`. The UI chip disappears within a second.
- Cooperative flag: the API also adds the doc_id to an in-memory `cancelled_docs` set. Every phase boundary (before OCR, between OCR and LLM, after LLM) checks the set and exits early. This is the fallback for the rare case where the hard cancel can’t interrupt the current await (e.g. a C-level blocking syscall).
Before this was belt-and-braces, cancel was cooperative only: a mid-extraction click had to wait for the LLM to finish its current call before the pipeline would notice. Reprocess also didn’t honour the flag at all. Both are fixed.
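Both mechanisms fit in a few lines. The registry and set names mirror the description above; everything else is an illustrative sketch:

```python
import asyncio

inflight: dict[int, asyncio.Task] = {}
cancelled_docs: set[int] = set()

async def process_document(doc_id: int) -> str:
    if doc_id in cancelled_docs:   # cooperative check at a phase boundary
        return "cancelled"
    try:
        await asyncio.sleep(3600)  # stand-in for the long LLM call
    except asyncio.CancelledError:
        return "cancelled"         # hard cancel lands here
    return "done"

async def demo() -> str:
    task = asyncio.ensure_future(process_document(42))
    inflight[42] = task
    await asyncio.sleep(0)         # let the task reach its await
    cancelled_docs.add(42)         # belt ...
    task.cancel()                  # ... and braces
    return await task
```

In production the `except` block would also release the gate slot and flip the document status; here it just reports the outcome.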
Name Normalization
Everything that references a canonical table (lab tests, medications, diagnoses, specialties, doctors, facilities) is matched in Python after extraction, not inside the LLM prompt. The LLM emits the document’s original wording; `asclepius.normalization.resolver` does the rest:
- Exact case-insensitive match against the alias table.
- Fuzzy match via `rapidfuzz.process.extractOne` (WRatio ≥ 85). Catches OCR drift and minor language variants.
- Auto-create a new canonical row if nothing matches, with the original wording as `canonical_display` and as an `auto_mapped=1` alias for user review in the Normalization UI.
- Document type names are also normalized against fuzzy alias tables.
Doctors and facilities still go through their dedicated _upsert_* helpers (slug matching + alias-aware upsert), which predate the resolver but behave the same way.
Before this refactor the prompt carried every (canonical_code, alias) pair inline so the LLM could pick. On a real install that payload reached 437 kB and broke schema adherence on smaller models. Doing the match in Python cut the extraction prompt to ~15-20 kB and made qwen2.5:14b viable end-to-end.
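The resolver's first two steps can be sketched with `difflib` standing in for rapidfuzz so the example stays dependency-free (rapidfuzz's WRatio scores differently; the 0.85 cutoff here is only analogous to the WRatio ≥ 85 threshold):

```python
import difflib

def resolve(name: str, aliases: dict[str, str], cutoff: float = 0.85):
    # Step 1: exact case-insensitive alias match.
    lowered = {alias.lower(): canon for alias, canon in aliases.items()}
    if name.lower() in lowered:
        return lowered[name.lower()]
    # Step 2: fuzzy match against the alias table; None means
    # "auto-create a new canonical row for user review".
    hit = difflib.get_close_matches(name.lower(), lowered, n=1, cutoff=cutoff)
    return lowered[hit[0]] if hit else None
```

The point is architectural: the alias table never enters the prompt, so prompt size stays flat no matter how many canonical entities the install accumulates.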
Chat context
The chat system prompt uses the same philosophy: no full entity tables in the prompt. `build_patient_context` (`backend/asclepius/chat/message_builder.py`) assembles a bounded per-patient rollup (identity plus the last 10 documents, 20 lab results, and 10 medications) and substitutes it into `{patient_context}`. Anything outside that window is reachable through the LLM-generated SQL path, not through more prompt stuffing. There is no MCP server and no vector retrieval; the SQL-generation prompt is the tool-call. Only classification, document_edit, and the legacy extraction_legacy prompt still ship the full patient_list / facility_list / doctor_list JSON; every other path is either deterministic (normalization) or bounded (chat rollup).
Progress Tracking
The pipeline maintains an in-memory status dict visible via `GET /api/pipeline/status`:
```json
{
  "queue_depth": 2,
  "processing": "document.pdf",
  "processing_step": "llm_extraction",
  "processing_doc_id": 42,
  "processing_pages": 15,
  "processing_page_current": 7,
  "last_processed": "previous.pdf",
  "total_processed": 128,
  "total_errors": 3,
  "recent_errors": [],
  "queued_files": [{"filename": "next.pdf", "size": 1234567}],
  "current_job": {
    "doc_id": 42,
    "filename": "document.pdf",
    "kind": "reprocess",
    "stage": "llm_extraction",
    "page_current": 7,
    "page_total": 15,
    "stages_planned": ["ocr", "llm_extraction"],
    "stages_done": ["ocr"],
    "started_at": "2026-04-29T11:38:08"
  },
  "queued_jobs": [{"kind": "upload", "label": "next.pdf", "doc_id": null}],
  "llm_queues": [],
  "watcher_active": true,
  "auto_stopped": false,
  "auto_stop_reason": ""
}
```

`current_job` is the richer descriptor the dashboard’s PipelineProgress widget renders against — it carries the job kind (upload / reprocess), the stages this job will go through (`stages_planned`), the ones already completed (`stages_done`), and the live page progress when OCR or vision is mid-run. `queued_jobs` mirrors the worker queue so the dashboard’s “Up next” rail shows what’s waiting. The legacy `processing` / `processing_step` / `processing_pages` fields are kept populated for backward compatibility — older clients (and the top-bar chip’s fallback path) still read them.
The flow architecture is implicit in stages_planned: a list containing vision_extraction is the Vision-LLM flow, a list containing ocr is the OCR + LLM flow. The frontend renders a small flow-type pill from this so users can A/B-compare flows without reading the stage list.
Stage Timeline
Every stage transition is also persisted to the `document_stage_events` table — one row per (document_id, stage, status) event. Recording is best-effort and runs alongside the in-memory `pipeline_status` updates: a hiccup on the events table never breaks the actual extraction run.
A stage event carries:
- `stage` — `ocr`, `vision_extraction`, `llm_extraction`, `page_classification`, `section_extraction`, `organizing`, `thumbnail`, `cache_ocr`, `translation` (whole-document English translate), `region_ocr` / `region_translation` (region-on-PDF translate)
- `status` — `started` / `completed` / `failed` / `skipped` / `cancelled`
- `job_kind` — `upload` or `reprocess`
- `started_at` and `finished_at` — for duration calculation
- `message` — populated on failures with the error message
- `page_current` / `page_total` — set on stages that work page-by-page (OCR, vision)
The document detail page reads these via GET /api/documents/{id}/stages and renders them as a vertical run-grouped timeline: each upload or reprocess invocation gets its own card, status-coloured marker dots show success/failure/cancellation per stage, and the run header summarises kind, total duration, outcome, and flow type. While the doc is the active pipeline job, the timeline polls every 2.5 s so stages tick in live.
Stage events also let the dashboard’s idle state remain useful — clicking on a recently-finished doc surfaces its full processing history (including past failures and reprocesses) without needing the in-memory status to still be populated.
Runtime Pipeline Control
The pipeline can be started and stopped at runtime from the Settings UI without restarting the application:
- Start/Stop buttons in Settings > Pipeline tab
- `POST /api/pipeline/start` and `POST /api/pipeline/stop` endpoints (admin only)
- Toggling `pipeline_watch_enabled` in settings also starts/stops the pipeline immediately
Auto-Stop on Provider Failures
If the pipeline encounters 5 consecutive provider connectivity failures (connection refused, timeout, HTTP 5xx), it automatically pauses and sets an auto_stopped flag. A warning banner appears in the Settings UI with a “Restart” button.
Only connectivity errors trigger auto-stop. Document-specific extraction failures (malformed content, unsupported format) do not.
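The auto-stop rule is a consecutive-streak counter. The threshold comes from the text above; the class and attribute names are illustrative:

```python
AUTO_STOP_AFTER = 5  # consecutive connectivity failures

class AutoStop:
    def __init__(self):
        self.streak = 0
        self.auto_stopped = False

    def record(self, *, connectivity_error: bool):
        # Only connectivity errors count toward the streak; a success
        # or a document-specific failure resets it.
        if connectivity_error:
            self.streak += 1
            self.auto_stopped = self.auto_stopped or self.streak >= AUTO_STOP_AFTER
        else:
            self.streak = 0
```

Resetting on any non-connectivity outcome is what keeps one unreachable provider from pausing the pipeline over a run of merely malformed documents.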
Extraction Validation
After LLM extraction, the pipeline validates that at least one meaningful field was produced (doc_type, summary, dates, lab results, medications, or diagnoses). If the extraction is completely empty, the document is marked needs_review with the error message “LLM extraction returned empty results” instead of being silently marked as done.
Correction-Driven Learning
When users manually edit document metadata (doc_type, dates, doctor name, facility name, summary, etc.) through the web UI or the AI Edit feature, the system captures these corrections as training signals.
Each correction records:
- Document ID — which document was corrected
- Field name — which field was changed (e.g. `doctor_name`, `doc_type`)
- LLM value — what the LLM originally extracted (from `raw_extraction`)
- Corrected value — what the user set
- Facility ID and doc type — denormalized for fast lookup by facility/type
These corrections serve two purposes:
- Few-shot example quality — Documents with corrections are preferred as few-shot examples in retrieval-augmented extraction, since they represent human-verified ground truth
- Learning signal — Corrections from the same facility are especially valuable, as documents from the same source share the same layout and formatting patterns
Corrections are logged transparently; no UI changes needed. The system compares each edit against the original raw_extraction JSON and only logs fields that actually differ from what the LLM produced.
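The diff-before-log rule reduces to a dict comparison; a minimal sketch with illustrative field names:

```python
def corrections_to_log(raw_extraction: dict, edited: dict) -> list[dict]:
    # Only fields that actually differ from the LLM output are logged;
    # untouched fields in an edit produce no correction rows.
    return [
        {"field": field,
         "llm_value": raw_extraction.get(field),
         "corrected_value": value}
        for field, value in edited.items()
        if value != raw_extraction.get(field)
    ]
```

Saving an edit form without changing anything therefore logs nothing, which keeps the correction table an honest record of real disagreements with the model.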