Processing Pipeline

The pipeline is the ingestion engine. It watches the inbox folder, sends each file through OCR and LLM extraction, and files the result into the vault.

[Flow diagram] File detected in inbox/ → SHA-256 hash + .patient_hint read → duplicate check (skip if already processed) → pipeline.default_flow router → either the OCR provider chain (tesseract → remote → llm-vision → gvision) or Vision-LLM read + classify (qwen2.5-vl · claude · gpt-4o vision), per-page output cached in ocr_page_cache → merged JSON + OCR text → run_extraction() strategy picker (section · chunk · single) → Phase 1: classify + few-shot retrieval → Phase 2: type-specific extraction (labs · meds · diagnoses) → store + organize file. Reprocessing reuses the same picker: section vs. chunk vs. single-shot is decided in exactly one place.

pipeline.default_flow decides which branch a new upload takes (ocr_llm or vision_llm). For existing documents, the Reprocess menu on the document page overrides the flow per-document (OCR+LLM, OCR only, LLM only, or Vision-LLM). Initial ingest and reprocess both run through the same run_extraction() strategy picker, so a 3-page blood test gets the same sectioning, chunking, or single-shot decision regardless of when it lands.

The pipeline runs in a single-threaded worker thread with its own asyncio event loop. Every unit of work — fresh uploads from the inbox, manual reprocess clicks, retry-all-failed runs — is enqueued onto the same thread-safe PriorityQueue. The queue is the load-bearing primitive that enforces the “max one document at a time” invariant: there is one worker, it dequeues one job, awaits it to completion, and only then dequeues the next.

Each queue entry is a tagged tuple (priority, seq, kind, payload):

  • kind="upload" — payload {file_path, file_size}. Calls process_file().
  • kind="reprocess" — payload {doc_id, mode, llm_provider_id, ocr_provider_id, vision_provider_id}. Calls reprocess_document().
  • kind="translate" — payload {doc_id, llm_provider_id, resolved_providers}. Runs the cached documents.ocr_text through the translation_en prompt and stores the result on documents.ocr_text_en. Side-job: deliberately does not flip documents.status to "processing".
  • kind="translate_region" — payload {doc_id, region_row_id, page, bbox, ocr_provider_id, llm_provider_id, resolved_providers}. Crops a normalized [0,1] rectangle on one PDF page with PyMuPDF, OCRs the crop, translates the OCR text, and updates the pre-allocated row in region_translations with text + thumbnail path.

priority is min-heap: smaller numbers run first. Inbox uploads use the file size (smaller files first, so a 100-page scan doesn’t head-of-line-block a one-pager). Reprocess clicks from the document detail page use priority 0, which jumps ahead of pending uploads — the user explicitly asked for it. Retry-all-failed uses priority 10 (ahead of uploads, behind interactive reprocess). seq is a monotonic counter so two jobs with the same priority compare on insertion order rather than on the payload dict.
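A minimal sketch of this queue discipline (enqueue_job exists per the steps below, but the bodies and handler names here are illustrative, not the actual implementation):

import itertools
import queue

_seq = itertools.count()          # monotonic tiebreaker
jobs = queue.PriorityQueue()      # thread-safe min-heap on the tuple

def enqueue_job(q, kind, payload, priority):
    # (priority, seq, kind, payload): seq breaks ties by insertion order,
    # so the heap never has to compare two payload dicts.
    q.put((priority, next(_seq), kind, payload))

# Priorities as described above:
#   reprocess click  -> 0                 (interactive, jumps the line)
#   retry-all-failed -> 10
#   inbox upload     -> file size bytes   (small files first)

def handle(kind, payload):
    ...  # dispatch to process_file / reprocess_document / translate jobs

def worker_loop(q):
    while True:
        priority, seq, kind, payload = q.get()  # blocks until a job exists
        handle(kind, payload)                   # run to completion...
        q.task_done()                           # ...before the next dequeue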

Routing reprocess through the same queue as uploads — instead of spawning it as an asyncio.create_task on the FastAPI loop — fixes a class of concurrency bugs that used to surface as “the OCR chip switched mid-run.” Before the queue refactor, a reprocess started in the FastAPI loop and a fresh upload in the worker loop would each acquire a separate cap=1 slot from the credential gate (semaphores were keyed per-loop) and run in parallel against the same Ollama server. Both flows now share the worker loop, the queue, and the gate.

In core mode the worker thread is launched by start_watcher() alongside the inbox file watcher and the scheduled-document sweeper.

In share mode the inbox watcher and scheduled-doc sweeper are not started — the public surface has no admin uploads to ingest, and a duplicate inbox watcher in a sibling container would race the core watcher on the same files. But the doctor’s translate endpoint still enqueues translate_region jobs onto the in-process queue, so a translate-only worker is launched by start_translate_worker(). Same _spawn_worker underneath, same liveness watchdog (5 s tick, respawn on death), same credential-keyed gate. Each container drains its own queue; jobs the doctor triggers never cross the process boundary.

When a file appears in vault/inbox/:

  1. The watcher does a short size-stability poll (≤ 2 s) so a still-being-written file doesn’t enter the queue half-formed (see the sketch after this list).
  2. The watcher enqueues it via enqueue_job(queue, "upload", ...). The same helper updates pipeline_status.queue_depth and queued_jobs so the dashboard sees a non-zero queue between processing ticks.
  3. The worker dequeues it, runs the appropriate function based on kind, and only then loops.
  4. After every successful tick the worker walks up from the just-handled file and rmdirs any empty parent directory inside the inbox tree, so per-upload sub-folders don’t linger.
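The stability poll from step 1 can be as small as two matching size samples; a sketch, with only the 2 s budget taken from the description above:

import os
import time

def wait_until_stable(path, timeout=2.0, interval=0.25):
    # Return True once the size stops changing, False if the file is
    # still growing when the budget runs out.
    deadline = time.monotonic() + timeout
    last = -1
    while time.monotonic() < deadline:
        size = os.path.getsize(path)
        if size == last:
            return True
        last = size
        time.sleep(interval)
    return False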

Configuration:

Setting                           Default   Description
pipeline.watch_enabled            true      Enable/disable the file watcher
pipeline.poll_interval_seconds    5         How often to check for new files
pipeline.retry_interval_seconds   300       Wait before retrying failed extractions
pipeline.max_retries              3         Maximum retry attempts

Imaging exams ship as a single .zip containing extension-less DICOM frames (e.g. I1000000), a DICOMDIR manifest, JPEG previews, and bookkeeping files. The upload route extracts the zip server-side before the watcher sees the contents:

  1. Detect: application/zip mime or .zip suffix.
  2. Validate: zipfile.testzip() rejects corrupt archives; the uncompressed-size budget caps zip-bomb expansion.
  3. Sanitise every member name through safe_filename / safe_vault_join (Zip Slip rejected).
  4. Peek bytes 128-131 of every member from the zip stream (see the sketch after this list). If those four bytes equal DICM, the member is treated as a DICOM frame and written to disk under its final filename (<stem>.dcm); otherwise it is written as <original-name>.bin with a .zip_member sidecar carrying the original filename. Writing under the final name avoids a watcher-create race that the intermediate-rename approach used to hit.
  5. Sidecars: .patient_hint, .event_hint, and .user_hint are written next to every member so the pipeline links the file to the right patient / event / uploader regardless of inbox folder shape.
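The DICOM detection in step 4 only needs the first 132 bytes of each member; a sketch:

import zipfile

def is_dicom_member(zf: zipfile.ZipFile, name: str) -> bool:
    # DICOM files start with a 128-byte preamble followed by the 4-byte
    # magic "DICM" at offsets 128..131.
    with zf.open(name) as fh:
        header = fh.read(132)
    return len(header) == 132 and header[128:132] == b"DICM"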

The pipeline worker then:

  1. Dispatch .dcm to process_dicom (DICOM ingest in backend/asclepius/pipeline/dicom_ingest.py). Frames are filed under patients/{slug}/{year}/{study-folder}/series-N/ (no imaging/ middle segment). The first frame creates a placeholder imaging_report document; subsequent frames find it via the deterministic study hash and bump series / image counters.
  2. Dispatch .bin to process_zip_member. The file is moved to patients/{slug}/imaging-bundles/{zip_stem}/ under its restored original name. Bundle files do not get their own documents row; the imaging detail page surfaces them via GET /api/imaging/{id}/bundle-files.
  3. Auto-attach a report PDF if the file came from POST /api/imaging/{id}/report: an .imaging_study_hint sidecar tells the worker to repoint imaging_studies.document_id at the new (post-pipeline) document and delete the placeholder.

Documents can be pre-assigned to a patient in two ways:

  1. Upload via web UI — selecting a patient during upload writes a .patient_hint file alongside the document. When a patient is provided, the inbox sub-folder is the patient slug (e.g. inbox/alex-smith/); otherwise it falls back to inbox/user-<id>/. A .user_hint sidecar always carries the uploader id so the pipeline can stamp uploaded_by_user_id regardless of the folder name.
  2. Hint file — a file named document.pdf.patient_hint containing the patient ID (a single integer).

The pipeline reads and deletes the hint files during processing, then sets the patient_id on the document record.
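A sketch of that hint handshake (the helper name is illustrative):

from pathlib import Path

def consume_patient_hint(doc_path: Path) -> int | None:
    # document.pdf -> document.pdf.patient_hint, containing a single integer
    hint = doc_path.with_name(doc_path.name + ".patient_hint")
    if not hint.exists():
        return None
    patient_id = int(hint.read_text().strip())
    hint.unlink()  # hints are one-shot: read, then deleted
    return patient_id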

OCR providers are configured as an ordered list in Settings. The pipeline tries each enabled provider in priority order, falling back to the next if a provider returns empty text or fails. All engines return (text, confidence, provider_name).

The provider_name stored in the database is the user-configured display name (e.g., “My Remote OCR”) rather than the technical engine type.

Fallback chain:

  1. Try the provider at priority 1
  2. If empty text or error → try priority 2
  3. Continue until text is extracted or all providers are exhausted
  4. If all fail → mark the document as needs_review

Local Tesseract:

  1. For PDFs: try embedded text first (from digital PDFs)
  2. If embedded text is insufficient (<50 chars): render pages at 300 DPI and OCR each page
  3. Calculate per-page confidence from Tesseract’s word-level confidence scores
  4. For large documents (>20 pages): progress tracking per page

LLM-vision OCR:

  1. Render each PDF page as a JPEG image (150 DPI, auto-downscale if >4.5 MB)
  2. Send each page image to the LLM (Claude, OpenAI, or Ollama with a vision model)
  3. The LLM transcribes all visible text, preserving structure
  4. Transient failures (ReadTimeout, ConnectError, HTTP 429/5xx) retry with per-credential backoff (defaults to [30, 60, 120] seconds, configurable via CredentialEntry.max_retries / retry_backoff_seconds)
  5. Per-page calls are serialized through a process-global semaphore keyed by credential_id. All kinds (LLM, Vision-LLM, LLM-vision OCR) on the same credential share the same slots, so an Ollama server set to max_concurrent=1 runs at most one request total — even when the FastAPI loop (chat, AI features) and the worker loop are both active. Caveat: the gate is in-memory, so split-mode deployments (a core container + a share container, see Doctor shares) each have their own copy. A credential reachable from both containers can run 2 × max_concurrent requests in the absolute worst case. Halve the cap or use a dedicated credential per surface if your hardware can’t handle the doubled ceiling.
  6. Can use a separate provider/model/URL from the extraction LLM

Remote Tesseract:

  1. Send the entire file to a remote Tesseract server via HTTP POST
  2. The server returns {"text": "...", "confidence": 0.95}
  3. Falls back to local Tesseract if the remote server fails

Google Cloud Vision: uses the Google Cloud Vision API for OCR. Requires an API key.
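A sketch of the fallback loop that wraps all of these engines (the provider objects are illustrative):

async def run_ocr(providers, document):
    # Try each enabled provider in priority order; fall through on empty
    # text or error. Every engine returns (text, confidence, provider_name).
    for provider in sorted(providers, key=lambda p: p.priority):
        if not provider.enabled:
            continue
        try:
            text, confidence, name = await provider.extract(document)
        except Exception:
            continue  # provider error: try the next one
        if text and text.strip():
            return text, confidence, name
    return None  # all exhausted: caller marks the document needs_review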

Vision-LLM Flow (alternative to OCR + LLM)

When pipeline.default_flow is vision_llm, or the Reprocess menu is set to Vision-LLM, the pipeline takes a different path that skips the OCR and the LLM-classification steps entirely. Each page image is sent directly to a vision-capable LLM with a combined read-and-classify prompt. The model returns a single JSON document containing both ocr_text and all classification/universal fields (doc_type, dates, doctor, facility, summary).

  1. Iterate vision.providers[] in priority order; fall through to the next provider on failure.
  2. For each PDF page (or the single image), render to JPEG and send to the chosen provider (Ollama / Claude / OpenAI). Image dimensions are aligned to a 28-pixel patch grid and capped below the model’s max_pixels budget (e.g. qwen2.5-vl) so the server never silently rescales (see the sketch after this list).
  3. Parse the JSON response; merge extractions across pages (first non-null value per key wins).
  4. Persist ocr_text + set ocr_engine = vision_llm:<provider name> on the document.
  5. Run Phase 2 type-specific extraction on the vision-produced OCR text using the same provider selected for vision. Lab results, medications, and diagnoses are populated even though classification came from the vision prompt.
  6. Call extract_and_store with the merged result as the override.
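The patch-grid alignment in step 2 amounts to scaling under the pixel budget and snapping to multiples of 28; a sketch (the max_pixels default here is illustrative):

import math

PATCH = 28  # qwen2.5-vl consumes images on a 28-pixel patch grid

def fit_to_patch_grid(w: int, h: int, max_pixels: int = 1_000_000):
    # Scale down so w*h stays under the model's budget...
    scale = min(1.0, math.sqrt(max_pixels / (w * h)))
    w, h = int(w * scale), int(h * scale)
    # ...then floor each side to the grid so the server never rescales.
    w = max(PATCH, w // PATCH * PATCH)
    h = max(PATCH, h // PATCH * PATCH)
    return w, h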

Retries on transient failures are controlled per-credential (max_retries, retry_backoff_seconds). Per-page vision calls share the same credential-keyed semaphore as the LLM and the LLM-vision OCR — so vision traffic, OCR traffic, and extraction traffic on the same Ollama server all contend for the same max_concurrent slots, regardless of which event loop initiated the call.
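Because the gate must hold across two event loops, a plain asyncio.Semaphore is not enough; a sketch of a cross-loop credential gate built on threading primitives (an assumption about the mechanism, not a copy of the implementation):

import asyncio
import threading

_gates: dict[int, threading.Semaphore] = {}
_gates_lock = threading.Lock()

def gate_for(credential_id: int, max_concurrent: int) -> threading.Semaphore:
    # threading.Semaphore, not asyncio: the slots must be shared by the
    # FastAPI loop and the worker loop in the same process.
    with _gates_lock:
        if credential_id not in _gates:
            _gates[credential_id] = threading.Semaphore(max_concurrent)
        return _gates[credential_id]

async def with_credential_gate(credential, call):
    gate = gate_for(credential.id, credential.max_concurrent)
    await asyncio.to_thread(gate.acquire)  # wait without blocking the loop
    try:
        return await call()
    finally:
        gate.release()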

Advantages: single model pull, no model swapping, and the model sees visual layout cues (bold headers, table grids, letterhead positioning, signatures) that OCR strips away.

Best for: Documents where OCR quality is poor, or when you’d rather not maintain separate OCR + text-LLM stacks.

Recommended local model: qwen2.5vl:7b (~6 GB VRAM) on Ollama. See LLM & OCR Configuration for the full size-vs-VRAM matrix.

After OCR, the extracted text is sent to the LLM in two phases:

Retrieval-Augmented Extraction (Few-Shot Examples)

Before classification, the pipeline searches for similar previously-processed documents to use as few-shot examples in the prompt. This improves extraction quality, especially for smaller models like qwen2.5.

Example selection priority:

  1. Documents with user corrections from the same facility (highest quality, human-verified)
  2. Documents with user corrections from any facility
  3. Completed documents from the same facility
  4. FTS5 text similarity search (BM25 ranking on OCR text)

The system injects 1-2 compact examples (500-char OCR snippet + extraction result) into the classification prompt. If user corrections exist for an example document, the corrected values are used instead of the raw LLM output.

Facility detection happens heuristically by matching known facility names against the first 500 characters of OCR text (the letterhead area).
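The last tier is plain SQLite FTS5; a sketch of the BM25 lookup, assuming a documents_fts table indexed on OCR text:

import sqlite3

def similar_documents(conn: sqlite3.Connection, ocr_text: str, limit: int = 2):
    # bm25() scores lower-is-better, so ascending ORDER BY ranks best first.
    query = " OR ".join(w for w in ocr_text[:500].split() if w.isalnum())
    return conn.execute(
        "SELECT rowid, bm25(documents_fts) AS score "
        "FROM documents_fts WHERE documents_fts MATCH ? "
        "ORDER BY score LIMIT ?",
        (query, limit),
    ).fetchall()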

A single prompt classifies the document and extracts basic metadata. The prompt is structured with the document content first, few-shot examples in the middle, and the JSON schema last (recency bias helps smaller models follow the schema).

  • Document type (one of 10 canonical values: invoice, prescription, specialist_report, surgical_report, discharge, lab_test, vaccination, medical_certificate, imaging_report, other)
  • Patient name (matched against existing patients)
  • Doctor name (matched/created in the doctors table, with alias)
  • Facility name (matched/created in the facilities table, with alias)
  • Dates (event_date for the medical event itself, issued_date for when the document was produced)
  • Specialty (normalized against the specialties table)
  • Summary (English + source language)

When smaller LLMs return non-conforming JSON (e.g., using responsible instead of doctor), a salvage step attempts to map common alternative key names to the expected schema.
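A sketch of the salvage step (the alias map shown is a small illustrative subset):

_KEY_ALIASES = {
    "responsible": "doctor",
    "physician": "doctor",
    "hospital": "facility",
    "document_type": "doc_type",
}

def salvage_keys(raw: dict) -> dict:
    # Remap common non-conforming key names onto the expected schema,
    # never overwriting a key the model already produced correctly.
    out = dict(raw)
    for alias, canonical in _KEY_ALIASES.items():
        if alias in out and canonical not in out:
            out[canonical] = out.pop(alias)
    return out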

The LLM provider name and model used for extraction are stored on the document (visible under “Processing details” in the document view).

Based on the classified document type, a type-specific prompt extracts detailed structured data:

Document type          Extracted data
lab_test               Lab results (test name, value, unit, reference range, abnormal flag)
specialist_report      Encounters (diagnosis, findings, follow-up), medications
prescription           Medications (name, dosage, form, frequency, duration)
invoice                Invoice line items (description, amount, tariff code, category)
discharge              Encounters, medications, diagnoses, follow-up instructions
imaging_report         Imaging findings, diagnoses
vaccination            Vaccination records (vaccine, manufacturer, lot, dose number)
surgical_report        Encounters with operative details
medical_certificate    Sick leave or fitness statements (period, restrictions)
other                  No type-specific extraction; falls back to summary only

For PDFs with more than 5 pages (should_section()), the pipeline classifies pages individually and extracts each group with its own prompt instead of sending the whole document to a single extraction call.

[Flow diagram] 1 · Input: multi-page PDF (page_count > 5, should_section()) → 2 · per-page OCR, cached in ocr_page_cache → 3 · classify in batches of 10, group consecutive same-type pages (page types such as lab_results_page and clinical_notes; cover_page is skipped) → 4 · merge with dedup of labs, meds, diagnoses, vaccines. A discharge summary with cover, history, and lab tables ends up as 3 sections, not one wall of text.
Type               Description
lab_results_page   Laboratory test results
clinical_notes     Doctor’s clinical notes
nursing_notes      Nursing observations
operative_notes    Surgical operation details
discharge_summary  Discharge summary
imaging_report     Radiology/imaging report
medication_chart   Medication administration records
vital_signs        Vital signs monitoring
consent_form       Patient consent (skipped for extraction)
cover_page         Cover/title page (skipped for extraction)
invoice_page       Billing/invoice page
correspondence     Letters and correspondence
other              Unclassified content
  1. Page classification — Pages are sent in batches of 10 to the LLM for classification
  2. Grouping — Consecutive pages of the same type are merged into sections (see the sketch after this list)
  3. Per-section extraction — Each section is extracted using the appropriate type-specific prompt
  4. Section summary — Each section gets a brief English summary
  5. Aggregation — All section extractions are merged, deduplicating lab results, medications, etc.
  6. Document-level classification — A classification prompt runs on the first ~5000 characters for overall document metadata
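The grouping step is a run-length merge over the per-page labels; a sketch:

from itertools import groupby

SKIP_TYPES = {"cover_page", "consent_form"}  # classified, never extracted

def group_sections(page_types: list[str]):
    # Consecutive pages of the same type become one (type, pages) section.
    sections, page_no = [], 1
    for page_type, run in groupby(page_types):
        pages = [page_no + i for i, _ in enumerate(run)]
        page_no = pages[-1] + 1
        if page_type not in SKIP_TYPES:
            sections.append((page_type, pages))
    return sections

# ["cover_page", "clinical_notes", "clinical_notes", "lab_results_page"]
# -> [("clinical_notes", [2, 3]), ("lab_results_page", [4])]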

Sections are stored in the document_sections table and are visible in the document detail page.

For documents that are not large enough for sectioning, chunking is triggered whenever the cached OCR has more than one page or the concatenated OCR text exceeds 8,000 characters. This is deliberately aggressive: multi-page blood-test tables often fit well under the LLM’s input cap but overflow its output cap, so later-page rows are silently dropped if sent as a single prompt.

Chunked extraction runs the same two phases as the non-chunked path, but with Phase 2 repeated per chunk and merged:

  1. Phase 1: classify on chunk 1. A short-schema call that captures universal fields (doc_type, patient, doctor, facility, dates, summary, language). Keeping it separate means small models only have to fit one concern in working memory at a time; this is why qwen2.5:14b now reliably returns these fields instead of zooming in on the loudest section (the lab table) and dropping everything else.
  2. Phase 2: type-specific extraction per chunk. Runs the prompt for the doc_type picked in Phase 1 (e.g. lab_test → lab-results-only schema). Each chunk produces its own extraction; merge_extractions dedupes overlap.
Chunk construction:

  1. Pages are loaded from the ocr_page_cache table (populated during OCR).
  2. Pages are greedily packed into chunks up to _TARGET_CHUNK_CHARS (~10k characters).
  3. The last page of each chunk is repeated as the first page of the next chunk, so any table spanning a page boundary is visible in full to at least one chunk (see the sketch after this list).
  4. A preamble (Chunk i of N, pages X-Y of Z, overlaps previous chunk) is prepended so the LLM reads the text in context.
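A sketch of the greedy packing with the one-page overlap, under the ~10k-character target named above:

_TARGET_CHUNK_CHARS = 10_000

def pack_pages(pages: list[str]) -> list[list[int]]:
    # Greedily fill chunks with page indices; when a chunk closes, its
    # last page is repeated at the head of the next so a table spanning
    # the boundary is whole in at least one chunk.
    chunks, current, size = [], [], 0
    for i, text in enumerate(pages):
        if current and size + len(text) > _TARGET_CHUNK_CHARS:
            chunks.append(current)
            current = [current[-1]]
            size = len(pages[current[0]])
        current.append(i)
        size += len(text)
    if current:
        chunks.append(current)
    return chunks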

Each chunk is extracted in-memory; the merged result is stored exactly once at the end. If a chunk response is flagged _truncated or _truncation_suspected and contains more than one page, the chunk is bisected into two halves and each half is retried (depth-capped at 2). The bisection path keeps writes idempotent because nothing hits the DB until all chunks have succeeded.

Small models can also self-truncate mid-JSON well before the token cap is reached; the JSON parser detects an unclosed structure and flags the response as truncated, which feeds the same bisection loop. So “one chunk” on the first attempt often becomes “two single-page halves” on retry, and both finish cleanly.
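A sketch of the bisection retry (llm_extract and merge_extractions stand in for the real Phase 2 call and merger):

async def extract_chunk(pages, depth=0, max_depth=2):
    result = await llm_extract(pages)  # Phase 2 call for this chunk
    truncated = result.get("_truncated") or result.get("_truncation_suspected")
    if truncated and len(pages) > 1 and depth < max_depth:
        # Split in half and retry each side; nothing is written to the DB
        # until every chunk (and sub-chunk) has succeeded.
        mid = len(pages) // 2
        left = await extract_chunk(pages[:mid], depth + 1, max_depth)
        right = await extract_chunk(pages[mid:], depth + 1, max_depth)
        return merge_extractions([left, right])
    return result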

merge_extractions deduplicates by:

  • test_name_original for lab results
  • brand_name + active_ingredient_original for medications
  • diagnosis_original for diagnoses
  • vaccine_name + date_administered for vaccinations
  • description + amount for invoice line items
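A sketch of the first-wins dedup over those keys (field names per the list above; the list names themselves are assumptions):

_DEDUP_KEYS = {
    "lab_results": lambda r: (r.get("test_name_original"),),
    "medications": lambda r: (r.get("brand_name"),
                              r.get("active_ingredient_original")),
    "diagnoses": lambda r: (r.get("diagnosis_original"),),
    "vaccinations": lambda r: (r.get("vaccine_name"),
                               r.get("date_administered")),
    "invoice_line_items": lambda r: (r.get("description"), r.get("amount")),
}

def merge_lists(kind: str, per_chunk: list[list[dict]]) -> list[dict]:
    # First occurrence wins; later chunks only add rows with unseen keys.
    seen, merged = set(), []
    for rows in per_chunk:
        for row in rows:
            key = _DEDUP_KEYS[kind](row)
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged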

After merging, a page coverage line is logged: pages covered=N/total, number of lab results/medications/diagnoses produced, and a [TRUNCATION DETECTED] tag if any chunk (even after bisection) still hit the output cap. Missing pages show up explicitly instead of being lost silently.

Document processing can be cancelled at any time from the web UI. The Cancel button triggers two mechanisms at once (belt-and-braces):

  1. Hard cancel: the pipeline keeps a registry of in-flight asyncio tasks keyed by doc_id. On cancel, the API calls asyncio.Task.cancel() on the registered task. This propagates CancelledError into whichever await the pipeline is parked on (typically the httpx POST to the LLM), aborts the HTTP connection, releases the credential gate slot via async with finalizers, and marks the document cancelled. The UI chip disappears within a second.
  2. Cooperative flag: the API also adds the doc_id to an in-memory cancelled_docs set. Every phase boundary (before OCR, between OCR and LLM, after LLM) checks the set and exits early. This is the fallback for the rare case where the hard cancel can’t interrupt the current await (e.g. a C-level blocking syscall).

Before this was belt-and-braces, cancel was cooperative only: a mid-extraction click had to wait for the LLM to finish its current call before the pipeline would notice. Reprocess also didn’t honour the flag at all. Both are fixed.

Everything that references a canonical table (lab tests, medications, diagnoses, specialties, doctors, facilities) is matched in Python after extraction, not inside the LLM prompt. The LLM emits the document’s original wording; asclepius.normalization.resolver does the rest:

  1. Exact case-insensitive match against the alias table.
  2. Fuzzy match via rapidfuzz.process.extractOne (WRatio ≥ 85). Catches OCR drift and minor language variants.
  3. Auto-create a new canonical row if nothing matches, with the original wording as canonical_display and as an auto_mapped=1 alias for user review in the Normalization UI.
  4. Document type names are also normalized against fuzzy alias tables.
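A sketch of that match ladder (the alias mapping shape is an assumption):

from rapidfuzz import fuzz, process

def resolve(original: str, aliases: dict[str, int]) -> int | None:
    # aliases maps alias text -> canonical row id.
    # 1. exact, case-insensitive
    by_lower = {k.lower(): v for k, v in aliases.items()}
    if original.lower() in by_lower:
        return by_lower[original.lower()]
    # 2. fuzzy: WRatio >= 85 catches OCR drift and minor language variants
    match = process.extractOne(original, list(aliases),
                               scorer=fuzz.WRatio, score_cutoff=85)
    if match is not None:
        return aliases[match[0]]
    # 3. no match: caller auto-creates a canonical row plus an
    #    auto_mapped=1 alias for review in the Normalization UI
    return None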

Doctors and facilities still go through their dedicated _upsert_* helpers (slug matching + alias-aware upsert), which predate the resolver but behave the same way.

Before this refactor the prompt carried every (canonical_code, alias) pair inline so the LLM could pick. On a real install that payload reached 437 kB and broke schema adherence on smaller models. Doing the match in Python cut the extraction prompt to ~15-20 kB and made qwen2.5:14b viable end-to-end.

The chat system prompt uses the same philosophy: no full entity tables in the prompt. build_patient_context (backend/asclepius/chat/message_builder.py) assembles a bounded per-patient rollup (identity plus the last 10 documents, 20 lab results, and 10 medications) and substitutes it into {patient_context}. Anything outside that window is reachable through the LLM-generated SQL path, not through more prompt stuffing. There is no MCP server and no vector retrieval; the SQL-generation prompt is the tool-call. Only classification, document_edit, and the legacy extraction_legacy prompt still ship the full patient_list / facility_list / doctor_list JSON; every other path is either deterministic (normalization) or bounded (chat rollup).

The pipeline maintains an in-memory status dict visible via GET /api/pipeline/status:

{
  "queue_depth": 2,
  "processing": "document.pdf",
  "processing_step": "llm_extraction",
  "processing_doc_id": 42,
  "processing_pages": 15,
  "processing_page_current": 7,
  "last_processed": "previous.pdf",
  "total_processed": 128,
  "total_errors": 3,
  "recent_errors": [],
  "queued_files": [
    {"filename": "next.pdf", "size": 1234567}
  ],
  "current_job": {
    "doc_id": 42,
    "filename": "document.pdf",
    "kind": "reprocess",
    "stage": "llm_extraction",
    "page_current": 7,
    "page_total": 15,
    "stages_planned": ["ocr", "llm_extraction"],
    "stages_done": ["ocr"],
    "started_at": "2026-04-29T11:38:08"
  },
  "queued_jobs": [
    {"kind": "upload", "label": "next.pdf", "doc_id": null}
  ],
  "llm_queues": [],
  "watcher_active": true,
  "auto_stopped": false,
  "auto_stop_reason": ""
}

current_job is the richer descriptor the dashboard’s PipelineProgress widget renders against — it carries the job kind (upload / reprocess), the stages this job will go through (stages_planned), the ones already completed (stages_done), and the live page progress when OCR or vision is mid-run. queued_jobs mirrors the worker queue so the dashboard’s “Up next” rail shows what’s waiting. The legacy processing / processing_step / processing_pages fields are kept populated for backward compatibility — older clients (and the top-bar chip’s fallback path) still read them.

The flow architecture is implicit in stages_planned: a list containing vision_extraction is the Vision-LLM flow, a list containing ocr is the OCR + LLM flow. The frontend renders a small flow-type pill from this so users can A/B-compare flows without reading the stage list.

Every stage transition is also persisted to the document_stage_events table — one row per (document_id, stage, status) event. Recording is best-effort and runs alongside the in-memory pipeline_status updates: a hiccup on the events table never breaks the actual extraction run.

A stage event carries:

  • stage — ocr, vision_extraction, llm_extraction, page_classification, section_extraction, organizing, thumbnail, cache_ocr, translation (whole-document English translate), region_ocr / region_translation (region-on-PDF translate)
  • status — started / completed / failed / skipped / cancelled
  • job_kind — upload or reprocess
  • started_at and finished_at — for duration calculation
  • message — populated on failures with the error message
  • page_current / page_total — set on stages that work page-by-page (OCR, vision)

The document detail page reads these via GET /api/documents/{id}/stages and renders them as a vertical run-grouped timeline: each upload or reprocess invocation gets its own card, status-coloured marker dots show success/failure/cancellation per stage, and the run header summarises kind, total duration, outcome, and flow type. While the doc is the active pipeline job, the timeline polls every 2.5 s so stages tick in live.

Stage events also let the dashboard’s idle state remain useful — clicking on a recently-finished doc surfaces its full processing history (including past failures and reprocesses) without needing the in-memory status to still be populated.

The pipeline can be started and stopped at runtime from the Settings UI without restarting the application:

  • Start/Stop buttons in Settings > Pipeline tab
  • POST /api/pipeline/start and POST /api/pipeline/stop endpoints (admin only)
  • Toggling pipeline_watch_enabled in settings also starts/stops the pipeline immediately

If the pipeline encounters 5 consecutive provider connectivity failures (connection refused, timeout, HTTP 5xx), it automatically pauses and sets an auto_stopped flag. A warning banner appears in the Settings UI with a “Restart” button.

Only connectivity errors trigger auto-stop. Document-specific extraction failures (malformed content, unsupported format) do not.
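A sketch of the auto-stop bookkeeping, assuming any success (or document-specific failure) resets the streak:

CONSECUTIVE_LIMIT = 5
_connectivity_failures = 0

def note_job_result(connectivity_error: str | None):
    global _connectivity_failures
    if connectivity_error is None:
        _connectivity_failures = 0  # success or document-specific failure
        return
    _connectivity_failures += 1
    if _connectivity_failures >= CONSECUTIVE_LIMIT:
        stop_pipeline(auto_stopped=True,  # illustrative helper
                      auto_stop_reason=connectivity_error)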

After LLM extraction, the pipeline validates that at least one meaningful field was produced (doc_type, summary, dates, lab results, medications, or diagnoses). If the extraction is completely empty, the document is marked needs_review with the error message “LLM extraction returned empty results” instead of being silently marked as done.
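A sketch of the emptiness check (field names follow the list in the paragraph above):

def is_meaningful(extraction: dict) -> bool:
    scalar = any(extraction.get(k) for k in
                 ("doc_type", "summary", "event_date", "issued_date"))
    lists = any(extraction.get(k) for k in
                ("lab_results", "medications", "diagnoses"))
    return scalar or lists

# Not meaningful -> status becomes needs_review with
# "LLM extraction returned empty results" rather than done.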

When users manually edit document metadata (doc_type, dates, doctor name, facility name, summary, etc.) through the web UI or the AI Edit feature, the system captures these corrections as training signals.

Each correction records:

  • Document ID — which document was corrected
  • Field name — which field was changed (e.g. doctor_name, doc_type)
  • LLM value — what the LLM originally extracted (from raw_extraction)
  • Corrected value — what the user set
  • Facility ID and doc type — denormalized for fast lookup by facility/type

These corrections serve two purposes:

  1. Few-shot example quality — Documents with corrections are preferred as few-shot examples in retrieval-augmented extraction, since they represent human-verified ground truth
  2. Learning signal — Corrections from the same facility are especially valuable, as documents from the same source share the same layout and formatting patterns

Corrections are logged transparently; no UI changes needed. The system compares each edit against the original raw_extraction JSON and only logs fields that actually differ from what the LLM produced.
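A sketch of the diff-only logging (the record shape follows the field list above; names are illustrative):

def correction_rows(doc, raw_extraction: dict, edits: dict) -> list[dict]:
    rows = []
    for field, new_value in edits.items():
        llm_value = raw_extraction.get(field)
        if new_value != llm_value:  # only log real disagreements
            rows.append({
                "document_id": doc.id,
                "field_name": field,
                "llm_value": llm_value,
                "corrected_value": new_value,
                "facility_id": doc.facility_id,  # denormalized for lookup
                "doc_type": doc.doc_type,
            })
    return rows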