OCR Pipeline

OCR Architecture

Six OCR engines with intelligent routing, multi-page PDF processing, and a normalizer pipeline.

Engines

Engine        | Speed     | Accuracy  | Best For                            | Cost
--------------|-----------|-----------|-------------------------------------|---------
PaddleOCR     | Fast      | Good      | General images, small PDFs          | Free
DocTR         | Medium    | Excellent | PDFs with tables, structured docs   | Free
Tesseract     | Fast      | Good      | Simple text, fallback               | Free
Google Vision | Slow      | Excellent | Complex docs, large files           | Paid API
Omniparser    | Medium    | Good      | Available, uses Tesseract fallback  | Free
CSV Loader    | Very Fast | Perfect   | CSV files (text parsing, not OCR)   | Free

Note: EasyOCR is installed but not actively used (dependency conflicts).

Data Flow

File Upload (PDF/PNG/JPEG/CSV)

OCR Router (selects engine based on file type, size, content)

Engine processes file → returns text + confidence

Normalizer → SynthralDocument

Output: PDF/DOCX/PPTX/RAG/Agents/UI
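
The routing step in the flow above can be sketched roughly as follows. This is a minimal illustration, not the actual router: the function name select_engine, the has_tables hint, the returned engine keys, and the 10 MB cutoff are all assumptions chosen to mirror the "Best For" column of the engine table.

```python
from pathlib import Path

# Hypothetical threshold; the source does not state the real size cutoff.
LARGE_FILE_BYTES = 10 * 1024 * 1024


def select_engine(filename: str, size_bytes: int, has_tables: bool = False) -> str:
    """Pick an engine key from file type, size, and a content hint (all assumed inputs)."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".csv":
        return "csv_loader"      # text parsing, not OCR
    if size_bytes > LARGE_FILE_BYTES:
        return "google_vision"   # complex docs, large files (paid API)
    if suffix == ".pdf" and has_tables:
        return "doctr"           # PDFs with tables, structured docs
    if suffix in {".pdf", ".png", ".jpg", ".jpeg"}:
        return "paddleocr"       # general images, small PDFs
    return "tesseract"           # simple text, fallback
```

Checking the CSV case first keeps text files out of the OCR path entirely, and placing the size check before the per-type rules means the paid API is used only when a file is too large for the free engines under this assumed policy.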

API Endpoints

  • POST /api/v1/ocr/upload — Upload file for async OCR processing
  • POST /api/v1/ocr/extract-sync — Synchronous extraction
  • GET /api/v1/ocr/jobs — List jobs (user-filtered)
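
As a rough client-side illustration, the upload call could be built as below using only the standard library. The endpoint path comes from the list above; the host, the multipart field name "file", and the Bearer token header are assumptions, since the source does not describe the request format or the auth scheme.

```python
import uuid
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # hypothetical host


def build_upload_request(filename: str, content: bytes, token: str) -> Request:
    """Build (but do not send) a multipart POST to the async upload endpoint."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # The field name "file" is an assumption, not confirmed by the source.
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return Request(
        f"{BASE_URL}/api/v1/ocr/upload",
        data=body,
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )
```

Passing the returned object to urllib.request.urlopen would send it; the same builder could target /api/v1/ocr/extract-sync if that endpoint accepts the same body, which the source does not confirm.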

Multi-Page PDF Processing

Bug found and fixed: for multi-page PDFs, only page 1 was processed; every later page (for example, pages 2-10 of a 10-page file) was silently discarded.

  • Root cause: _prepare_image_data() saved only images[0] (the first page)
  • Fix: process every page, in parallel via a ThreadPoolExecutor with 5 workers (about 3x speedup over sequential processing)
  • Format: "--- Page N ---" separators with confidence per page
  • Performance: 10 pages in 6-10s (vs 20-30s sequential)
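
A minimal sketch of the parallel approach described above, assuming a per-page callable that returns text and a confidence score. The names ocr_all_pages and ocr_page are hypothetical, and the exact placement of the confidence value next to the separator is a guess; only the "--- Page N ---" separator and the 5-worker pool come from the notes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

PageResult = Tuple[str, float]  # (text, confidence)


def ocr_all_pages(
    images: Sequence[bytes],
    ocr_page: Callable[[bytes], PageResult],
    max_workers: int = 5,
) -> str:
    """OCR every page (not just images[0]) in parallel and join them in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so pages stay in sequence
        # even when later pages finish first.
        results: List[PageResult] = list(pool.map(ocr_page, images))

    parts = []
    for number, (text, confidence) in enumerate(results, start=1):
        # Confidence formatting here is an assumed layout.
        parts.append(f"--- Page {number} --- (confidence: {confidence:.2f})\n{text}")
    return "\n\n".join(parts)
```

Threads suit this workload when the per-page call spends its time in native OCR code or network I/O; a pure-Python engine would gain little from threads because of the interpreter lock.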

Security Fix

Critical: OCR jobs were visible to ALL users (missing user_id field).

  • Added user_id field to OCRJob model (nullable for backward compat)
  • Jobs now filtered by user — other users can't see your OCR results
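
The per-user filter can be illustrated with a small in-memory sketch. The OCRJob fields other than user_id, the helper name jobs_for_user, and the choice to hide legacy jobs whose user_id is null are all assumptions; the real model and query layer are not shown in the source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OCRJob:
    id: int
    filename: str
    # Nullable for backward compatibility: jobs created before the fix have no owner.
    user_id: Optional[int] = None


def jobs_for_user(jobs: List[OCRJob], user_id: int) -> List[OCRJob]:
    """Return only jobs owned by user_id; ownerless legacy jobs are hidden (assumed policy)."""
    return [job for job in jobs if job.user_id == user_id]
```

In a database-backed service the same condition would normally live in the query itself, so rows belonging to other users are never loaded.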

Status Display Issue

Jobs show "completed" immediately and never show "running". This is a race condition: the job completes in under 1s, but the UI polls only every 5s, so the intermediate state is never observed. Fix: optimistic UI update plus 1s polling while processing is active.
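
The polling logic behind this fix might look like the sketch below, written in Python for consistency with the backend even though the real UI code is presumably in another language. The status names, both function names, and the idea of showing "running" before the first server response are illustrative assumptions; only the 1s and 5s intervals come from the notes.

```python
from typing import Iterable, Optional

ACTIVE_STATUSES = {"pending", "running"}  # assumed status names
FAST_POLL_SECONDS = 1.0   # while any job is still processing
IDLE_POLL_SECONDS = 5.0   # default interval


def next_poll_interval(statuses: Iterable[str]) -> float:
    """Poll every 1s while any job is active, otherwise fall back to 5s."""
    if any(status in ACTIVE_STATUSES for status in statuses):
        return FAST_POLL_SECONDS
    return IDLE_POLL_SECONDS


def display_status(server_status: Optional[str]) -> str:
    """Optimistic update: show "running" until the server reports a status."""
    return "running" if server_status is None else server_status
```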