OCR Pipeline

OCR Architecture

Six engines (five OCR engines plus a CSV loader) with intelligent routing, multi-page PDF processing, and a normalizer pipeline

Engines

Engine	Speed	Accuracy	Best For	Cost
PaddleOCR	Fast	Good	General images, small PDFs	Free
DocTR	Medium	Excellent	PDFs with tables, structured docs	Free
Tesseract	Fast	Good	Simple text, fallback	Free
Google Vision	Slow	Excellent	Complex docs, large files	Paid API
Omniparser	Medium	Good	Available, uses Tesseract fallback	Free
CSV Loader	Very Fast	Perfect	CSV files (text parsing, not OCR)	Free

Note: EasyOCR is installed but not actively used (dependency conflicts).

Data Flow

File Upload (PDF/PNG/JPEG/CSV)
  ↓
OCR Router (selects engine based on file type, size, content)
  ↓
Engine processes file → returns text + confidence
  ↓
Normalizer → SynthralDocument
  ↓
Output: PDF/DOCX/PPTX/RAG/Agents/UI
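
The routing step above can be sketched as a small decision function. This is a minimal illustration, not the router's actual logic: the `FileInfo` type, the `has_tables` flag, the `select_engine` name, and the 10 MB size cutoff are all assumptions made for the example. Only the engine names and their "Best For" roles come from the table above.

```python
from dataclasses import dataclass
from pathlib import Path

# Hypothetical cutoff; the real router's size threshold is not documented here.
LARGE_FILE_BYTES = 10 * 1024 * 1024


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical description of an uploaded file."""
    name: str
    size_bytes: int
    has_tables: bool = False  # assumed to come from a cheap pre-scan


def select_engine(info: FileInfo) -> str:
    """Pick an engine name following the 'Best For' column of the engine table."""
    suffix = Path(info.name).suffix.lower()
    if suffix == ".csv":
        return "CSV Loader"      # text parsing, no OCR needed
    if info.size_bytes > LARGE_FILE_BYTES:
        return "Google Vision"   # complex docs, large files
    if suffix == ".pdf" and info.has_tables:
        return "DocTR"           # PDFs with tables, structured docs
    if suffix in {".pdf", ".png", ".jpg", ".jpeg"}:
        return "PaddleOCR"       # general images, small PDFs
    return "Tesseract"           # simple text, fallback


print(select_engine(FileInfo("invoice.pdf", 200_000, has_tables=True)))
```

Checking the cheapest condition (file extension) first keeps CSV files out of the OCR path entirely, which matches the table's note that CSV loading is text parsing rather than OCR.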

API Endpoints

POST /api/v1/ocr/upload — Upload file for async OCR processing
POST /api/v1/ocr/extract-sync — Synchronous extraction
GET /api/v1/ocr/jobs — List jobs (user-filtered)
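
As a rough illustration of calling the upload endpoint, the sketch below builds a multipart request with the Python standard library. Only the path `/api/v1/ocr/upload` comes from the list above. The form field name `file`, the bearer-token header, and the base URL are assumptions, so check them against the actual API before relying on this.

```python
import urllib.request
import uuid


def build_upload_request(base_url: str, filename: str, data: bytes,
                         token: str) -> urllib.request.Request:
    """Build (but do not send) a multipart POST to the async upload endpoint."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # Field name "file" is an assumption, not confirmed by the docs.
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        data,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/ocr/upload",
        data=body,
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            # Bearer auth is an assumption; the auth scheme is not documented here.
            "Authorization": f"Bearer {token}",
        },
    )


req = build_upload_request("http://localhost:8000", "scan.pdf", b"%PDF-1.7", "my-token")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` would send it. The response shape is not described on this page, so it is not shown here.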

Multi-Page PDF Processing

Bug found and fixed: multi-page PDFs were silently truncated. Only page 1 was processed; pages 2-10 were discarded.

Root cause: _prepare_image_data() saved only images[0], so later pages were never processed
Fix: process every page, in parallel, using a ThreadPoolExecutor with 5 workers (about 3x speedup)
Format: "--- Page N ---" separators, with a confidence value per page
Performance: 10 pages in 6-10s (vs 20-30s sequential)
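
The fix described above can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: `ocr_all_pages` and the `ocr_page` callback are hypothetical names, and the placement of the confidence line under each separator is a guess at the output format.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple

PageResult = Tuple[str, float]  # (text, confidence)


def ocr_all_pages(images: Sequence[bytes],
                  ocr_page: Callable[[bytes], PageResult],
                  max_workers: int = 5) -> str:
    """Run OCR on every page (not just images[0]) and join the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so pages stay in sequence
        # even when later pages finish first.
        results = list(pool.map(ocr_page, images))

    sections = []
    for number, (text, confidence) in enumerate(results, start=1):
        # Layout of the confidence line is an assumption.
        sections.append(f"--- Page {number} ---\nConfidence: {confidence:.2f}\n{text}")
    return "\n\n".join(sections)


def fake_ocr(image: bytes) -> PageResult:
    """Stand-in for a real engine call, used only for this example."""
    return image.decode().upper(), 0.9


print(ocr_all_pages([b"first page", b"second page"], fake_ocr))
```

Threads suit this workload when the per-page call spends its time in native code or network I/O. Using `map` rather than `as_completed` avoids having to re-sort pages afterwards.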

Security Fix

Critical: OCR jobs were visible to ALL users (missing user_id field).

Added a user_id field to the OCRJob model (nullable for backward compatibility)
Jobs are now filtered by user, so other users can't see your OCR results
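
A minimal sketch of the ownership filter, assuming a simplified in-memory `OCRJob` with hypothetical `job_id` and `status` fields. The real model and query layer are not shown on this page. How legacy jobs with no `user_id` are treated is also an assumption: this sketch hides them from everyone.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class OCRJob:
    """Simplified stand-in for the real OCRJob model."""
    job_id: str
    status: str
    user_id: Optional[str] = None  # nullable: jobs created before the fix have no owner


def jobs_for_user(jobs: Iterable[OCRJob], user_id: str) -> List[OCRJob]:
    """Return only the jobs owned by user_id; ownerless legacy jobs are excluded."""
    return [job for job in jobs if job.user_id is not None and job.user_id == user_id]


all_jobs = [
    OCRJob("j1", "completed", user_id="alice"),
    OCRJob("j2", "completed", user_id="bob"),
    OCRJob("j3", "completed"),  # legacy job without an owner
]
print([job.job_id for job in jobs_for_user(all_jobs, "alice")])
```

Excluding ownerless jobs by default is the conservative choice. Showing them to every user would reintroduce the original leak for older records.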

Status Display Issue

Jobs showed "completed" immediately and never showed "running". This was a race condition: a job completes in under 1s, but the UI polled only every 5s, so the "running" state was never observed. Fix: optimistic UI update, plus 1s polling while processing is active.
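
The two parts of that fix can be sketched as pure functions. This is written in Python for consistency with the other examples, although the real logic lives in the UI. The function names and the "pending" status are assumptions; only "running", "completed", and the 1s and 5s intervals come from the description above.

```python
from typing import Iterable, Optional

ACTIVE_POLL_SECONDS = 1.0  # while any job is still processing
IDLE_POLL_SECONDS = 5.0    # original interval, kept when nothing is active
ACTIVE_STATUSES = {"pending", "running"}  # "pending" is an assumed status name


def poll_interval(statuses: Iterable[str]) -> float:
    """Poll faster while at least one job is still being processed."""
    if any(status in ACTIVE_STATUSES for status in statuses):
        return ACTIVE_POLL_SECONDS
    return IDLE_POLL_SECONDS


def displayed_status(server_status: Optional[str]) -> str:
    """Optimistic update: show 'running' right after upload, before the first poll."""
    return server_status if server_status is not None else "running"


print(displayed_status(None), poll_interval(["running"]))
print(displayed_status("completed"), poll_interval(["completed"]))
```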
