OCR Pipeline
OCR Architecture
6 OCR engines with intelligent routing, multi-page PDF processing, and normalizer pipeline
Engines
| Engine | Speed | Accuracy | Best For | Cost |
|---|---|---|---|---|
| PaddleOCR | Fast | Good | General images, small PDFs | Free |
| DocTR | Medium | Excellent | PDFs with tables, structured docs | Free |
| Tesseract | Fast | Good | Simple text, fallback | Free |
| Google Vision | Slow | Excellent | Complex docs, large files | Paid API |
| Omniparser | Medium | Good | Available, uses Tesseract fallback | Free |
| CSV Loader | Very Fast | Perfect | CSV files (text parsing, not OCR) | Free |
Note: EasyOCR is installed but not actively used (dependency conflicts).
Data Flow
File Upload (PDF/PNG/JPEG/CSV)
↓
OCR Router (selects engine based on file type, size, content)
↓
Engine processes file → returns text + confidence
↓
Normalizer → SynthralDocument
↓
Output: PDF/DOCX/PPTX/RAG/Agents/UI
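The routing step above can be sketched as a small decision function. This is a minimal illustration, not the actual router: the function name `select_engine`, the `has_tables` hint, the engine identifiers, and the 10 MB threshold are all assumptions made for the example. Only the engine-to-use-case mapping comes from the Engines table.

```python
from pathlib import Path

# Hypothetical size threshold; the real cut-off is not documented here.
LARGE_FILE_BYTES = 10 * 1024 * 1024


def select_engine(filename: str, size_bytes: int, has_tables: bool = False) -> str:
    """Pick an engine name from file type, size, and a content hint."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".csv":
        return "csv_loader"      # text parsing, no OCR needed
    if size_bytes > LARGE_FILE_BYTES:
        return "google_vision"   # listed for complex docs and large files
    if suffix == ".pdf" and has_tables:
        return "doctr"           # listed for PDFs with tables
    if suffix in {".pdf", ".png", ".jpg", ".jpeg"}:
        return "paddleocr"       # listed for general images and small PDFs
    return "tesseract"           # simple-text fallback
```

Checking CSV first keeps text files out of the OCR path entirely, and checking size before content means a large file goes to the paid API regardless of layout. A real router may weigh these differently.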
API Endpoints
- `POST /api/v1/ocr/upload`: Upload file for async OCR processing
- `POST /api/v1/ocr/extract-sync`: Synchronous extraction
- `GET /api/v1/ocr/jobs`: List jobs (user-filtered)
Multi-Page PDF Processing
Bug found and fixed: Multi-page PDFs silently discarded pages 2-10 (only processed page 1).
- Root cause: `_prepare_image_data()` only saved `images[0]`
- Fix: ThreadPoolExecutor with 5 workers (3x speedup)
- Format: "--- Page N ---" separators with confidence per page
- Performance: 10 pages in 6-10s (vs 20-30s sequential)
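The fix described above can be sketched as follows. This is an illustration under stated assumptions: `ocr_all_pages` and the `ocr_page` callable are hypothetical names, and the exact placement of the confidence value next to the separator is a guess. Only the 5-worker pool and the "--- Page N ---" separator come from the notes above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple

PageResult = Tuple[str, float]  # (text, confidence)


def ocr_all_pages(
    images: Sequence[object],
    ocr_page: Callable[[object], PageResult],  # hypothetical per-page OCR call
    max_workers: int = 5,
) -> str:
    """OCR every page in parallel and join the results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so page order is kept
        # even when pages finish out of order.
        results = list(pool.map(ocr_page, images))

    parts = []
    for number, (text, confidence) in enumerate(results, start=1):
        # Assumed layout: confidence shown on the separator line.
        parts.append(f"--- Page {number} --- (confidence: {confidence:.2f})\n{text}")
    return "\n\n".join(parts)
```

Iterating over all of `images` rather than indexing `images[0]` is the part that addresses the original bug; the thread pool only affects speed.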
Security Fix
Critical: OCR jobs were visible to ALL users (missing user_id field).
- Added `user_id` field to OCRJob model (nullable for backward compat)
- Jobs now filtered by user; other users can't see your OCR results
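A minimal sketch of the per-user filter, assuming a simplified in-memory `OCRJob`. The real model's other fields, its storage layer, and the helper name `jobs_for_user` are not specified here and are hypothetical. How legacy jobs with a null `user_id` are treated is also an assumption: this sketch hides them from every user.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OCRJob:
    job_id: str
    status: str
    user_id: Optional[str] = None  # nullable: jobs created before the fix


def jobs_for_user(jobs: List[OCRJob], user_id: str) -> List[OCRJob]:
    """Return only the jobs owned by user_id.

    Assumption: jobs without an owner (user_id is None) are not shown
    to anyone rather than shown to everyone.
    """
    return [job for job in jobs if job.user_id is not None and job.user_id == user_id]
```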
Status Display Issue
Jobs show "completed" immediately and never show "running". This is a race condition: the job completes in under 1s, but the UI polls only every 5s, so the "running" state is never observed. Fix: optimistic UI update, plus 1s polling while processing is active.
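The polling logic can be sketched like this. It is written in Python for consistency with the rest of the page, although the real UI code is likely in another language. The status names and the function names `next_poll_interval` and `display_status` are assumptions; only the 1s and 5s intervals come from the text.

```python
from typing import Iterable, Optional

ACTIVE_POLL_SECONDS = 1.0
IDLE_POLL_SECONDS = 5.0
ACTIVE_STATUSES = {"pending", "running"}  # hypothetical status names


def next_poll_interval(statuses: Iterable[str]) -> float:
    """Poll every 1s while any job is still processing, otherwise every 5s."""
    if any(status in ACTIVE_STATUSES for status in statuses):
        return ACTIVE_POLL_SECONDS
    return IDLE_POLL_SECONDS


def display_status(server_status: Optional[str]) -> str:
    """Optimistic update: show "running" until the server reports a status."""
    return "running" if server_status is None else server_status
```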