OCR Engine Selection

Routing logic for selecting the right OCR engine based on file type, file size, and content (text layer, tables).

Selection Logic

PDF Files

  • Has text layer (textual PDF): Extract text directly — no OCR needed
  • Scanned PDF with tables: Use DocTR
  • Scanned PDF, no tables, >5MB: Use Google Vision
  • Scanned PDF, no tables, <5MB: Use PaddleOCR

Image Files (PNG, JPEG)

  • >10MB: Use Google Vision (cloud handles large images)
  • <10MB: Use PaddleOCR (fast default)
  • Simple text fallback: Use Tesseract

CSV Files

  • Use CSV Loader (deterministic parsing, confidence = 1.0)
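The routing rules above can be sketched as a single function. This is a minimal illustration, not the pipeline's actual code: the `FileInfo` type, the `select_engine` name, and every engine identifier other than "paddleocr" are hypothetical, and the definition of a megabyte is an assumption. The rules do not say what happens at exactly 5 MB or 10 MB, so the sketch sends those files to PaddleOCR. The Tesseract fallback is left out because the rules do not state when it triggers.

```python
from dataclasses import dataclass

# Assumption: the rules do not say whether "MB" means 10**6 or 2**20 bytes.
MB = 1024 * 1024


@dataclass
class FileInfo:
    """Hypothetical description of an uploaded file."""
    kind: str                     # "pdf", "image", or "csv"
    size_bytes: int
    has_text_layer: bool = False  # only meaningful for PDFs
    has_tables: bool = False      # only meaningful for scanned PDFs


def select_engine(info: FileInfo) -> str:
    """Return an engine identifier following the selection rules."""
    if info.kind == "csv":
        return "csv_loader"       # deterministic parsing, no OCR
    if info.kind == "pdf":
        if info.has_text_layer:
            return "direct_text"  # textual PDF: extract text, no OCR
        if info.has_tables:
            return "doctr"
        # Assumption: a file of exactly 5 MB falls into the smaller bucket.
        return "google_vision" if info.size_bytes > 5 * MB else "paddleocr"
    if info.kind == "image":
        # Assumption: a file of exactly 10 MB falls into the smaller bucket.
        return "google_vision" if info.size_bytes > 10 * MB else "paddleocr"
    raise ValueError(f"unsupported file kind: {info.kind!r}")
```

For example, `select_engine(FileInfo("pdf", 6 * MB))` returns "google_vision", while the same scanned PDF with `has_tables=True` returns "doctr" regardless of size.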

Manual Override

The user can specify an engine explicitly in the request. If the specified engine is unavailable, the API returns HTTP 400 with a list of the available engines.
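The override check might look like the sketch below. The function name `check_override` and the body fields "engine", "error", and "available_engines" are hypothetical; only the 400 status and the presence of an available-engine list come from the description above.

```python
from typing import Optional


def check_override(requested: Optional[str], available: list[str]) -> tuple[int, dict]:
    """Validate an explicit engine choice; return (status_code, body)."""
    if requested is None:
        # No override given: the caller falls through to auto-detection.
        return 200, {"engine": None}
    if requested in available:
        return 200, {"engine": requested}
    # Unknown or unavailable engine: reject and list what can be used.
    return 400, {
        "error": f"engine '{requested}' is unavailable",
        "available_engines": sorted(available),
    }
```

Calling `check_override("tesseract", ["paddleocr", "doctr"])` yields status 400 with `available_engines` set to `["doctr", "paddleocr"]`.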

Priority Order (Auto-detect)

DocTR → EasyOCR → Google Vision → PaddleOCR → Tesseract → Omniparser
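One plausible reading of this order is that auto-detection walks the list and uses the first engine that is currently available. The sketch below assumes that reading; the lowercase identifiers and the `first_available` helper are hypothetical.

```python
from typing import Iterable, Optional

# Priority order as listed above, using hypothetical lowercase identifiers.
PRIORITY = ["doctr", "easyocr", "google_vision", "paddleocr", "tesseract", "omniparser"]


def first_available(available: Iterable[str]) -> Optional[str]:
    """Return the highest-priority engine that is available, or None."""
    available_set = set(available)
    for name in PRIORITY:
        if name in available_set:
            return name
    return None
```

With only PaddleOCR and Tesseract available, `first_available(["tesseract", "paddleocr"])` returns "paddleocr" because it appears earlier in the list.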

Response Format

{
  "text": "Extracted text content...",
  "confidence": 0.95,
  "engine_used": "paddleocr"
}
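A client could parse this response into a typed object, as in the sketch below. The `OcrResult` class and `parse_response` function are hypothetical names; the three fields are the ones shown in the response above.

```python
import json
from dataclasses import dataclass


@dataclass
class OcrResult:
    text: str
    confidence: float
    engine_used: str


def parse_response(body: str) -> OcrResult:
    """Parse the JSON response body; raises KeyError if a field is missing."""
    data = json.loads(body)
    return OcrResult(
        text=data["text"],
        confidence=float(data["confidence"]),
        engine_used=data["engine_used"],
    )
```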

Configuration

  • Google Vision requires GOOGLE_VISION_API_KEY env var
  • PaddleOCR runs on VPS at 76.13.123.120:8866
  • DocTR runs on VPS via mindee/doctr Docker image
  • Tesseract is optional and is not installed by default on Render
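A startup check for the Google Vision requirement could be as small as the sketch below. The helper name `google_vision_configured` is hypothetical; only the GOOGLE_VISION_API_KEY variable name comes from the configuration list above. An empty value is treated as not configured, which is an assumption.

```python
import os
from typing import Mapping


def google_vision_configured(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if GOOGLE_VISION_API_KEY is set to a non-empty value."""
    return bool(env.get("GOOGLE_VISION_API_KEY"))
```

Passing the environment as a parameter keeps the check easy to test without modifying the real process environment.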