
Engine Routing Logic

How the ScraperRouter selects engines based on request mode, JS requirements, and crawl depth


The ScraperRouter deterministically selects which engine handles each request. No AI guessing — pure rule-based routing.

Input Shape (Normalized ScrapeRequest)

Every request becomes this shape before routing:

  • mode: "url" | "crawl" | "source_query" | "monitor"
  • url: string (required for url/crawl/monitor)
  • pages: int | null
  • depth: int | null
  • follow_links: bool (default false)
  • js_mode: "auto" | "force_on" | "force_off"
  • extract_prompt: string | null
  • output: "html" | "text" | "json"
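
The normalized shape above can be sketched as a dataclass; the field names and defaults mirror the list, while the validation in `__post_init__` (enforcing the "url required for url/crawl/monitor" rule) is an illustrative assumption about where that check lives:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScrapeRequest:
    # Normalized request shape; fields mirror the list above.
    mode: str                       # "url" | "crawl" | "source_query" | "monitor"
    url: Optional[str] = None       # required for url/crawl/monitor
    pages: Optional[int] = None
    depth: Optional[int] = None
    follow_links: bool = False
    js_mode: str = "auto"           # "auto" | "force_on" | "force_off"
    extract_prompt: Optional[str] = None
    output: str = "html"            # "html" | "text" | "json"

    def __post_init__(self):
        # url is mandatory for every mode except source_query.
        if self.mode in ("url", "crawl", "monitor") and not self.url:
            raise ValueError(f"url is required for mode={self.mode!r}")
```
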

Step 1: Non-Negotiable Guardrails (Run First)

These don't choose engines — they STOP the request:

  • robots_disallow = true → status="blocked_robots"
  • captcha_detected = true → status="blocked_captcha"
  • auth_wall_detected = true → status="requires_auth"
  • waf_detected = true → status="blocked_waf"
  • 429_detected = true → BullMQ backoff + domain cooldown
  • 403_detected = true → STOP + escalate
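
A minimal sketch of the guardrail pass, assuming the fetch layer hands the router a dict of boolean signals. The first four status strings come from the list above; `"rate_limited"` and `"forbidden"` are assumed names, since the source specifies actions (backoff, escalate) rather than statuses for 429/403:

```python
from typing import Optional

def check_guardrails(signals: dict) -> Optional[str]:
    """Return a terminal status if any guardrail fires, else None.

    Guardrails run before engine selection and STOP the request;
    they never pick an engine.
    """
    if signals.get("robots_disallow"):
        return "blocked_robots"
    if signals.get("captcha_detected"):
        return "blocked_captcha"
    if signals.get("auth_wall_detected"):
        return "requires_auth"
    if signals.get("waf_detected"):
        return "blocked_waf"
    if signals.get("429_detected"):
        return "rate_limited"   # caller schedules BullMQ backoff + domain cooldown
    if signals.get("403_detected"):
        return "forbidden"      # STOP + escalate to an operator
    return None
```
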

Step 2: Cheap Preflight Heuristics

Lightweight probe (no heavy browser):

  • js_shell_score — empty body + heavy scripts = JS required
  • pagination_signals — rel=next, ?page=, /page/2, "Next"
  • list_density — repeated cards/rows
  • content_ratio — main text vs DOM size
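
Two of these probes can be sketched with nothing but regexes over the raw HTML; the exact scoring formula is an assumption (the source only names the signals), but it captures the "empty body + heavy scripts" idea:

```python
import re

_SCRIPT = re.compile(r"<script\b[^>]*>.*?</script>", re.S | re.I)

def js_shell_score(html: str) -> float:
    """Ratio of <script> payload to visible text (0.0–1.0).

    A near-empty body with heavy scripts scores close to 1.0,
    suggesting the page is a JS-rendered shell.
    """
    script_bytes = sum(len(m) for m in _SCRIPT.findall(html))
    text = _SCRIPT.sub("", html)
    text = re.sub(r"<[^>]+>", "", text)   # strip remaining tags
    visible = len(text.strip())
    return script_bytes / max(visible + script_bytes, 1)

def pagination_signals(html: str) -> bool:
    """True if any of rel=next, ?page=, /page/2, or a visible 'Next' link appears."""
    patterns = [r'rel=["\']next["\']', r"[?&]page=", r"/page/\d+", r">\s*Next\s*<"]
    return any(re.search(p, html, re.I) for p in patterns)
```
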

Router derives:

  • js_required: true if js_mode="force_on" OR js_shell_score > threshold
  • crawl_requested: true if mode="crawl" OR pages>1 OR depth>1 OR follow_links=true
  • discovery_needed: true if follow_links=true OR pagination_signals=true
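
The three derived flags translate directly into boolean expressions. A sketch, assuming the request arrives as a dict; the 0.7 threshold is illustrative, and treating `js_mode="force_off"` as a veto on the heuristic is an assumption (the list above only states the positive conditions):

```python
def derive_flags(req: dict, shell_score: float, has_pagination: bool,
                 shell_threshold: float = 0.7) -> dict:
    """Derive routing flags from a normalized request and preflight probe results."""
    js_required = (
        req.get("js_mode") == "force_on"
        or (req.get("js_mode") != "force_off" and shell_score > shell_threshold)
    )
    crawl_requested = (
        req.get("mode") == "crawl"
        or (req.get("pages") or 0) > 1
        or (req.get("depth") or 0) > 1
        or req.get("follow_links", False)
    )
    discovery_needed = req.get("follow_links", False) or has_pagination
    return {"js_required": js_required,
            "crawl_requested": crawl_requested,
            "discovery_needed": discovery_needed}
```
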

Step 3: Engine Selection

Source-Query Mode → JobSpy

If mode="source_query" OR source="job_board" → JobSpy

Monitor Mode → ChangeDetection

If mode="monitor" OR monitor_enabled=true → ChangeDetection.io

URL/Crawl Mode (Core Engines)

Single page fetch (no crawl):

  • js_required=true → Playwright
  • js_required=false → BeautifulSoup

Crawl requested (pages>1 or follow_links):

  • JS required + crawl → Playwright crawl loop
  • Static + discovery needed → Scrapy (link following, crawl frontier)
  • Static + enumerated pages → Crawl4AI (known URL list)
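
The whole of Step 3 collapses into one dispatch function over the mode and the derived flags. A sketch; the string return values stand in for whatever engine handles the pipeline actually wires up:

```python
def select_engine(mode: str, flags: dict) -> str:
    """Map mode + derived routing flags to an engine name (Step 3 rules)."""
    # Dedicated-mode engines come first.
    if mode == "source_query":
        return "JobSpy"
    if mode == "monitor":
        return "ChangeDetection"
    # Single page fetch (no crawl): JS decides Playwright vs BeautifulSoup.
    if not flags["crawl_requested"]:
        return "Playwright" if flags["js_required"] else "BeautifulSoup"
    # Crawl requested: JS forces the Playwright loop; otherwise discovery
    # (link following) goes to Scrapy, enumerated URL lists to Crawl4AI.
    if flags["js_required"]:
        return "Playwright crawl loop"
    return "Scrapy" if flags["discovery_needed"] else "Crawl4AI"
```
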

Step 4: Extraction Layer (ScrapeGraphAI)

ScrapeGraphAI is an extraction step, NOT a crawler. Invoked when:

  • extract_prompt != null
  • output="json" AND schema provided
  • Structured fields requested from visual content

Pipeline: Crawler → content → ScrapeGraphAI → JSON
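
The invocation test is a simple predicate over the request, applied after the crawler has produced content. A sketch (the `schema` parameter is a hypothetical stand-in for however the caller supplies a JSON schema; the visual-content trigger is omitted since its detection mechanism isn't specified):

```python
from typing import Optional

def needs_extraction(extract_prompt: Optional[str], output: str, schema) -> bool:
    """True when ScrapeGraphAI should run as a post-crawl extraction step."""
    return extract_prompt is not None or (output == "json" and schema is not None)
```
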

Step 5: Stealth Fallback (UndetectedChrome)

Only when:

  • bot_signal_detected=true AND
  • policy_mode != strict_stop

Route: Playwright → (if blocked) → UndetectedChrome. If still blocked → STOP.

Never selected automatically outside this fallback path. Never brute-force past a 403 or captcha.
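
The fallback route can be sketched as a two-stage attempt. The fetcher callables and their `(content, blocked)` return shape are assumptions; the gating condition and the hard stop mirror the rules above:

```python
from typing import Callable, Optional, Tuple

Fetcher = Callable[[], Tuple[Optional[str], bool]]  # returns (content, blocked)

def route_with_fallback(fetch_playwright: Fetcher,
                        fetch_undetected: Fetcher,
                        bot_signal_detected: bool,
                        policy_mode: str) -> Tuple[Optional[str], str]:
    """Playwright first; escalate to UndetectedChrome only when a bot signal
    was detected AND policy permits; if still blocked, stop outright."""
    content, blocked = fetch_playwright()
    if not blocked:
        return content, "ok"
    if not (bot_signal_detected and policy_mode != "strict_stop"):
        return None, "stopped"          # no stealth escalation allowed
    content, blocked = fetch_undetected()
    return (content, "ok") if not blocked else (None, "stopped")
```
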

Summary Matrix

Condition → Engine
source_query / job_board → JobSpy
monitor mode → ChangeDetection
Single page, no JS → BeautifulSoup
Single page, JS required → Playwright
Crawl, JS required → Playwright loop
Crawl, static, discovery → Scrapy
Crawl, static, enumerated → Crawl4AI
Extraction needed → ScrapeGraphAI (post-crawl)
Bot detection fallback → UndetectedChrome