
Engine Routing Logic

How the ScraperRouter selects engines based on request mode, JS requirements, and crawl depth


The ScraperRouter deterministically selects which engine handles each request. No AI guessing — pure rule-based routing.

Input Shape (Normalized ScrapeRequest)

Every request becomes this shape before routing:

  • mode: "url" | "crawl" | "source_query" | "monitor"
  • url: string (required for url/crawl/monitor)
  • pages: int | null
  • depth: int | null
  • follow_links: bool (default false)
  • js_mode: "auto" | "force_on" | "force_off"
  • extract_prompt: string | null
  • output: "html" | "text" | "json"
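
The normalized shape above can be sketched as a dataclass; the field names and defaults mirror the list, while the validation in `__post_init__` (enforcing the "url required for url/crawl/monitor" rule) is an illustrative assumption about where that check lives:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScrapeRequest:
    # Normalized request shape; fields mirror the list above.
    mode: str                       # "url" | "crawl" | "source_query" | "monitor"
    url: Optional[str] = None       # required for url/crawl/monitor
    pages: Optional[int] = None
    depth: Optional[int] = None
    follow_links: bool = False
    js_mode: str = "auto"           # "auto" | "force_on" | "force_off"
    extract_prompt: Optional[str] = None
    output: str = "html"            # "html" | "text" | "json"

    def __post_init__(self):
        # url is mandatory for every mode except source_query.
        if self.mode in ("url", "crawl", "monitor") and not self.url:
            raise ValueError(f"url is required for mode={self.mode!r}")
```
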

Step 1: Non-Negotiable Guardrails (Run First)

These don't choose engines — they STOP the request:

  • robots_disallow = true → status="blocked_robots"
  • captcha_detected = true → status="blocked_captcha"
  • auth_wall_detected = true → status="requires_auth"
  • waf_detected = true → status="blocked_waf"
  • 429_detected = true → BullMQ backoff + domain cooldown
  • 403_detected = true → STOP + escalate
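
A minimal sketch of the guardrail pass, assuming the fetch layer hands the router a dict of boolean signals. The first four status strings come from the list above; `"rate_limited"` and `"forbidden"` are assumed names, since the source specifies actions (backoff, escalate) rather than statuses for 429/403:

```python
from typing import Optional

def check_guardrails(signals: dict) -> Optional[str]:
    """Return a terminal status if any guardrail fires, else None.

    Guardrails run before engine selection and STOP the request;
    they never pick an engine.
    """
    if signals.get("robots_disallow"):
        return "blocked_robots"
    if signals.get("captcha_detected"):
        return "blocked_captcha"
    if signals.get("auth_wall_detected"):
        return "requires_auth"
    if signals.get("waf_detected"):
        return "blocked_waf"
    if signals.get("429_detected"):
        return "rate_limited"   # caller schedules BullMQ backoff + domain cooldown
    if signals.get("403_detected"):
        return "forbidden"      # STOP + escalate to an operator
    return None
```
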

Step 2: Cheap Preflight Heuristics

Lightweight probe (no heavy browser):

  • js_shell_score — empty body + heavy scripts = JS required
  • pagination_signals — rel=next, ?page=, /page/2, "Next"
  • list_density — repeated cards/rows
  • content_ratio — main text vs DOM size
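
Two of these probes can be sketched with nothing but regexes over the raw HTML; the exact scoring formula is an assumption (the source only names the signals), but it captures the "empty body + heavy scripts" idea:

```python
import re

_SCRIPT = re.compile(r"<script\b[^>]*>.*?</script>", re.S | re.I)

def js_shell_score(html: str) -> float:
    """Ratio of <script> payload to visible text (0.0–1.0).

    A near-empty body with heavy scripts scores close to 1.0,
    suggesting the page is a JS-rendered shell.
    """
    script_bytes = sum(len(m) for m in _SCRIPT.findall(html))
    text = _SCRIPT.sub("", html)
    text = re.sub(r"<[^>]+>", "", text)   # strip remaining tags
    visible = len(text.strip())
    return script_bytes / max(visible + script_bytes, 1)

def pagination_signals(html: str) -> bool:
    """True if any of rel=next, ?page=, /page/2, or a visible 'Next' link appears."""
    patterns = [r'rel=["\']next["\']', r"[?&]page=", r"/page/\d+", r">\s*Next\s*<"]
    return any(re.search(p, html, re.I) for p in patterns)
```
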

Router derives:

  • js_required: true if js_mode="force_on" OR js_shell_score > threshold
  • crawl_requested: true if mode="crawl" OR pages>1 OR depth>1 OR follow_links=true
  • discovery_needed: true if follow_links=true OR pagination_signals=true
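
The three derived flags translate directly into boolean expressions. A sketch, assuming the request arrives as a dict; the 0.7 threshold is illustrative, and treating `js_mode="force_off"` as a veto on the heuristic is an assumption (the list above only states the positive conditions):

```python
def derive_flags(req: dict, shell_score: float, has_pagination: bool,
                 shell_threshold: float = 0.7) -> dict:
    """Derive routing flags from a normalized request and preflight probe results."""
    js_required = (
        req.get("js_mode") == "force_on"
        or (req.get("js_mode") != "force_off" and shell_score > shell_threshold)
    )
    crawl_requested = (
        req.get("mode") == "crawl"
        or (req.get("pages") or 0) > 1
        or (req.get("depth") or 0) > 1
        or req.get("follow_links", False)
    )
    discovery_needed = req.get("follow_links", False) or has_pagination
    return {"js_required": js_required,
            "crawl_requested": crawl_requested,
            "discovery_needed": discovery_needed}
```
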

Step 3: Engine Selection

Source-Query Mode → JobSpy

If mode="source_query" OR source="job_board" → JobSpy

Monitor Mode → ChangeDetection

If mode="monitor" OR monitor_enabled=true → ChangeDetection.io

URL/Crawl Mode (Core Engines)

Single page fetch (no crawl):

  • js_required=true → Playwright
  • js_required=false → BeautifulSoup

Crawl requested (pages>1 or follow_links):

  • JS required + crawl → Playwright crawl loop
  • Static + discovery needed → Scrapy (link following, crawl frontier)
  • Static + enumerated pages → Crawl4AI (known URL list)
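
The whole of Step 3 collapses into one dispatch function over the mode and the derived flags. A sketch; the string return values stand in for whatever engine handles the pipeline actually wires up:

```python
def select_engine(mode: str, flags: dict) -> str:
    """Map mode + derived routing flags to an engine name (Step 3 rules)."""
    # Dedicated-mode engines come first.
    if mode == "source_query":
        return "JobSpy"
    if mode == "monitor":
        return "ChangeDetection"
    # Single page fetch (no crawl): JS decides Playwright vs BeautifulSoup.
    if not flags["crawl_requested"]:
        return "Playwright" if flags["js_required"] else "BeautifulSoup"
    # Crawl requested: JS forces the Playwright loop; otherwise discovery
    # (link following) goes to Scrapy, enumerated URL lists to Crawl4AI.
    if flags["js_required"]:
        return "Playwright crawl loop"
    return "Scrapy" if flags["discovery_needed"] else "Crawl4AI"
```
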

Step 4: Extraction Layer (ScrapeGraphAI)

ScrapeGraphAI is an extraction step, NOT a crawler. Invoked when:

  • extract_prompt != null
  • output="json" AND schema provided
  • Structured fields requested from visual content

Pipeline: Crawler → content → ScrapeGraphAI → JSON
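
The invocation test is a simple predicate over the request, applied after the crawler has produced content. A sketch (the `schema` parameter is a hypothetical stand-in for however the caller supplies a JSON schema; the visual-content trigger is omitted since its detection mechanism isn't specified):

```python
from typing import Optional

def needs_extraction(extract_prompt: Optional[str], output: str, schema) -> bool:
    """True when ScrapeGraphAI should run as a post-crawl extraction step."""
    return extract_prompt is not None or (output == "json" and schema is not None)
```
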

Step 5: Stealth Fallback (UndetectedChrome)

Only when:

  • bot_signal_detected=true AND
  • policy_mode != strict_stop

Route: Playwright → (if blocked) → UndetectedChrome. If still blocked → STOP.

Never selected automatically outside this fallback path. Never brute-force past a 403 or captcha.
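
The fallback route can be sketched as a two-stage attempt. The fetcher callables and their `(content, blocked)` return shape are assumptions; the gating condition and the hard stop mirror the rules above:

```python
from typing import Callable, Optional, Tuple

Fetcher = Callable[[], Tuple[Optional[str], bool]]  # returns (content, blocked)

def route_with_fallback(fetch_playwright: Fetcher,
                        fetch_undetected: Fetcher,
                        bot_signal_detected: bool,
                        policy_mode: str) -> Tuple[Optional[str], str]:
    """Playwright first; escalate to UndetectedChrome only when a bot signal
    was detected AND policy permits; if still blocked, stop outright."""
    content, blocked = fetch_playwright()
    if not blocked:
        return content, "ok"
    if not (bot_signal_detected and policy_mode != "strict_stop"):
        return None, "stopped"          # no stealth escalation allowed
    content, blocked = fetch_undetected()
    return (content, "ok") if not blocked else (None, "stopped")
```
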

Summary Matrix

Condition → Engine
source_query / job_board → JobSpy
monitor mode → ChangeDetection
Single page, no JS → BeautifulSoup
Single page, JS required → Playwright
Crawl, JS required → Playwright loop
Crawl, static, discovery → Scrapy
Crawl, static, enumerated → Crawl4AI
Extraction needed → ScrapeGraphAI (post-crawl)
Bot detection fallback → UndetectedChrome