Engine Routing Logic
How the ScraperRouter selects engines based on request mode, JS requirements, and crawl depth
The ScraperRouter deterministically selects which engine handles each request. No AI guessing — pure rule-based routing.
Input Shape (Normalized ScrapeRequest)
Every request becomes this shape before routing:
- mode: "url" | "crawl" | "source_query" | "monitor"
- url: string (required for url/crawl/monitor)
- pages: int | null
- depth: int | null
- follow_links: bool (default false)
- js_mode: "auto" | "force_on" | "force_off"
- extract_prompt: string | null
- output: "html" | "text" | "json"
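Under these rules, the normalized request can be sketched as a dataclass (field names mirror the list above; enforcing the url requirement in `__post_init__` is an assumption about where normalization-time validation lives):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScrapeRequest:
    """Normalized request shape; every field mirrors the list above."""
    mode: str                        # "url" | "crawl" | "source_query" | "monitor"
    url: Optional[str] = None        # required for url/crawl/monitor
    pages: Optional[int] = None
    depth: Optional[int] = None
    follow_links: bool = False
    js_mode: str = "auto"            # "auto" | "force_on" | "force_off"
    extract_prompt: Optional[str] = None
    output: str = "html"             # "html" | "text" | "json"

    def __post_init__(self):
        # Assumption: the url requirement is enforced at normalization time.
        if self.mode in ("url", "crawl", "monitor") and not self.url:
            raise ValueError(f"url is required for mode={self.mode!r}")
```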
Step 1: Non-Negotiable Guardrails (Run First)
These don't choose engines — they STOP the request:
- robots_disallow = true → status="blocked_robots"
- captcha_detected = true → status="blocked_captcha"
- auth_wall_detected = true → status="requires_auth"
- waf_detected = true → status="blocked_waf"
- 429_detected = true → BullMQ backoff + domain cooldown
- 403_detected = true → STOP + escalate
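A minimal sketch of running these guardrails before any engine is chosen. The dict-of-flags input and the return values for the 429/403 branches are assumptions; the source only specifies the four hard-block statuses:

```python
from typing import Optional

# Statuses for the four hard-block signals come from the list above.
GUARDRAIL_STATUSES = {
    "robots_disallow": "blocked_robots",
    "captcha_detected": "blocked_captcha",
    "auth_wall_detected": "requires_auth",
    "waf_detected": "blocked_waf",
}

def check_guardrails(signals: dict) -> Optional[str]:
    """Return a terminal status if any guardrail fires, else None (proceed)."""
    for signal, status in GUARDRAIL_STATUSES.items():
        if signals.get(signal):
            return status
    if signals.get("429_detected"):
        return "rate_limited"   # hypothetical status: BullMQ backoff + domain cooldown
    if signals.get("403_detected"):
        return "blocked_403"    # hypothetical status: STOP + escalate
    return None
```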
Step 2: Cheap Preflight Heuristics
Lightweight probe (no heavy browser):
- js_shell_score — empty body + heavy scripts = JS required
- pagination_signals — rel=next, ?page=, /page/2, "Next"
- list_density — repeated cards/rows
- content_ratio — main text vs DOM size
Router derives:
- js_required: true if js_mode="force_on" OR js_shell_score > threshold
- crawl_requested: true if mode="crawl" OR pages>1 OR depth>1 OR follow_links=true
- discovery_needed: true if follow_links=true OR pagination_signals=true
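The derivation above can be sketched as a pure function. The 0.5 threshold and the rule that js_mode="force_off" suppresses the heuristic are assumptions; the source states only the listed conditions:

```python
def derive_flags(req: dict, probe: dict, js_threshold: float = 0.5) -> dict:
    """Derive routing flags from a normalized request and preflight probe."""
    js_required = (
        req.get("js_mode") == "force_on"
        # Assumption: force_off vetoes the js_shell_score heuristic.
        or (req.get("js_mode") != "force_off"
            and probe.get("js_shell_score", 0.0) > js_threshold)
    )
    crawl_requested = (
        req.get("mode") == "crawl"
        or (req.get("pages") or 0) > 1
        or (req.get("depth") or 0) > 1
        or req.get("follow_links", False)
    )
    discovery_needed = bool(
        req.get("follow_links", False) or probe.get("pagination_signals", False)
    )
    return {
        "js_required": js_required,
        "crawl_requested": crawl_requested,
        "discovery_needed": discovery_needed,
    }
```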
Step 3: Engine Selection
Source-Query Mode → JobSpy
If mode="source_query" OR source="job_board" → JobSpy
Monitor Mode → ChangeDetection
If mode="monitor" OR monitor_enabled=true → ChangeDetection.io
URL/Crawl Mode (Core Engines)
Single page fetch (no crawl):
- js_required=true → Playwright
- js_required=false → BeautifulSoup
Crawl requested (pages>1 or follow_links):
- JS required + crawl → Playwright crawl loop
- Static + discovery needed → Scrapy (link following, crawl frontier)
- Static + enumerated pages → Crawl4AI (known URL list)
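Putting Step 3 together, a hypothetical routing function over the flags derived in Step 2 (priority order and engine names follow the rules above; the returned strings are illustrative):

```python
from typing import Optional

def select_engine(mode: str, flags: dict, source: Optional[str] = None) -> str:
    """Pure rule-based engine selection, checked in priority order."""
    if mode == "source_query" or source == "job_board":
        return "JobSpy"
    if mode == "monitor":
        return "ChangeDetection"
    # URL/crawl modes: core engines.
    if not flags["crawl_requested"]:
        return "Playwright" if flags["js_required"] else "BeautifulSoup"
    if flags["js_required"]:
        return "Playwright crawl loop"
    # Static crawl: Scrapy follows discovered links; Crawl4AI walks a known URL list.
    return "Scrapy" if flags["discovery_needed"] else "Crawl4AI"
```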
Step 4: Extraction Layer (ScrapeGraphAI)
ScrapeGraphAI is an extraction step, NOT a crawler. Invoked when:
- extract_prompt != null
- output="json" AND schema provided
- Structured fields requested from visual content
Pipeline: Crawler → content → ScrapeGraphAI → JSON
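The trigger check and pipeline could look like this; `crawler_fetch` and `scrapegraph_extract` are hypothetical placeholders for the selected engine and the ScrapeGraphAI call:

```python
from typing import Callable, Optional

def needs_extraction(extract_prompt: Optional[str], output: str,
                     schema: Optional[dict]) -> bool:
    """Invoke ScrapeGraphAI only when structured extraction is requested."""
    return extract_prompt is not None or (output == "json" and schema is not None)

def run_pipeline(url: str, crawler_fetch: Callable, scrapegraph_extract: Callable,
                 extract_prompt=None, output="html", schema=None):
    """Crawler fetches content first; ScrapeGraphAI is a post-crawl step."""
    content = crawler_fetch(url)
    if needs_extraction(extract_prompt, output, schema):
        return scrapegraph_extract(content, extract_prompt, schema)
    return content
```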
Step 5: Stealth Fallback (UndetectedChrome)
Only when:
bot_signal_detected=true AND policy_mode != strict_stop
Route: Playwright → (if blocked) → UndetectedChrome. If still blocked → STOP.
UndetectedChrome is never invoked by default, and the router never brute-forces a 403 or captcha with it.
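A sketch of the fallback path, assuming a `fetch(engine, url)` callable that reports "bot_detected" when an engine is blocked (the callable and its return strings are hypothetical):

```python
from typing import Callable

def fetch_with_stealth_fallback(url: str, policy_mode: str,
                                fetch: Callable[[str, str], str]) -> str:
    """Escalate Playwright -> UndetectedChrome only on bot detection,
    and only when policy allows. Never retries a 403 or captcha."""
    result = fetch("Playwright", url)
    if result != "bot_detected":
        return result
    if policy_mode == "strict_stop":
        return "stopped"            # policy forbids stealth escalation
    result = fetch("UndetectedChrome", url)
    return result if result != "bot_detected" else "stopped"
```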
Summary Matrix
| Condition | Engine |
|---|---|
| source_query / job_board | JobSpy |
| monitor mode | ChangeDetection |
| Single page, no JS | BeautifulSoup |
| Single page, JS required | Playwright |
| Crawl, JS required | Playwright loop |
| Crawl, static, discovery | Scrapy |
| Crawl, static, enumerated | Crawl4AI |
| Extraction needed | ScrapeGraphAI (post-crawl) |
| Bot detection fallback | UndetectedChrome |