Scraping Infrastructure

10-engine scraping architecture with 2 VPS servers and residential proxy pool

Overall Compliance: ~90% (against SYC Web Scraping Guide)

Engine Arsenal (10 Configured + 2 Available + 2 Redundant)

Active Engines

#	Engine	Type	Use Case	Endpoint (VPS host:port)
1	Crawl4AI	AI-powered browser	Deep crawl, infinite scroll, AI extraction	`76.13.123.120:8896`
2	Scrapy	Spider framework	Multi-page crawling with pipelines	`76.13.123.120:8892`
3	Playwright	Browser automation	Modern headless browser (Level 3-4)	`76.13.123.120:8890`
4	BeautifulSoup	HTTP parser	Simple HTML parsing (Level 1-2)	`76.13.123.120:8894`
5	JobSpy	Domain-specific	Job board scraping (Indeed, LinkedIn, etc.)	`76.13.123.120:8889`
6	ScrapeGraph	AI-powered	Graph-based AI extraction	`76.13.123.120:8891`
7	UndetectedChrome	Stealth browser	Undetected Chrome for anti-bot (Level 4)	`76.13.123.120:8893`
8	WaterCrawl	Browser automation	Selenium-based (Level 3)	`31.97.43.159:9005`
9	ChangeDetection	Monitoring	Change detection and screenshots	`31.97.43.159:5000`
10	CSVLoader	Data loader	CSV/structured data import	Local (no VPS)
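
The table above can be carried in code as a small registry. The following is a minimal sketch, not the project's actual configuration: the `Engine` class, the `ENGINES` dict, the `engines_on_host` helper, and the `http` scheme are all assumptions made for illustration. Only the engine names, types, and endpoints come from the table.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Engine:
    """One row of the Active Engines table (hypothetical structure)."""
    name: str
    kind: str
    endpoint: Optional[str]  # "host:port" from the table; None for local engines

    def base_url(self, scheme: str = "http") -> Optional[str]:
        # The scheme is an assumption; the table lists only host and port.
        return f"{scheme}://{self.endpoint}" if self.endpoint else None


ENGINES: Dict[str, Engine] = {
    e.name: e
    for e in [
        Engine("Crawl4AI", "AI-powered browser", "76.13.123.120:8896"),
        Engine("Scrapy", "Spider framework", "76.13.123.120:8892"),
        Engine("Playwright", "Browser automation", "76.13.123.120:8890"),
        Engine("BeautifulSoup", "HTTP parser", "76.13.123.120:8894"),
        Engine("JobSpy", "Domain-specific", "76.13.123.120:8889"),
        Engine("ScrapeGraph", "AI-powered", "76.13.123.120:8891"),
        Engine("UndetectedChrome", "Stealth browser", "76.13.123.120:8893"),
        Engine("WaterCrawl", "Browser automation", "31.97.43.159:9005"),
        Engine("ChangeDetection", "Monitoring", "31.97.43.159:5000"),
        Engine("CSVLoader", "Data loader", None),  # local, no VPS
    ]
}


def engines_on_host(host: str) -> List[str]:
    """Return the names of engines served from the given VPS host, sorted."""
    return sorted(
        name
        for name, engine in ENGINES.items()
        if engine.endpoint and engine.endpoint.split(":")[0] == host
    )
```

Keeping the endpoints in one structure like this makes it straightforward to see which engines share a VPS, for example `engines_on_host("31.97.43.159")`.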

Available But Not Configured

Engine	Missing Config	Value if Configured
ScraperAPI	`SCRAPERAPI_API_KEY`	High (CAPTCHA solving)
Browserbase	`BROWSERBASE_API_KEY`	Medium (managed browser)

Not Needed (Redundant)

Puppeteer: Redundant. Playwright already covers headless browser automation.
Cloudscraper: Redundant. UndetectedChrome handles Cloudflare-protected sites.

Protection Level Coverage

Level	Protection Type	Coverage	Tools Used
Level 1	None / Basic	Excellent	BeautifulSoup, httpx
Level 2	Rate Limiting	Excellent	All engines + proxy rotation
Level 3	Browser Fingerprinting	Good	Crawl4AI (Playwright), WaterCrawl
Level 4	Advanced Bot Detection	Partial	UndetectedChrome, Playwright (limited: no CAPTCHA solving, no mobile proxies)

Architecture Flow

Frontend UI
  ↓
Backend API (ScraperRouter)
  ↓
[Auto-detect engine based on requirements]
  ↓
Engine (Crawl4AI / JobSpy / WaterCrawl / BeautifulSoup)
  ↓
Proxy Pool (Webshare residential)
  ↓
Target Site
  ↓
Normalized Data → Database
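
The auto-detect step in the flow above could look roughly like the sketch below. The source only states that the router picks an engine "based on requirements", so the `ScrapeRequest` fields, the `select_engine` function, and the order of the rules are assumptions for illustration, not the actual `ScraperRouter` logic. The engine names and protection levels come from the tables on this page.

```python
from dataclasses import dataclass


@dataclass
class ScrapeRequest:
    """Hypothetical description of what a scrape job needs."""
    url: str
    protection_level: int = 1        # 1-4, as in the Protection Level Coverage table
    is_job_board: bool = False       # Indeed, LinkedIn jobs, etc.
    needs_ai_extraction: bool = False


def select_engine(req: ScrapeRequest) -> str:
    """Pick an engine name using simple, assumed priority rules."""
    if not 1 <= req.protection_level <= 4:
        raise ValueError("protection_level must be between 1 and 4")
    if req.is_job_board:
        return "JobSpy"              # domain-specific engine wins first
    if req.protection_level == 4:
        return "UndetectedChrome"    # stealth browser for advanced bot detection
    if req.needs_ai_extraction:
        return "Crawl4AI"            # deep crawl and AI extraction
    if req.protection_level == 3:
        return "WaterCrawl"          # browser automation for fingerprinting checks
    return "BeautifulSoup"           # plain HTTP parsing for Level 1-2 sites
```

Putting the cheapest engine last means a full browser is only started when a requirement actually calls for one.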

Proxy Infrastructure (Webshare.io)

Metric	Implementation
Type	Residential (rotating)
Cost	~$7-10/GB
Rotation	Per-request
Countries	US, CA, GB only
Pool Size	500 proxies max
Gateway	`p.webshare.io`
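
As an illustration of how the gateway settings above might be turned into client configuration, here is a minimal sketch using only the standard library. The port (`80`), the function names, and the credential handling are assumptions and should be checked against the Webshare dashboard. Only the gateway host and the country list come from the table.

```python
from typing import Dict
from urllib.parse import quote

WEBSHARE_GATEWAY = "p.webshare.io"       # gateway host from the table above
ASSUMED_PORT = 80                        # assumption: confirm in the Webshare dashboard
ALLOWED_COUNTRIES = {"US", "CA", "GB"}   # countries listed in the table above


def build_proxy_url(username: str, password: str,
                    host: str = WEBSHARE_GATEWAY,
                    port: int = ASSUMED_PORT) -> str:
    """Build an authenticated proxy URL, percent-encoding the credentials."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"http://{user}:{pwd}@{host}:{port}"


def proxies_for(username: str, password: str) -> Dict[str, str]:
    """Return a scheme-to-proxy mapping in the form many HTTP clients accept."""
    url = build_proxy_url(username, password)
    return {"http": url, "https": url}


def is_country_supported(code: str) -> bool:
    """Check a two-letter country code against the configured pool."""
    return code.upper() in ALLOWED_COUNTRIES
```

With per-request rotation handled at the gateway, the same proxy URL can be reused for every request. Credentials are best read from environment variables rather than hard-coded.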

Cost Analysis

Scenario	Cost
1,000 job listings (JobSpy, text-heavy)	$0.40
Deep crawl of 2,459 LinkedIn profiles (Crawl4AI)	$1.60
Cost per profile (worst case)	~$0.00065 ($1.60 / 2,459)
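
The arithmetic behind these figures can be reproduced with a short sketch. The function names are hypothetical, and the traffic estimate assumes bandwidth is the only billed quantity and that 1 GB is counted as 1,000 MB. The prices and totals come from the tables above.

```python
def cost_per_item(total_cost_usd: float, items: int) -> float:
    """Average cost of one scraped item."""
    if items <= 0:
        raise ValueError("items must be positive")
    return total_cost_usd / items


def implied_traffic_mb(total_cost_usd: float, price_per_gb_usd: float) -> float:
    """Traffic (in MB) that a given spend buys at a given per-GB price."""
    if price_per_gb_usd <= 0:
        raise ValueError("price_per_gb_usd must be positive")
    return total_cost_usd / price_per_gb_usd * 1000  # assumes 1 GB = 1,000 MB


# Figures from the Cost Analysis table
per_profile = cost_per_item(1.60, 2459)   # LinkedIn deep crawl
per_listing = cost_per_item(0.40, 1000)   # JobSpy job listings

# Traffic implied by $1.60 at the two ends of the ~$7-10/GB range
traffic_at_10 = implied_traffic_mb(1.60, 10.0)
traffic_at_7 = implied_traffic_mb(1.60, 7.0)
```

Here `per_profile` comes out at roughly $0.00065 and `per_listing` at $0.0004. At $10/GB, the $1.60 crawl corresponds to about 160 MB of proxy traffic.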

Gaps

No CAPTCHA solving service (blocks Level 4 sites)
No mobile proxy tier (limits highest-protection targets)
No automated protection level detection
No cost tracking dashboard
