Scraping Infrastructure

A 10-engine scraping architecture spread across two VPS hosts, with a residential proxy pool.

Overall Compliance: ~90% (against SYC Web Scraping Guide)

Engine Arsenal (10 Configured + 4 Available)

Active Engines

| # | Engine | Type | Use Case | VPS/Config |
|---|--------|------|----------|------------|
| 1 | Crawl4AI | AI-powered browser | Deep crawl, infinite scroll, AI extraction | 76.13.123.120:8896 |
| 2 | Scrapy | Spider framework | Multi-page crawling with pipelines | 76.13.123.120:8892 |
| 3 | Playwright | Browser automation | Modern headless browser (Level 3-4) | 76.13.123.120:8890 |
| 4 | BeautifulSoup | HTTP parser | Simple HTML parsing (Level 1-2) | 76.13.123.120:8894 |
| 5 | JobSpy | Domain-specific | Job board scraping (Indeed, LinkedIn, etc.) | 76.13.123.120:8889 |
| 6 | ScrapeGraph | AI-powered | Graph-based AI extraction | 76.13.123.120:8891 |
| 7 | UndetectedChrome | Stealth browser | Undetected Chrome for anti-bot (Level 4) | 76.13.123.120:8893 |
| 8 | WaterCrawl | Browser automation | Selenium-based (Level 3) | 31.97.43.159:9005 |
| 9 | ChangeDetection | Monitoring | Change detection and screenshots | 31.97.43.159:5000 |
| 10 | CSVLoader | Data loader | CSV/structured data import | Local (no VPS) |

Available But Not Configured

| Engine | Missing Config | Value |
|--------|----------------|-------|
| ScraperAPI | SCRAPERAPI_API_KEY | High (CAPTCHA solving) |
| Browserbase | BROWSERBASE_API_KEY | Medium (managed browser) |

Not Needed (Redundant)

  • Puppeteer: Playwright already covers the same browser-automation use cases
  • Cloudscraper: UndetectedChrome already handles Cloudflare-protected sites

Protection Level Coverage

| Level | Protection Type | Coverage | Tools Used |
|-------|-----------------|----------|------------|
| Level 1 | None / Basic | Excellent | BeautifulSoup, httpx |
| Level 2 | Rate Limiting | Excellent | All engines + proxy rotation |
| Level 3 | Browser Fingerprinting | Good | Crawl4AI (Playwright), WaterCrawl |
| Level 4 | Advanced Bot Detection | Partial | Limited (no CAPTCHA, no mobile proxies) |

Architecture Flow

Frontend UI
  ↓
Backend API (ScraperRouter)
  ↓
[Auto-detect engine based on requirements]
  ↓
Engine (Crawl4AI / JobSpy / WaterCrawl / BeautifulSoup)
  ↓
Proxy Pool (Webshare residential)
  ↓
Target Site
  ↓
Normalized Data → Database
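
The auto-detect step in the flow above could look roughly like the following sketch. The page does not document the actual rules inside ScraperRouter, so the `ScrapeRequest` fields, the `select_engine` function, and the rule order are all hypothetical; only the four engine names come from the flow itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapeRequest:
    """Hypothetical request shape; field names are illustrative only."""
    url: str
    protection_level: int = 1      # 1-4, as in the Protection Level Coverage table
    is_job_board: bool = False     # e.g. Indeed or LinkedIn job listings
    needs_deep_crawl: bool = False # infinite scroll or AI extraction


def select_engine(req: ScrapeRequest) -> str:
    """Pick an engine name for a request (assumed rules, not the real router)."""
    if req.is_job_board:
        return "JobSpy"          # domain-specific engine wins first
    if req.needs_deep_crawl:
        return "Crawl4AI"        # deep crawl / AI extraction
    if req.protection_level >= 3:
        return "WaterCrawl"      # browser automation for fingerprinting checks
    return "BeautifulSoup"       # plain HTTP parsing for Level 1-2 sites


if __name__ == "__main__":
    print(select_engine(ScrapeRequest("https://example.com")))
    print(select_engine(ScrapeRequest("https://example.com/jobs", is_job_board=True)))
```

Checking the most specific requirement first keeps the fallback (BeautifulSoup) as the cheapest path, which suits the bandwidth-priced proxy pool described below.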

Proxy Infrastructure (Webshare.io)

| Metric | Implementation |
|--------|----------------|
| Type | Residential (rotating) |
| Cost | ~$7-10/GB |
| Rotation | Per-request |
| Countries | US, CA, GB only |
| Pool Size | 500 proxies max |
| Gateway | p.webshare.io |
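
As a minimal sketch of how an engine might point its HTTP client at the gateway, the helper below builds a proxy URL from credentials. The gateway host and country list come from the table; the function names, the credential format, and the port are assumptions (the page does not state a port, so the caller must supply one).

```python
from urllib.parse import quote

GATEWAY_HOST = "p.webshare.io"           # from the proxy table
ALLOWED_COUNTRIES = {"US", "CA", "GB"}   # from the proxy table


def build_proxy_url(username: str, password: str, port: int) -> str:
    """Return an HTTP proxy URL for the gateway (hypothetical helper).

    Credentials are percent-encoded so characters such as '@' or ':'
    cannot break the URL structure.
    """
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"http://{user}:{pwd}@{GATEWAY_HOST}:{port}"


def is_allowed_country(code: str) -> bool:
    """Check a two-letter country code against the configured pool."""
    return code.upper() in ALLOWED_COUNTRIES


if __name__ == "__main__":
    # Placeholder credentials and port, for illustration only.
    print(build_proxy_url("user", "p@ss", 80))
```

The resulting URL can be passed to a client that accepts proxy URLs, for example `requests.get(url, proxies={"http": proxy_url, "https": proxy_url})`. With per-request rotation at the gateway, the client does not need to manage the pool itself.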

Cost Analysis

| Scenario | Cost |
|----------|------|
| 1,000 job listings (JobSpy, text-heavy) | $0.40 |
| Deep crawl of 2,459 LinkedIn profiles (Crawl4AI) | $1.60 |
| Cost per profile (worst case) | $0.0006 |
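
The figures above can be cross-checked with simple arithmetic. This sketch assumes cost is driven purely by proxy bandwidth at the quoted ~$7-10/GB rate; the function names are illustrative and not part of the codebase.

```python
def cost_per_item(total_cost_usd: float, items: int) -> float:
    """Average cost per scraped item."""
    return total_cost_usd / items


def implied_gigabytes(total_cost_usd: float, price_per_gb_usd: float) -> float:
    """Traffic volume implied by a total cost at a given per-GB price."""
    return total_cost_usd / price_per_gb_usd


if __name__ == "__main__":
    # 2,459 profiles for $1.60, as listed in the table.
    print(f"per profile: ${cost_per_item(1.60, 2459):.5f}")

    # 1,000 listings for $0.40 at the high and low ends of the price range.
    low_gb = implied_gigabytes(0.40, 10.0)
    high_gb = implied_gigabytes(0.40, 7.0)
    print(f"implied traffic: {low_gb * 1000:.0f}-{high_gb * 1000:.0f} MB")
```

Dividing $1.60 by 2,459 profiles gives about $0.00065 per profile, consistent with the table's $0.0006. Under the bandwidth-only assumption, $0.40 for 1,000 listings corresponds to roughly 40 to 57 MB of traffic in total.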

Gaps

  • No CAPTCHA solving service (blocks Level 4 sites)
  • No mobile proxy tier (limits highest-protection targets)
  • No automated protection level detection
  • No cost tracking dashboard