Scraping Pipeline

Crawl4AI Engine

AI-powered deep crawling built on Playwright, with infinite-scroll handling, multi-page crawling, and screenshot support

VPS: 76.13.123.120:8896 | Version: 2.1.0

Capabilities

  • Deep crawl with BFS/DFS link following
  • Infinite scroll handling (configurable scroll_delay, max_scroll_count)
  • Multi-page crawling with seed URL extraction
  • Screenshot capture via Playwright (POST /screenshot)
  • Sitemap XML parsing with recursive index support
  • Cloudflare bypass (browser fallback for JS challenges)
  • Airtable iframe URL extraction
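
The BFS link following listed above can be illustrated with a minimal sketch. This is not the engine's code: bfs_crawl, get_links, max_depth, and max_pages are hypothetical names, and link discovery is passed in as a function so the example runs without a browser.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(
    seed: str,
    get_links: Callable[[str], Iterable[str]],  # hypothetical: returns links found on a page
    max_depth: int = 2,
    max_pages: int = 100,
) -> list[str]:
    """Visit pages breadth-first from the seed and return them in visit order."""
    order: list[str] = []
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Toy link graph standing in for real pages.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": ["e"]}
pages = bfs_crawl("a", lambda u: graph.get(u, []), max_depth=2)
```

With the toy graph, pages closer to the seed are visited first, and "e" is never reached because it sits beyond the depth limit.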

Key Fixes Applied

Screenshot Endpoint (v2.1.0)

Added POST /screenshot, which captures a fresh screenshot with Playwright at the moment of the request. Before: screenshots came from ChangeDetection.io's stale static files. After: each request returns a real-time capture.
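
A call to the endpoint might look like the sketch below. Only the POST /screenshot path and the VPS address come from this page. The http scheme, the JSON body with a "url" field, and the build_screenshot_request helper are assumptions, and the response format is not documented here, so the request is only built, not sent.

```python
import json
import urllib.request

# Address taken from this page; the http:// scheme is an assumption.
ENGINE_BASE = "http://76.13.123.120:8896"


def build_screenshot_request(page_url: str, base: str = ENGINE_BASE) -> urllib.request.Request:
    """Build a POST /screenshot request. The {"url": ...} body shape is assumed."""
    payload = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/screenshot",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_screenshot_request("https://example.com")
# To actually send it: urllib.request.urlopen(req, timeout=60).read()
```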

Sitemap Fix

  • Added .xml/.rss/.atom to URL skip filter (WordPress sitemaps were being crawled as pages)
  • Rewrote fetch_sitemap_urls() to distinguish <sitemapindex> from <urlset>
  • Added Cloudflare browser fallback for sitemap fetching
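
The <sitemapindex> versus <urlset> distinction can be sketched as follows. This is an illustration, not the actual fetch_sitemap_urls(): collect_page_urls, entry_locs, the fetch callback, and max_depth are hypothetical, and fetching is injected so the Cloudflare fallback is out of scope.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def entry_locs(root: ET.Element, entry_tag: str) -> list[str]:
    """Return the <loc> text of each direct <entry_tag> child of the root."""
    locs = []
    for entry in root:
        if local_name(entry.tag) != entry_tag:
            continue
        for child in entry:
            text = (child.text or "").strip()
            if local_name(child.tag) == "loc" and text:
                locs.append(text)
    return locs


def collect_page_urls(xml_text: str, fetch: Callable[[str], str], max_depth: int = 3) -> list[str]:
    """Return page URLs, recursing into child sitemaps when the root is an index."""
    root = ET.fromstring(xml_text)
    kind = local_name(root.tag)
    if kind == "urlset":
        return entry_locs(root, "url")  # leaf sitemap: entries are pages
    if kind == "sitemapindex" and max_depth > 0:
        pages = []
        for child_sitemap in entry_locs(root, "sitemap"):  # entries are more sitemaps
            pages.extend(collect_page_urls(fetch(child_sitemap), fetch, max_depth - 1))
        return pages
    return []  # unknown root element or depth limit reached
```

Reading only direct <loc> children of each entry keeps extension tags such as image locations out of the page list.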

Airtable Iframe Extraction

URLs from <iframe src="https://airtable.com/embed/..."> elements are now extracted and placed at the iframe's position in the markdown output, not appended at the end.
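
One way to keep the URL at the iframe's position is to swap each Airtable iframe for a markdown link before the HTML is converted. The sketch below assumes that approach. inline_airtable_iframes and the "Airtable embed" link label are hypothetical, the embed ID in the example is made up, and a regex is a simplification that a DOM-based implementation would handle more robustly.

```python
import re

# Matches an <iframe> whose src points at an Airtable embed, plus its closing tag if present.
AIRTABLE_IFRAME_RE = re.compile(
    r'<iframe\b[^>]*?\ssrc=["\'](https://airtable\.com/embed/[^"\']+)["\'][^>]*>'
    r'(?:\s*</iframe>)?',
    re.IGNORECASE,
)


def inline_airtable_iframes(html: str) -> str:
    """Replace each Airtable iframe with a markdown link at the same position."""
    return AIRTABLE_IFRAME_RE.sub(
        lambda m: f"\n\n[Airtable embed]({m.group(1)})\n\n", html
    )


html = (
    '<p>Before</p>'
    '<iframe class="x" src="https://airtable.com/embed/shrEXAMPLE" width="100%"></iframe>'
    '<p>After</p>'
)
result = inline_airtable_iframes(html)
```

Because the replacement happens in place, the link lands between "Before" and "After" in whatever markdown is generated afterwards. Iframes from other hosts are left untouched.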

Quality Criteria (NON-NEGOTIABLE)

  • ecommerceranker.com: ≥300 Airtable store URLs
  • mytablon.com/club/: ≥80 linkedin.com/in/ URLs
  • All 16 test combos (4 sites × 4 param combos): PASS
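
The linkedin.com/in/ threshold above could be checked along these lines. This is a sketch, not the project's test suite: count_linkedin_profiles and the de-duplication rule are assumptions, and the Airtable criterion would need its own matcher, which this page does not specify.

```python
from typing import Iterable
from urllib.parse import urlsplit

MIN_LINKEDIN_PROFILES = 80  # threshold from the criteria above


def count_linkedin_profiles(urls: Iterable[str]) -> int:
    """Count distinct linkedin.com/in/ URLs, ignoring 'www.', query strings, and trailing slashes."""
    profiles = set()
    for url in urls:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        on_linkedin = host == "linkedin.com" or host.endswith(".linkedin.com")
        if on_linkedin and parts.path.startswith("/in/"):
            profiles.add(host.removeprefix("www.") + parts.path.rstrip("/"))
    return len(profiles)


sample = [
    "https://www.linkedin.com/in/alice/",
    "https://linkedin.com/in/alice",           # duplicate of the first entry
    "https://www.linkedin.com/company/acme",   # not a profile URL
    "https://www.linkedin.com/in/bob?trk=x",
]
found = count_linkedin_profiles(sample)
passes = found >= MIN_LINKEDIN_PROFILES
```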

VPS Worker Config

  • Skip patterns: images, docs, media, CSS/JS, XML/RSS/Atom
  • Max pages per crawl: configurable
  • Proxy support: pass-through from backend
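
A skip filter of the kind described above might be sketched like this. Only .xml, .rss, and .atom are named on this page. The other extensions are assumed examples for each category, and should_skip is a hypothetical helper, not the worker's real configuration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

SKIP_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp",  # images (assumed list)
    ".pdf", ".doc", ".docx", ".xls", ".xlsx",          # docs (assumed list)
    ".mp3", ".mp4", ".avi", ".mov",                    # media (assumed list)
    ".css", ".js",                                     # CSS/JS
    ".xml", ".rss", ".atom",                           # feeds and sitemaps, named on this page
}


def should_skip(url: str) -> bool:
    """Return True when the URL path ends in an extension the crawler should not fetch as a page."""
    path = urlsplit(url).path  # drops the query string and fragment
    return PurePosixPath(path).suffix.lower() in SKIP_EXTENSIONS
```

Matching on the parsed path instead of the raw string means a URL such as logo.PNG?v=2 is still skipped, while extensionless pages pass through.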