Scraping Pipeline

Crawl4AI Engine

AI-powered deep crawling built on Playwright, with infinite-scroll handling, multi-page crawling, and screenshot capture

VPS: 76.13.123.120:8896 | Version: 2.1.0

Capabilities

Deep crawl with BFS/DFS link following
Infinite scroll handling (configurable scroll_delay, max_scroll_count)
Multi-page crawling with seed URL extraction
Screenshot capture via Playwright (POST /screenshot)
Sitemap XML parsing with recursive index support
Cloudflare bypass (browser fallback for JS challenges)
Airtable iframe URL extraction
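The BFS/DFS link following listed above can be sketched as a single frontier loop. This is a minimal illustration, not the engine's actual code: `deep_crawl`, `get_links`, `strategy`, and `max_pages` are hypothetical names, and `get_links` stands in for whatever fetches a page and returns its outgoing links.

```python
from collections import deque
from typing import Callable, Iterable, List


def deep_crawl(
    seed: str,
    get_links: Callable[[str], Iterable[str]],  # hypothetical: returns links found on a page
    strategy: str = "bfs",                      # "bfs" or "dfs" (assumed option names)
    max_pages: int = 100,                       # assumed cap on visited pages
) -> List[str]:
    """Return URLs in the order they are visited, starting from the seed."""
    frontier = deque([seed])
    seen = {seed}
    order: List[str] = []
    while frontier and len(order) < max_pages:
        # BFS takes the oldest queued URL, DFS takes the newest.
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return order
```

The only difference between the two strategies is which end of the frontier is consumed, so a single `deque` covers both.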

Key Fixes Applied

Screenshot Endpoint (v2.1.0)

Added POST /screenshot, which captures a fresh Playwright screenshot on demand. Before: screenshots came from ChangeDetection.io's stale static files. After: the screenshot is taken in real time, at the moment of the request.
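A call to the endpoint might be built as follows. This is only a sketch: the page documents the method and path (POST /screenshot) and the host and port, but not the request body, so the JSON `url` field and the `http` scheme are assumptions, and `build_screenshot_request` is a hypothetical helper.

```python
import json
from urllib.request import Request

# Host and port come from this page; the http scheme is an assumption.
BASE_URL = "http://76.13.123.120:8896"


def build_screenshot_request(target_url: str, base_url: str = BASE_URL) -> Request:
    """Build (but do not send) a POST /screenshot request for target_url."""
    # The "url" field name is assumed; the real payload schema is not documented here.
    payload = json.dumps({"url": target_url}).encode("utf-8")
    return Request(
        f"{base_url}/screenshot",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen(req, timeout=60)` would send it. The response format is not described on this page, so the caller should inspect it before relying on it.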

Sitemap Fix

Added .xml/.rss/.atom to the URL skip filter (WordPress sitemaps were being crawled as pages)
Rewrote fetch_sitemap_urls() to distinguish <sitemapindex> from <urlset>
Added Cloudflare browser fallback for sitemap fetching
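The <sitemapindex> versus <urlset> distinction can be illustrated with the standard library. This is a sketch under assumptions, not the actual fetch_sitemap_urls(): `parse_sitemap`, `collect_page_urls`, the `fetch` callable, and `max_depth` are hypothetical names, and the Cloudflare browser fallback is left out.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Tuple


def _local(tag: str) -> str:
    """Strip the XML namespace, e.g. '{ns}urlset' -> 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text: str) -> Tuple[str, List[str]]:
    """Return (root kind, <loc> values) for a sitemap index or a URL set."""
    root = ET.fromstring(xml_text)
    kind = _local(root.tag)
    if kind not in ("sitemapindex", "urlset"):
        raise ValueError(f"not a sitemap document: <{kind}>")
    entry = "sitemap" if kind == "sitemapindex" else "url"
    locs = []
    for item in root:
        if _local(item.tag) != entry:
            continue
        # Only direct <loc> children, so nested extension tags are ignored.
        for child in item:
            if _local(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return kind, locs


def collect_page_urls(
    xml_text: str,
    fetch: Callable[[str], str],  # hypothetical: returns the XML text of a sitemap URL
    max_depth: int = 3,           # assumed guard against runaway recursion
) -> List[str]:
    """Follow sitemap indexes recursively and return page URLs only."""
    kind, locs = parse_sitemap(xml_text)
    if kind == "urlset":
        return locs
    if max_depth <= 0:
        return []
    urls: List[str] = []
    for child_sitemap in locs:
        urls.extend(collect_page_urls(fetch(child_sitemap), fetch, max_depth - 1))
    return urls
```

Treating an index's <loc> entries as further sitemaps, rather than as pages, is what keeps child sitemap files out of the page list.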

Airtable Iframe Extraction

URLs from <iframe src="https://airtable.com/embed/..."> are now extracted and placed at the iframe's location in the markdown (not appended at the end).
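In-place positioning can be sketched with a regular expression that swaps each Airtable iframe for a markdown link at the same offset. This is a simplified illustration, not the engine's converter: `inline_airtable_iframes` and the link label are hypothetical, and a production version would more likely work on a parsed DOM than on raw HTML.

```python
import re

# Matches an <iframe> whose src points at an Airtable embed, plus an
# immediately following </iframe> when present.
AIRTABLE_IFRAME = re.compile(
    r"""<iframe\b[^>]*\bsrc=["'](https://airtable\.com/embed/[^"']+)["'][^>]*>"""
    r"""(?:\s*</iframe>)?""",
    re.IGNORECASE,
)


def inline_airtable_iframes(html: str) -> str:
    """Replace each Airtable iframe with a markdown link where the iframe stood."""
    # The "Airtable embed" label is an arbitrary choice for this sketch.
    return AIRTABLE_IFRAME.sub(lambda m: f"[Airtable embed]({m.group(1)})", html)
```

Because the substitution happens where the tag was found, the link keeps its position relative to the surrounding text instead of being collected and appended at the end.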

Quality Criteria (NON-NEGOTIABLE)

ecommerceranker.com: ≥300 Airtable store URLs
mytablon.com/club/: ≥80 linkedin.com/in/ URLs
All 16 test combos (4 sites × 4 param combos): PASS
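A threshold check of this kind could be automated roughly as below. The sketch assumes the crawl output is available as text and that a URL counts once no matter how often it appears; `count_urls_containing` and `meets_minimum` are hypothetical helpers, and the exact pattern that identifies an "Airtable store URL" is not specified here, so only the linkedin.com/in/ case is shown.

```python
import re

# Rough URL matcher that stops at whitespace and common markdown/HTML delimiters.
URL_RE = re.compile(r"https?://[^\s)\]>\"']+")


def count_urls_containing(text: str, needle: str) -> int:
    """Count distinct URLs in text that contain the given substring."""
    return len({url for url in URL_RE.findall(text) if needle in url})


def meets_minimum(text: str, needle: str, minimum: int) -> bool:
    """True when at least `minimum` distinct matching URLs are present."""
    return count_urls_containing(text, needle) >= minimum
```

For the mytablon.com/club/ criterion this would be called as `meets_minimum(output, "linkedin.com/in/", 80)`.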

VPS Worker Config

Skip patterns: images, docs, media, CSS/JS, XML/RSS/atom
Max pages per crawl: configurable
Proxy support: pass-through from backend
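The skip patterns can be pictured as an extension check on the URL path. This is a sketch under assumptions: only .xml, .rss, and .atom are named on this page, so the remaining extensions are illustrative guesses for each category, and `should_skip` is a hypothetical name.

```python
import posixpath
from urllib.parse import urlparse

# Only .xml/.rss/.atom are confirmed by this page; the rest are assumed examples.
SKIP_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp",  # images (assumed)
    ".pdf", ".doc", ".docx", ".xls", ".xlsx",          # docs (assumed)
    ".mp3", ".mp4", ".avi", ".mov",                    # media (assumed)
    ".css", ".js",                                     # CSS/JS
    ".xml", ".rss", ".atom",                           # feeds and sitemaps
}


def should_skip(url: str) -> bool:
    """True when the URL path ends in an extension the crawler should not visit."""
    path = urlparse(url).path.lower()  # ignores the query string and fragment
    return posixpath.splitext(path)[1] in SKIP_EXTENSIONS
```

Checking the parsed path rather than the raw string keeps query parameters such as `?v=2` from hiding the extension.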
