Crawl4AI Engine
AI-powered deep crawling with Playwright, infinite scroll, multi-page, and screenshot support
VPS: 76.13.123.120:8896 | Version: 2.1.0
Capabilities
- Deep crawl with BFS/DFS link following
- Infinite scroll handling (configurable scroll_delay, max_scroll_count)
- Multi-page crawling with seed URL extraction
- Screenshot capture via Playwright (POST /screenshot)
- Sitemap XML parsing with recursive index support
- Cloudflare bypass (browser fallback for JS challenges)
- Airtable iframe URL extraction
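The BFS/DFS link following listed above can be illustrated with a minimal sketch. It walks an in-memory link graph instead of fetching pages, and the function name crawl_order, the strategy argument, and the max_pages cap are assumptions made for illustration, not the engine's actual interface.

```python
from collections import deque


def crawl_order(links, seed, strategy="bfs", max_pages=100):
    """Return the visit order over a link graph (dict: url -> list of urls).

    Hypothetical helper: a real crawler would fetch each page and extract
    its links instead of reading them from a dict.
    """
    frontier = deque([seed])
    seen = set()
    order = []
    while frontier and len(order) < max_pages:
        # BFS takes the oldest queued URL, DFS takes the newest one.
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        children = links.get(url, [])
        if strategy == "dfs":
            # Reverse so the first listed link is explored first.
            children = list(reversed(children))
        for child in children:
            if child not in seen:
                frontier.append(child)
    return order


graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
bfs_order = crawl_order(graph, "a", "bfs")
dfs_order = crawl_order(graph, "a", "dfs")
```

BFS reaches every page at one link depth before going deeper, which suits broad site coverage; DFS follows one chain of links to its end first.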
Key Fixes Applied
Screenshot Endpoint (v2.1.0)
Added POST /screenshot, which captures a fresh screenshot with Playwright. Before: screenshots came from ChangeDetection.io's stale static files. After: the screenshot is taken in real time, at the moment of the request.
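A client call to this endpoint might look like the sketch below. The request schema is not documented here, so the JSON body with a "url" field, the plain http scheme, and the helper name build_screenshot_request are all assumptions. The code only builds the request and leaves sending it to the caller.

```python
import json
import urllib.request


def build_screenshot_request(base_url, page_url):
    """Build (but do not send) a POST /screenshot request.

    ASSUMPTION: the endpoint accepts a JSON body with a "url" field.
    Check the real request schema before relying on this.
    """
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/screenshot",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_screenshot_request("http://76.13.123.120:8896", "https://example.com")
# To send it: urllib.request.urlopen(req, timeout=60)
```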
Sitemap Fix
- Added .xml/.rss/.atom to URL skip filter (WordPress sitemaps were being crawled as pages)
- Rewrote fetch_sitemap_urls() to distinguish <sitemapindex> from <urlset>
- Added Cloudflare browser fallback for sitemap fetching
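The <sitemapindex> versus <urlset> distinction can be sketched as follows. This is not the actual fetch_sitemap_urls() implementation; the names parse_sitemap and collect_page_urls, and the fetch callable that returns XML text for a URL, are hypothetical.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip the XML namespace from a tag such as '{ns}urlset'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text):
    """Return (kind, locs) where kind is 'sitemapindex' or 'urlset'."""
    root = ET.fromstring(xml_text)
    kind = _local(root.tag)
    if kind not in ("sitemapindex", "urlset"):
        raise ValueError(f"not a sitemap document: <{kind}>")
    locs = []
    for entry in root:  # <sitemap> or <url> entries
        for child in entry:
            if _local(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return kind, locs


def collect_page_urls(sitemap_url, fetch, seen=None):
    """Follow index files recursively and return page URLs only."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against index loops
        return []
    seen.add(sitemap_url)
    kind, locs = parse_sitemap(fetch(sitemap_url))
    if kind == "urlset":
        return locs
    pages = []
    for child_sitemap in locs:
        pages.extend(collect_page_urls(child_sitemap, fetch, seen))
    return pages


NS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
fake_site = {
    "https://example.com/sitemap.xml": f"<sitemapindex {NS}><sitemap>"
    "<loc>https://example.com/post-sitemap.xml</loc></sitemap></sitemapindex>",
    "https://example.com/post-sitemap.xml": f"<urlset {NS}>"
    "<url><loc>https://example.com/a</loc></url>"
    "<url><loc>https://example.com/b</loc></url></urlset>",
}
pages = collect_page_urls("https://example.com/sitemap.xml", fake_site.__getitem__)
```

Treating an index as a list of further sitemaps, and never as pages, keeps child sitemap files out of the crawl queue.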
Airtable Iframe Extraction
URLs from <iframe src="https://airtable.com/embed/..."> are now extracted and placed at the iframe's location in the markdown output (not appended at the end).
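One way to keep the extracted URL at the iframe's position is to emit it in document order while walking the HTML. The sketch below uses Python's standard html.parser. The class name, the "[Airtable embed](...)" link format, and the embed ID in the sample are assumptions, not the engine's actual output format.

```python
from html.parser import HTMLParser

AIRTABLE_PREFIX = "https://airtable.com/embed/"


class IframeInlineParser(HTMLParser):
    """Collect text and Airtable iframe URLs in document order."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src") or ""
            if src.startswith(AIRTABLE_PREFIX):
                # Emitted here, so it lands where the iframe sits.
                self.parts.append(f"[Airtable embed]({src})")

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def html_to_markdown_with_iframes(html):
    parser = IframeInlineParser()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.parts)


# "shrEXAMPLE" is a made-up embed ID used only for this sample.
sample = (
    "<p>Before</p>"
    '<iframe src="https://airtable.com/embed/shrEXAMPLE"></iframe>'
    "<p>After</p>"
)
markdown = html_to_markdown_with_iframes(sample)
```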
Quality Criteria (NON-NEGOTIABLE)
- ecommerceranker.com: ≥300 Airtable store URLs
- mytablon.com/club/: ≥80 linkedin.com/in/ URLs
- All 16 test combos (4 sites × 4 param combos): PASS
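These thresholds lend themselves to an automated check. The sketch below counts unique linkedin.com/in/ URLs in crawl output and enumerates a 4 × 4 test matrix. The regex, the helper names, the two placeholder sites, and the two boolean flags standing in for the parameter combos are assumptions, since the actual test sites and parameters are not listed here.

```python
import re
from itertools import product

# ASSUMPTION: a profile URL is anything under linkedin.com/in/.
LINKEDIN_PROFILE = re.compile(
    r"https?://(?:[\w-]+\.)?linkedin\.com/in/[^\s)\"'<>]+"
)


def count_unique(pattern, text):
    """Count distinct matches of pattern in the crawl output."""
    return len(set(pattern.findall(text)))


def meets_threshold(pattern, text, minimum):
    return count_unique(pattern, text) >= minimum


# Placeholder matrix: only the first two sites appear in the criteria,
# and the flag pairs are hypothetical stand-ins for the real params.
sites = ["ecommerceranker.com", "mytablon.com/club/", "site-c.example", "site-d.example"]
param_combos = list(product([False, True], repeat=2))
test_matrix = list(product(sites, param_combos))

output = "\n".join(
    f"[member](https://www.linkedin.com/in/user{i})" for i in range(80)
)
passed = meets_threshold(LINKEDIN_PROFILE, output, 80)
```

Counting unique URLs rather than raw matches stops duplicated links from inflating a result past the threshold.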
VPS Worker Config
- Skip patterns: images, docs, media, CSS/JS, XML/RSS/atom
- Max pages per crawl: configurable
- Proxy support: pass-through from backend
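The skip patterns can be pictured as an extension filter applied to each candidate URL before it is queued. In the sketch below only .xml, .rss, and .atom come from this page; the other extensions, the category grouping, and the name should_skip are assumptions.

```python
import posixpath
from urllib.parse import urlparse

# ASSUMPTION: example extensions per category. Only the feeds row
# (.xml/.rss/.atom) is taken from the documented skip filter.
SKIP_EXTENSIONS = {
    "images": {".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp"},
    "docs": {".pdf", ".doc", ".docx", ".xls", ".xlsx"},
    "media": {".mp3", ".mp4", ".avi", ".mov"},
    "assets": {".css", ".js"},
    "feeds": {".xml", ".rss", ".atom"},
}
ALL_SKIPPED = set().union(*SKIP_EXTENSIONS.values())


def should_skip(url):
    """Return True if the URL path ends in a skipped extension."""
    path = urlparse(url).path.lower()  # query string is ignored
    return posixpath.splitext(path)[1] in ALL_SKIPPED


skip_sitemap = should_skip("https://example.com/wp-sitemap.xml")
skip_page = should_skip("https://example.com/blog/post")
```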