Crawl4AI Issues
Bugs investigated and fixed in the Crawl4AI scraping engine
Sitemap XML Crawled as Pages
Problem: WordPress sites include <link rel="sitemap"> in their HTML. Crawl4AI followed those links and crawled the sitemaps as pages, so raw XML appeared in the output.
Fix: Added .xml, .rss, and .atom to the skip filter, rewrote the sitemap parser, and added a Cloudflare browser fallback.
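The extension check in the fix above could look roughly like the sketch below. It is not the actual Crawl4AI filter: the function name `should_skip_url` and the constant `SKIP_EXTENSIONS` are hypothetical, and only the three extensions come from the notes.

```python
from urllib.parse import urlparse

# Extensions named in the fix; the constant name itself is hypothetical.
SKIP_EXTENSIONS = (".xml", ".rss", ".atom")


def should_skip_url(url: str) -> bool:
    """Return True when the URL path ends in a sitemap or feed extension."""
    # Look only at the path so query strings such as ?file=a.xml do not match.
    path = urlparse(url).path.lower()
    return path.endswith(SKIP_EXTENSIONS)
```

Checking the parsed path rather than the raw URL string keeps ordinary pages with an `.xml` value in their query string in the crawl.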
Airtable Iframe URLs at Wrong Position
Problem: Markdown extraction appended iframe content at the end of the document instead of where the iframe appeared.
Fix: Extract Airtable URLs from the iframe src attribute and inject them at the iframe's location.
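One way to get the URL into the right position is to rewrite each Airtable iframe as an inline link before the HTML is converted to Markdown. The sketch below assumes that approach. The function name `inline_airtable_iframes`, the regex, and the example URL in the usage note are illustrative and not taken from the Crawl4AI code.

```python
import re
from urllib.parse import urlparse

# Matches a whole <iframe> element and captures its quoted src attribute.
IFRAME_RE = re.compile(
    r'<iframe\b[^>]*?\bsrc=["\']([^"\']+)["\'][^>]*>.*?</iframe>',
    re.IGNORECASE | re.DOTALL,
)


def inline_airtable_iframes(html: str) -> str:
    """Replace Airtable iframes with a link placed where the iframe was."""

    def replace(match: re.Match) -> str:
        src = match.group(1)
        host = urlparse(src).hostname or ""
        if host == "airtable.com" or host.endswith(".airtable.com"):
            # The link stays in document order, so a later Markdown
            # conversion emits it at the iframe's original position.
            return f'<p><a href="{src}">{src}</a></p>'
        # Leave every other iframe untouched.
        return match.group(0)

    return IFRAME_RE.sub(replace, html)
```

For example, `<p>Before</p><iframe src="https://airtable.com/embed/abc123"></iframe><p>After</p>` (a made-up embed URL) keeps the link between the two paragraphs. A regex is enough for a sketch, but a real HTML parser would be more robust against unusual markup.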
LinkedIn URL Drop (mytablon)
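Problem: The number of LinkedIn profile URLs extracted was inconsistent (542 to 1639 profiles). Cause: DOM recycling on infinite scroll, where the page drops off-screen nodes as new ones load.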
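The notes do not record a fix for this issue. A common way to cope with DOM recycling is to collect URLs after every scroll step instead of reading the DOM once at the end, since earlier nodes may already be gone by then. The sketch below assumes that approach. `accumulate_profile_urls` and the `snapshots` input (one list of visible URLs per scroll step) are hypothetical names.

```python
from typing import Iterable, List


def accumulate_profile_urls(snapshots: Iterable[Iterable[str]]) -> List[str]:
    """Merge per-scroll snapshots into one de-duplicated, ordered list."""
    seen = set()
    ordered: List[str] = []
    for visible_urls in snapshots:
        for url in visible_urls:
            # A recycled list shows some URLs in several snapshots
            # and others in only one, so keep the union of all of them.
            if url not in seen:
                seen.add(url)
                ordered.append(url)
    return ordered
```

With snapshots `[["a", "b", "c"], ["c", "d"], ["d", "e", "f"]]`, the final snapshot alone holds three URLs while the accumulated result holds all six.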
Proxy Scoring 120s Hang
Problem: Scoring all 500 proxies ran one DB query per proxy (500 queries), which caused a 120 s hang.
Fix: Score a random sample of 20 proxies instead of all 500.
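The sampling step in the fix above can be sketched with the standard library's `random.sample`. The function name `pick_proxies_to_score` and the `rng` parameter are hypothetical. Only the sample size of 20 comes from the notes.

```python
import random
from typing import List, Optional, Sequence

SAMPLE_SIZE = 20  # value from the fix; the constant name is hypothetical


def pick_proxies_to_score(
    proxies: Sequence[str],
    k: int = SAMPLE_SIZE,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return at most k distinct proxies chosen at random."""
    chooser = rng or random
    if len(proxies) <= k:
        # Small pools are scored in full; sample() would raise if k > len.
        return list(proxies)
    return chooser.sample(list(proxies), k)
```

Scoring only the returned subset caps the work at 20 queries per round regardless of pool size, at the cost of scoring each proxy less often.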
Render 60-Minute Timeout
Problem: Render's hardcoded 60-minute timeout kills long jobs. The MyTablon job ran for 60.4 minutes.
Root cause: The DB pool recycles connections, the VPS logger swallows CancelledError, and the wrong Supabase pooler was in use.
Workaround: Keep jobs under 58 minutes. The longer-term fix needs RQ Workers.
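The 58-minute workaround can be enforced with a simple time budget that stops starting new work once the limit is reached. The sketch below is one possible shape, not the project's code. `run_within_budget`, `BUDGET_SECONDS`, and the injectable `clock` are hypothetical names.

```python
import time
from typing import Callable, Iterable, List, Tuple

BUDGET_SECONDS = 58 * 60  # workaround threshold from the notes


def run_within_budget(
    tasks: Iterable[Callable[[], object]],
    budget_seconds: float = BUDGET_SECONDS,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[object], List[Callable[[], object]]]:
    """Run tasks in order; return (results, tasks left unstarted)."""
    start = clock()
    results: List[object] = []
    pending = list(tasks)
    while pending:
        # Check the budget before each task. A task that is already
        # running is not interrupted, so keep individual tasks short.
        if clock() - start >= budget_seconds:
            break
        task = pending.pop(0)
        results.append(task())
    return results, pending
```

The unstarted tasks come back to the caller, which can persist them and resume in a fresh job. This only helps when the work splits into units much shorter than the two-minute margin.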
Last updated today