Investigations & Fixes

Crawl4AI Issues

Bugs investigated and fixed in the Crawl4AI scraping engine

Sitemap XML Crawled as Pages

Problem: WordPress sites include <link rel="sitemap"> in their HTML. Crawl4AI followed those links and crawled the sitemaps as ordinary pages, so raw XML ended up in the output. Fix: Added .xml, .rss, and .atom to the skip filter, rewrote the sitemap parser, and added a Cloudflare browser fallback.
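A minimal sketch of an extension-based skip filter, assuming the filter inspects only the URL path. The names `SKIP_EXTENSIONS` and `should_skip` are hypothetical and are not taken from the Crawl4AI codebase.

```python
from urllib.parse import urlparse

# Extensions named in the fix above; the constant name is an assumption.
SKIP_EXTENSIONS = (".xml", ".rss", ".atom")


def should_skip(url: str) -> bool:
    """Return True when the URL path ends in a sitemap or feed extension."""
    # Parse first so query strings and fragments do not hide the extension.
    path = urlparse(url).path.lower()
    return path.endswith(SKIP_EXTENSIONS)


links = [
    "https://example.com/about",
    "https://example.com/sitemap.xml",
    "https://example.com/feed.rss?page=2",
    "https://example.com/blog/post.atom",
]
crawlable = [u for u in links if not should_skip(u)]
```

Matching on the parsed path rather than the raw string keeps URLs such as `feed.rss?page=2` from slipping past the filter.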

Airtable Iframe URLs at Wrong Position

Problem: Markdown extraction appended iframe content at the end of the document instead of where the iframe appeared. Fix: Extract Airtable URLs from the iframe src attribute and inject them at the iframe's location.
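One way to keep the URL in position is to swap each Airtable iframe for a plain link before the HTML is converted to Markdown. The sketch below assumes a regex pass over the raw HTML is acceptable; the function name `inline_airtable_iframes` is hypothetical, and the actual fix may work on a parsed DOM instead.

```python
import re
from urllib.parse import urlparse

# Matches an <iframe> element and captures its src attribute value.
IFRAME_RE = re.compile(
    r'<iframe\b[^>]*\bsrc=["\']([^"\']+)["\'][^>]*>.*?</iframe>',
    re.IGNORECASE | re.DOTALL,
)


def inline_airtable_iframes(html: str) -> str:
    """Replace Airtable iframes with a link at the same position in the HTML."""

    def replace(match: re.Match) -> str:
        src = match.group(1)
        host = urlparse(src).hostname or ""
        # Leave iframes from other hosts untouched.
        if host != "airtable.com" and not host.endswith(".airtable.com"):
            return match.group(0)
        return f'<p><a href="{src}">{src}</a></p>'

    return IFRAME_RE.sub(replace, html)


html = (
    '<h1>Form</h1>'
    '<iframe class="airtable-embed" src="https://airtable.com/embed/appXYZ"></iframe>'
    '<p>After</p>'
)
result = inline_airtable_iframes(html)
```

Because the link replaces the iframe in place, a later Markdown conversion emits it between the surrounding elements rather than at the end.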

LinkedIn URL Drop (MyTablon)

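Problem: The number of LinkedIn URLs captured was inconsistent, ranging from 542 to 1,639 profiles. Root cause: DOM recycling on infinite scroll, which removes off-screen items from the page as new ones load.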
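The source does not state a fix for this issue. As an illustration of why a single read of the final DOM undercounts, the sketch below simulates scroll snapshots as plain lists and unions the URLs seen in each one. The function name `collect_profile_urls` and the sample data are hypothetical.

```python
def collect_profile_urls(snapshots):
    """Union the LinkedIn profile URLs seen across snapshots, in first-seen order."""
    seen = {}  # dict preserves insertion order and deduplicates keys
    for visible in snapshots:
        for url in visible:
            if "linkedin.com/in/" in url:
                seen.setdefault(url, None)
    return list(seen)


# Each inner list stands for the links present in the DOM after one scroll step.
# With DOM recycling, earlier items disappear as later ones are rendered.
snapshots = [
    ["https://linkedin.com/in/a", "https://linkedin.com/in/b"],
    ["https://linkedin.com/in/b", "https://linkedin.com/in/c"],
    ["https://linkedin.com/in/c", "https://linkedin.com/in/d"],
]
all_urls = collect_profile_urls(snapshots)
last_only = snapshots[-1]  # what a single read at the end of scrolling would see
```

Here the accumulated list holds four profiles while the final snapshot holds two, which is the kind of gap that would make counts vary with scroll timing.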

Proxy Scoring 120s Hang

Problem: Scoring 500 proxies issued 500 DB queries, one per proxy, and hung for 120 s. Fix: Score a random sample of 20 proxies instead of all 500.
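At 120 s for 500 queries, each query costs about 0.24 s, so a sample of 20 would take roughly 5 s if the per-query cost stays constant. A minimal sketch of the sampling step follows; `pick_proxies_to_score` and `SAMPLE_SIZE` are hypothetical names, and the scoring query itself is left out.

```python
import random

SAMPLE_SIZE = 20  # sample size from the fix above


def pick_proxies_to_score(proxies, sample_size=SAMPLE_SIZE, rng=random):
    """Return up to sample_size proxies chosen at random without replacement."""
    # random.sample raises ValueError if asked for more items than exist,
    # so small pools are returned whole.
    if len(proxies) <= sample_size:
        return list(proxies)
    return rng.sample(proxies, sample_size)


proxies = [f"proxy-{i}" for i in range(500)]
sampled = pick_proxies_to_score(proxies, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the selection reproducible in tests; production code can rely on the default module-level generator.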

Render 60-Minute Timeout

Problem: Render enforces a hardcoded 60-minute timeout that kills long jobs; the MyTablon job ran 60.4 minutes. Root cause: DB pool recycling, a VPS logger that swallows CancelledError, and the wrong Supabase pooler. Workaround: Keep jobs under 58 minutes. A proper fix needs RQ Workers.
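The 58-minute workaround can be sketched as a time budget that is checked between units of work. This is an assumption about how the limit might be enforced, not the project's implementation; `run_with_budget` is a hypothetical helper, and it does not interrupt a task that is already running.

```python
import time

RENDER_LIMIT_S = 60 * 60  # hard limit described above
SAFE_BUDGET_S = 58 * 60   # workaround target, leaving a 2-minute margin


def run_with_budget(tasks, budget_s=SAFE_BUDGET_S, clock=time.monotonic):
    """Run callables in order and stop starting new ones once the budget is spent.

    Returns (results, remaining_tasks) so the caller can requeue what is left.
    """
    start = clock()
    results = []
    for index, task in enumerate(tasks):
        if clock() - start >= budget_s:
            return results, tasks[index:]
        results.append(task())
    return results, []


# Simulated clock: the job starts at 0 s, and the third check lands at 3500 s,
# which is past the 3480 s budget.
ticks = iter([0, 0, 1800, 3500])
tasks = [lambda: 1, lambda: 2, lambda: 3]
results, remaining = run_with_budget(tasks, clock=lambda: next(ticks))
```

Returning the unstarted tasks lets a later run pick up where this one stopped instead of losing the work when the platform kills the process.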
