# ChangeDetection Engine

Website monitoring with before/after screenshots, text diffs, and webhook-triggered recurring jobs.

VPS: `31.97.43.159:5000` | API key: stored in `.env` | Playwright: `browserless/chrome` on the same Docker network
## Architecture

1. Backend triggers a CD.io watch
2. CD.io detects a change and fires the webhook
3. Backend creates a ScrapeJob
4. Crawl4AI takes a fresh screenshot (`POST /screenshot`)
5. Text diff is computed from CD.io history snapshots
6. Before/after screenshots are uploaded to Supabase Storage
7. Job becomes visible in the UI with the diff viewer
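The webhook-to-job step can be sketched as a small pure function that turns a ChangeDetection.io webhook payload into a ScrapeJob record. This is a minimal illustration, not the actual backend code: the field names (`watch_uuid`, `watch_url`) and the ScrapeJob shape are assumptions, so adjust them to the real webhook schema and DB model.

```python
import uuid
from datetime import datetime, timezone

def build_scrape_job(payload: dict) -> dict:
    """Build a ScrapeJob record from a change-notification webhook payload.

    Field names are illustrative, not the real webhook schema.
    """
    return {
        "id": str(uuid.uuid4()),
        "watch_uuid": payload["watch_uuid"],   # which CD.io watch fired
        "url": payload["watch_url"],           # page that changed
        "trigger": "webhook",                  # vs. a manual trigger
        "status": "pending",                   # screenshot/diff still to run
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```

The job is created in `pending` state first so the UI can show it immediately, while the screenshot and diff steps fill in the rest asynchronously.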
## Key Features

- Before/after screenshots: fresh Playwright screenshots via the Crawl4AI VPS
- Text diffs: unified diff with color-coded added/removed lines
- Supabase Storage: screenshots stored in the `screenshots` bucket, URLs in DB metadata
- Screenshot file server: port 5001, systemd service on the CD VPS
## Fixes Applied

### Screenshot Race Condition

When webhook jobs fire every ~60s, the previous job may not have finished uploading its screenshot yet. Fix: a 3-tier lookup for the "before" screenshot:

1. In-memory cache (instant, race-free)
2. Supabase Storage `baseline.png` (may be stale)
3. DB lookup from the previous completed job
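The 3-tier fallback can be sketched as a single resolver that tries each source in order. This is a hedged sketch, not the production code: the storage and DB clients are stubbed as callables, and all names (`find_before_screenshot`, the callable parameters) are illustrative.

```python
from typing import Callable, Optional

def find_before_screenshot(
    watch_id: str,
    cache: dict,
    fetch_storage_baseline: Callable[[str], Optional[bytes]],
    fetch_last_completed_job: Callable[[str], Optional[bytes]],
) -> Optional[bytes]:
    """Resolve the 'before' screenshot with a 3-tier fallback.

    Tier 1: in-memory cache, written when the previous job finished;
    Tier 2: baseline.png in Supabase Storage (may be stale);
    Tier 3: screenshot URL from the last completed job in the DB.
    The callables stand in for the real storage/DB clients.
    """
    shot = cache.get(watch_id)                     # tier 1: race-free
    if shot is None:
        shot = fetch_storage_baseline(watch_id)    # tier 2: possibly stale
    if shot is None:
        shot = fetch_last_completed_job(watch_id)  # tier 3: DB fallback
    return shot
```

Ordering matters: the cache is consulted first because it is the only source guaranteed not to be mid-upload when jobs fire back-to-back.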
### False "No Changes" Bug

Character similarity came out at 96.3% on a Wikipedia page (4,000 static lines, 50 dynamic), but the threshold was 95%, so genuine changes were reported as "no changes". Fix: always compute the unified diff; the changed-line count determines whether the page changed, not the similarity ratio.
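The fix can be illustrated with the standard library's `difflib`: compute the similarity ratio for reporting, but decide "changed vs. unchanged" from the added/removed lines of a unified diff. A minimal sketch, with `detect_change` as an illustrative name rather than the real function:

```python
import difflib

def detect_change(before: str, after: str) -> dict:
    """Decide whether a page changed from its text snapshots.

    The similarity ratio alone is misleading: a handful of dynamic
    lines in a large page keep the ratio above any fixed threshold,
    so the diff's changed-line count is the source of truth.
    """
    before_lines = before.splitlines()
    after_lines = after.splitlines()
    ratio = difflib.SequenceMatcher(None, before, after).ratio()
    diff = list(difflib.unified_diff(before_lines, after_lines, lineterm=""))
    # Count real +/- lines, skipping the "---"/"+++" file headers.
    changed_lines = [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
    return {
        "changed": bool(changed_lines),
        "ratio": ratio,
        "diff": "\n".join(diff),
    }
```

On a 100-line page where a single line changes, the ratio stays well above 0.95 while `changed` is still correctly `True`.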
### Screenshot-Text Mismatch (PARTIALLY FIXED)

- Fixed for manual triggers (commit a636bbc): added `screenshot_after_url` column
- Still broken for webhook-triggered recurring jobs (store screenshot=NO)
- Open investigation: webhook code path in `webhooks.py`