Scraping Pipeline

ChangeDetection Engine

Website monitoring with before/after screenshots, text diffs, and webhook-triggered recurring jobs


VPS: 31.97.43.159:5000 | API Key: stored in .env | Playwright: browserless/chrome on same Docker network

Architecture

  1. Backend triggers CD.io watch → CD.io detects a change → webhook fires → backend creates a ScrapeJob
  2. Crawl4AI takes a fresh screenshot (POST /screenshot)
  3. Text diff is computed from CD.io history snapshots
  4. Before/after screenshots are uploaded to Supabase Storage
  5. Job is visible in the UI with the diff viewer
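The flow above can be sketched as a single orchestration function. All names here are hypothetical (the real handler lives in webhooks.py), and the external services (Crawl4AI, CD.io, Supabase) are injected as callables; this shows the control flow, not the production code:

```python
import difflib

# Hypothetical sketch of the webhook-driven flow. External services are
# injected as callables so the control flow can run without the VPS services.
def handle_change_webhook(watch_uuid, take_screenshot, fetch_history,
                          upload, create_job):
    """CD.io change webhook -> ScrapeJob with text diff + 'after' screenshot."""
    after_png = take_screenshot(watch_uuid)            # Crawl4AI POST /screenshot
    before_txt, after_txt = fetch_history(watch_uuid)  # CD.io snapshot history
    diff = list(difflib.unified_diff(
        before_txt.splitlines(), after_txt.splitlines(), lineterm=""))
    after_url = upload(f"{watch_uuid}/after.png", after_png)  # Supabase Storage
    return create_job(watch_uuid, diff=diff, screenshot_after_url=after_url)
```

Injecting the dependencies also makes the race-condition and mismatch bugs below easier to reproduce in isolation.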

Key Features

  • Before/after screenshots: Fresh Playwright screenshots via Crawl4AI VPS
  • Text diffs: Unified diff with color-coded added/removed lines
  • Supabase Storage: Screenshots stored in screenshots bucket, URLs in DB metadata
  • Screenshot file server: Port 5001, systemd service on CD VPS
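As an illustration of the color-coded text diff, here is a minimal version using Python's difflib and ANSI escape codes. The real viewer renders HTML in the UI; this only shows how added/removed lines are classified:

```python
import difflib

# Illustrative only: classify unified-diff lines and color them with ANSI
# codes (green = added, red = removed). The UI does the same classification
# but renders HTML instead of terminal colors.
def color_diff(before: str, after: str) -> list[str]:
    out = []
    for line in difflib.unified_diff(
            before.splitlines(), after.splitlines(), lineterm=""):
        if line.startswith("+") and not line.startswith("+++"):
            out.append(f"\033[32m{line}\033[0m")   # added line
        elif line.startswith("-") and not line.startswith("---"):
            out.append(f"\033[31m{line}\033[0m")   # removed line
        else:
            out.append(line)                       # context / headers
    return out
```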

Fixes Applied

Screenshot Race Condition

When webhook jobs fire every ~60s, the previous job may not have finished uploading its screenshot yet, so the "before" image can be missing. Fix: a 3-tier lookup for the "before" screenshot:

  1. In-memory cache (instant, race-free)
  2. Supabase Storage baseline.png (may be stale)
  3. DB lookup from previous completed job
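A sketch of the 3-tier lookup with hypothetical names; each tier is injected as a callable so the fallback order is explicit:

```python
# Hypothetical sketch of the 3-tier "before" screenshot lookup.
# cache: dict-like in-memory store; storage_get / db_last_completed: callables
# for Supabase Storage and the DB, returning bytes or None.
def find_before_screenshot(watch_uuid, cache, storage_get, db_last_completed):
    # Tier 1: in-memory cache written by the previous job (instant, race-free)
    png = cache.get(watch_uuid)
    if png is not None:
        return png, "cache"
    # Tier 2: Supabase Storage baseline.png (may be stale)
    png = storage_get(f"{watch_uuid}/baseline.png")
    if png is not None:
        return png, "storage"
    # Tier 3: screenshot from the last completed job row in the DB
    return db_last_completed(watch_uuid), "db"
```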

False "No Changes" Bug

On a Wikipedia test page (4,000 static lines, 50 dynamic), character similarity was 96.3%, above the 95% "unchanged" threshold, so real changes were reported as no change. Fix: always compute the unified diff; the number of added/removed lines decides whether the page changed, not the similarity ratio.
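A minimal sketch of the fix: change is decided by counting added/removed lines in the unified diff, so a handful of dynamic lines in a mostly-static page still registers even when overall similarity is high:

```python
import difflib

# Decide "changed" from the unified diff's added/removed line count,
# not from a character-similarity ratio.
def has_changed(before: str, after: str) -> bool:
    changed_lines = sum(
        1 for line in difflib.unified_diff(
            before.splitlines(), after.splitlines(), lineterm="")
        if (line.startswith("+") or line.startswith("-"))
        and not line.startswith(("+++", "---")))   # skip file headers
    return changed_lines > 0
```

A page with hundreds of identical lines and one changed line keeps a very high SequenceMatcher ratio, yet has_changed correctly returns True.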

Screenshot-Text Mismatch (PARTIALLY FIXED)

  • Fixed for manual triggers (commit a636bbc): added screenshot_after_url column
  • Still broken for webhook-triggered recurring jobs (store screenshot=NO)
  • Open investigation: webhook code path in webhooks.py