Scraping Pipeline

ChangeDetection Engine

Website monitoring with before/after screenshots, text diffs, and webhook-triggered recurring jobs


VPS: 31.97.43.159:5000 | API Key: stored in .env | Playwright: browserless/chrome on same Docker network

Architecture

  1. Backend triggers CD.io watch → CD.io detects a change → webhook fires → backend creates a ScrapeJob
  2. Crawl4AI takes a fresh screenshot (POST /screenshot)
  3. Text diff is computed from CD.io history snapshots
  4. Before/after screenshots are uploaded to Supabase Storage
  5. Job is visible in the UI with the diff viewer
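The flow above can be sketched as a single orchestration function. All names here are hypothetical (the real handler lives in webhooks.py), and the external services (Crawl4AI, CD.io, Supabase) are injected as callables; this shows the control flow, not the production code:

```python
import difflib

# Hypothetical sketch of the webhook-driven flow. External services are
# injected as callables so the control flow can run without the VPS services.
def handle_change_webhook(watch_uuid, take_screenshot, fetch_history,
                          upload, create_job):
    """CD.io change webhook -> ScrapeJob with text diff + 'after' screenshot."""
    after_png = take_screenshot(watch_uuid)            # Crawl4AI POST /screenshot
    before_txt, after_txt = fetch_history(watch_uuid)  # CD.io snapshot history
    diff = list(difflib.unified_diff(
        before_txt.splitlines(), after_txt.splitlines(), lineterm=""))
    after_url = upload(f"{watch_uuid}/after.png", after_png)  # Supabase Storage
    return create_job(watch_uuid, diff=diff, screenshot_after_url=after_url)
```

Injecting the dependencies also makes the race-condition and mismatch bugs below easier to reproduce in isolation.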

Key Features

  • Before/after screenshots: Fresh Playwright screenshots via Crawl4AI VPS
  • Text diffs: Unified diff with color-coded added/removed lines
  • Supabase Storage: Screenshots stored in screenshots bucket, URLs in DB metadata
  • Screenshot file server: Port 5001, systemd service on CD VPS
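As an illustration of the color-coded text diff, here is a minimal version using Python's difflib and ANSI escape codes. The real viewer renders HTML in the UI; this only shows how added/removed lines are classified:

```python
import difflib

# Illustrative only: classify unified-diff lines and color them with ANSI
# codes (green = added, red = removed). The UI does the same classification
# but renders HTML instead of terminal colors.
def color_diff(before: str, after: str) -> list[str]:
    out = []
    for line in difflib.unified_diff(
            before.splitlines(), after.splitlines(), lineterm=""):
        if line.startswith("+") and not line.startswith("+++"):
            out.append(f"\033[32m{line}\033[0m")   # added line
        elif line.startswith("-") and not line.startswith("---"):
            out.append(f"\033[31m{line}\033[0m")   # removed line
        else:
            out.append(line)                       # context / headers
    return out
```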

Fixes Applied

Screenshot Race Condition

When webhook jobs fire every ~60s, the previous job may not have finished uploading its screenshot yet, so the "before" image can be missing. Fix: a 3-tier lookup for the "before" screenshot:

  1. In-memory cache (instant, race-free)
  2. Supabase Storage baseline.png (may be stale)
  3. DB lookup from previous completed job
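A sketch of the 3-tier lookup with hypothetical names; each tier is injected as a callable so the fallback order is explicit:

```python
# Hypothetical sketch of the 3-tier "before" screenshot lookup.
# cache: dict-like in-memory store; storage_get / db_last_completed: callables
# for Supabase Storage and the DB, returning bytes or None.
def find_before_screenshot(watch_uuid, cache, storage_get, db_last_completed):
    # Tier 1: in-memory cache written by the previous job (instant, race-free)
    png = cache.get(watch_uuid)
    if png is not None:
        return png, "cache"
    # Tier 2: Supabase Storage baseline.png (may be stale)
    png = storage_get(f"{watch_uuid}/baseline.png")
    if png is not None:
        return png, "storage"
    # Tier 3: screenshot from the last completed job row in the DB
    return db_last_completed(watch_uuid), "db"
```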

False "No Changes" Bug

On a Wikipedia test page (4,000 static lines, 50 dynamic), character similarity was 96.3%, above the 95% "unchanged" threshold, so real changes were reported as no change. Fix: always compute the unified diff; the number of added/removed lines decides whether the page changed, not the similarity ratio.
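A minimal sketch of the fix: change is decided by counting added/removed lines in the unified diff, so a handful of dynamic lines in a mostly-static page still registers even when overall similarity is high:

```python
import difflib

# Decide "changed" from the unified diff's added/removed line count,
# not from a character-similarity ratio.
def has_changed(before: str, after: str) -> bool:
    changed_lines = sum(
        1 for line in difflib.unified_diff(
            before.splitlines(), after.splitlines(), lineterm="")
        if (line.startswith("+") or line.startswith("-"))
        and not line.startswith(("+++", "---")))   # skip file headers
    return changed_lines > 0
```

A page with hundreds of identical lines and one changed line keeps a very high SequenceMatcher ratio, yet has_changed correctly returns True.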

Screenshot-Text Mismatch (PARTIALLY FIXED)

  • Fixed for manual triggers (commit a636bbc): added screenshot_after_url column
  • Still broken for webhook-triggered recurring jobs (store screenshot=NO)
  • Open investigation: webhook code path in webhooks.py