Scraping Pipeline

JobSpy Engine

Job board scraping via python-jobspy with query expansion, relevance filtering, and pagination

VPS: 76.13.123.120:8889 | Container: jobspy-worker-ao4kgkkc8gsss0wocscc040w

How It Works

Frontend → POST /api/v1/scraping/scrape (mode=source_query)
  → Backend: ScrapingService → JobspyEngine.run()
    → _get_one_proxy_url_from_pool() (Webshare residential)
    → _build_vps_payload() (normalize params, green-light keys only)
    → POST http://76.13.123.120:8889/scrape
  → VPS Worker runs python-jobspy locally
  → Returns {success, count, data}
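
The last step of the flow returns a `{success, count, data}` envelope. A minimal sketch of how a caller might unpack it, assuming the body is JSON and that `data` is a list of job records; `parse_worker_response` and the `count` consistency check are hypothetical and not taken from the backend code:

```python
import json


def parse_worker_response(body: str) -> list:
    """Unpack a {success, count, data} envelope (hypothetical helper)."""
    envelope = json.loads(body)
    if not envelope.get("success"):
        # Assumption: a falsy "success" means the whole scrape failed.
        raise RuntimeError("worker reported an unsuccessful scrape")
    data = envelope.get("data") or []
    # Assumption: "count" is the number of records in "data".
    if envelope.get("count", len(data)) != len(data):
        raise ValueError("count does not match the number of records")
    return data


jobs = parse_worker_response(
    '{"success": true, "count": 1, "data": [{"title": "VP Marketing"}]}'
)
```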

Key Features

Query Expansion (Fix 3)

"VP of Marketing" expands to 9 queries, the original term plus 8 variants: VP Marketing, Vice President Marketing, Vice President of Marketing, Head of Marketing, Marketing Director, Director of Marketing, Chief Marketing Officer, CMO. Each query gets up to 3 LinkedIn pages.
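
A minimal sketch of the expansion step, assuming a static lookup table keyed by the lowercased search term. `TITLE_VARIANTS` and `expand_query` are hypothetical names, and the engine may well generate variants differently:

```python
# Hypothetical lookup table; only the example from this page is filled in.
TITLE_VARIANTS = {
    "vp of marketing": [
        "VP Marketing",
        "Vice President Marketing",
        "Vice President of Marketing",
        "Head of Marketing",
        "Marketing Director",
        "Director of Marketing",
        "Chief Marketing Officer",
        "CMO",
    ],
}


def expand_query(search_term: str) -> list:
    """Return the original term followed by its variants, without duplicates."""
    term = search_term.strip()
    variants = TITLE_VARIANTS.get(term.lower(), [])
    seen = set()
    queries = []
    for query in [term, *variants]:
        key = query.lower()
        if key not in seen:
            seen.add(key)
            queries.append(query)
    return queries


queries = expand_query("VP of Marketing")
```

Terms without an entry in the table fall through unchanged, so an unknown title still yields a single query.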

Relevance Scoring (Fix 4)

Two-layer filter:

  1. Domain keyword scoring: if the title contains the core search keyword, score = 0.6
  2. Executive threshold: the minimum score is 0.35 for executive searches (vs 0.2 by default)

Result: no irrelevant titles in the filtered output. Every returned title is relevant.
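
The two layers can be sketched as follows. Only the 0.6 score and the 0.35 / 0.2 thresholds come from this page. The `EXECUTIVE_MARKERS` list, the 0.0 score for non-matching titles, and all function names are assumptions made for illustration:

```python
# Assumption: an "executive search" is detected by these substrings.
EXECUTIVE_MARKERS = ("vp", "vice president", "head of", "director", "chief", "cmo")


def keyword_score(title: str, core_keyword: str) -> float:
    """Layer 1: 0.6 if the title contains the core keyword (else 0.0, assumed)."""
    return 0.6 if core_keyword.lower() in title.lower() else 0.0


def threshold_for(search_term: str) -> float:
    """Layer 2: stricter cutoff for executive searches."""
    term = search_term.lower()
    is_executive = any(marker in term for marker in EXECUTIVE_MARKERS)
    return 0.35 if is_executive else 0.2


def filter_titles(titles, core_keyword, search_term):
    """Keep titles whose score meets the threshold for this search."""
    cutoff = threshold_for(search_term)
    return [t for t in titles if keyword_score(t, core_keyword) >= cutoff]


kept = filter_titles(
    ["VP Marketing", "Marketing Director", "Warehouse Associate"],
    core_keyword="marketing",
    search_term="VP of Marketing",
)
```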

Pagination (Fix 5)

  • per_query_wanted = max(results_wanted // len(queries), 25)
  • Minimum 25 per variant ensures multi-page LinkedIn results
  • Before the fix: 9 variants × 1 page = 50 raw results. After: 9 variants × 3 pages = 89 raw results
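
As a worked example of the formula above, here is a small sketch. The function wrapper and the guard against an empty query list are additions for illustration:

```python
def per_query_wanted(results_wanted: int, num_queries: int) -> int:
    """Split the requested total across queries, with a floor of 25 each."""
    if num_queries < 1:
        raise ValueError("at least one query is required")
    return max(results_wanted // num_queries, 25)


# 50 requested across 9 queries: 50 // 9 = 5, raised to the floor of 25.
small_request = per_query_wanted(50, 9)

# 450 requested across 9 queries: 450 // 9 = 50, already above the floor.
large_request = per_query_wanted(450, 9)
```

With the floor in place, a small `results_wanted` no longer starves each variant of results.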

Provider Status

Provider | Status | Notes
LinkedIn | Working | 3 pages per variant, best results
Indeed | Working | Good for broad searches, weak for executive
Glassdoor | Disabled | Always 403, behind feature flag
ZipRecruiter | Geo-blocked | EU proxy → GDPR 403, returns status="geo_blocked"

Test Results (VP Marketing, Canada)

  • Before fixes: 18 results (1 LinkedIn page only)
  • After fixes: 37 filtered / 89 raw (9 variants × 3 pages)
  • 7/7 test suite passing consistently

Backend Safety

  • Proxy always injected (never exposes VPS IP)
  • Only green-light params sent: search_term, location, is_remote, results_wanted, hours_old, job_type, site_names
  • Defaults: search_term="software engineer", location="Remote", results_wanted=10
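
The allowlist and defaults above could be applied along these lines. This is a sketch, not the actual `_build_vps_payload()`: `build_payload` is a hypothetical stand-in, and proxy injection is left out because this page does not say how the proxy URL is passed to the worker:

```python
GREEN_LIGHT_KEYS = (
    "search_term",
    "location",
    "is_remote",
    "results_wanted",
    "hours_old",
    "job_type",
    "site_names",
)

DEFAULTS = {
    "search_term": "software engineer",
    "location": "Remote",
    "results_wanted": 10,
}


def build_payload(params: dict) -> dict:
    """Start from the defaults, then copy over only allowlisted, non-None keys."""
    payload = dict(DEFAULTS)
    for key in GREEN_LIGHT_KEYS:
        value = params.get(key)
        if value is not None:
            payload[key] = value
    return payload


payload = build_payload(
    {"search_term": "VP of Marketing", "site_names": ["linkedin"], "debug": True}
)
```

Any key outside the allowlist (here `debug`) is silently dropped rather than forwarded to the worker.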