JobSpy Engine
Job board scraping via python-jobspy with query expansion, relevance filtering, and pagination
VPS: 76.13.123.120:8889 | Container: jobspy-worker-ao4kgkkc8gsss0wocscc040w
How It Works
Frontend → POST /api/v1/scraping/scrape (mode=source_query)
→ Backend: ScrapingService → JobspyEngine.run()
→ _get_one_proxy_url_from_pool() (Webshare residential)
→ _build_vps_payload() (normalize params, green-light keys only)
→ POST http://76.13.123.120:8889/scrape
→ VPS Worker runs python-jobspy locally
→ Returns {success, count, data}
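The last three steps of this flow can be sketched as below. This is a minimal illustration, not the backend's actual code: `post_json`, `run_scrape`, and the `proxy` payload field are hypothetical names, and the handling of a non-success response is an assumption. Only the worker URL and the `{success, count, data}` response shape come from this page.

```python
import json
import urllib.request
from typing import Callable

WORKER_URL = "http://76.13.123.120:8889/scrape"  # worker endpoint from this page


def post_json(url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def run_scrape(
    payload: dict,
    proxy_url: str,
    post: Callable[[str, dict], dict] = post_json,
) -> list[dict]:
    """Send the payload to the worker and return the scraped job records."""
    # "proxy" is a hypothetical field name; the real payload key is not documented here.
    body = post(WORKER_URL, {**payload, "proxy": proxy_url})
    if not body.get("success"):
        # Assumption: a falsy "success" means the scrape failed.
        raise RuntimeError(f"worker reported failure: {body}")
    return body.get("data", [])
```

Passing `post` as a parameter keeps the sketch testable without network access; a fake transport can stand in for the worker.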
Key Features
Query Expansion (Fix 3)
"VP of Marketing" expands to 9 queries: the original plus 8 variants (VP Marketing, Vice President Marketing, Vice President of Marketing, Head of Marketing, Marketing Director, Director of Marketing, Chief Marketing Officer, CMO). Each query gets up to 3 LinkedIn pages.
Relevance Scoring (Fix 4)
Two-layer filter:
- Domain keyword scoring: If title contains core search keyword → score = 0.6
- Executive threshold: 0.35 for executive searches (vs 0.2 default)
Result: 0% noise. Every returned title is relevant.
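The two layers can be sketched as a score function plus a threshold check. Only the three numbers (0.6, 0.35, 0.2) come from this page; `SENIORITY_WORDS`, the executive-search heuristic, and the score of 0.0 for non-matching titles are assumptions made for illustration.

```python
DOMAIN_MATCH_SCORE = 0.6     # title contains a core search keyword
EXECUTIVE_THRESHOLD = 0.35   # threshold for executive searches
DEFAULT_THRESHOLD = 0.2      # threshold for all other searches

# Hypothetical: words treated as seniority markers rather than domain keywords.
SENIORITY_WORDS = {"vp", "vice", "president", "head", "director", "chief", "officer"}
STOP_WORDS = {"of", "the", "and"}


def core_keywords(search_term: str) -> list[str]:
    """Domain words left after removing seniority markers and stop words."""
    words = search_term.lower().split()
    return [w for w in words if w not in SENIORITY_WORDS and w not in STOP_WORDS]


def is_executive_search(search_term: str) -> bool:
    """Assumed heuristic: any seniority marker makes the search executive."""
    return bool(set(search_term.lower().split()) & SENIORITY_WORDS)


def score_title(title: str, search_term: str) -> float:
    """Layer 1: domain keyword scoring."""
    title_lower = title.lower()
    if any(keyword in title_lower for keyword in core_keywords(search_term)):
        return DOMAIN_MATCH_SCORE
    return 0.0  # assumption: the page does not say what non-matching titles score


def filter_titles(titles: list[str], search_term: str) -> list[str]:
    """Layer 2: keep titles whose score meets the search-type threshold."""
    threshold = (
        EXECUTIVE_THRESHOLD if is_executive_search(search_term) else DEFAULT_THRESHOLD
    )
    return [t for t in titles if score_title(t, search_term) >= threshold]
```

In this reduced sketch only two scores are possible, so both thresholds behave the same; the distinction would matter once additional, lower-scoring signals contribute to the score.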
Pagination (Fix 5)
per_query_wanted = max(results_wanted // len(queries), 25)
- Minimum 25 per variant ensures multi-page LinkedIn results
- Before fix: 9 variants × 1 page yielded 50 raw results. After: 9 variants × 3 pages yielded 89 raw results
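A short worked example of the formula may help show why the floor matters. The function wrapper, the `floor` parameter name, and the empty-list guard are additions for illustration; the expression itself is the one given above.

```python
def per_query_wanted(results_wanted: int, queries: list[str], floor: int = 25) -> int:
    """Split the requested total across queries, never dropping below the floor."""
    if not queries:
        raise ValueError("at least one query is required")
    return max(results_wanted // len(queries), floor)


queries = ["q"] * 9  # stand-in for the 9 expanded queries

# 50 // 9 == 5, which is below the floor, so each query requests 25.
small_request = per_query_wanted(50, queries)

# 450 // 9 == 50, which is above the floor, so the even split is used.
large_request = per_query_wanted(450, queries)
```

Without the floor, a request for 50 results across 9 queries would ask for only 5 per query, which would not reach past the first page.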
Provider Status
| Provider | Status | Notes |
|---|---|---|
| LinkedIn | Working | 3 pages per variant, best results |
| Indeed | Working | Good for broad searches, weak for executive |
| Glassdoor | Disabled | Always 403, behind feature flag |
| ZipRecruiter | Geo-blocked | EU proxy → GDPR 403, returns status="geo_blocked" |
Test Results (VP Marketing, Canada)
- Before fixes: 18 results (1 LinkedIn page only)
- After fixes: 37 filtered / 89 raw (9 variants × 3 pages)
- 7/7 test suite passing consistently
Backend Safety
- Proxy always injected (never exposes VPS IP)
- Only green-light params sent: search_term, location, is_remote, results_wanted, hours_old, job_type, site_names
- Default: search_term="software engineer", location="Remote", results_wanted=10
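The allow-list and defaults above could be applied roughly as follows. This is a sketch under assumptions: `build_vps_payload` here is a standalone stand-in for the backend's `_build_vps_payload()`, the `proxy` field name is hypothetical, and rejecting a missing proxy with an exception is one possible way to enforce "always injected".

```python
# Allow-listed parameter names, as listed on this page.
GREEN_LIGHT_KEYS = (
    "search_term",
    "location",
    "is_remote",
    "results_wanted",
    "hours_old",
    "job_type",
    "site_names",
)

# Defaults, as listed on this page.
DEFAULTS = {
    "search_term": "software engineer",
    "location": "Remote",
    "results_wanted": 10,
}


def build_vps_payload(params: dict, proxy_url: str) -> dict:
    """Keep only allow-listed params, fill defaults, and attach the proxy."""
    if not proxy_url:
        # Assumption: refuse to build a payload that would go out without a proxy.
        raise ValueError("a proxy URL is required")
    payload = dict(DEFAULTS)
    for key in GREEN_LIGHT_KEYS:
        value = params.get(key)
        if value is not None:
            payload[key] = value
    payload["proxy"] = proxy_url  # hypothetical field name
    return payload
```

Any key outside the allow-list, such as a stray debug flag from the frontend, is silently dropped rather than forwarded to the worker.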