Scrapers

A reference list of tools for extracting structured data from websites, from single-page fetches to large-scale crawl pipelines.

Source

cassidoo/scrapers — original curated collection

Python

Library	Notes
Beautiful Soup	HTML/XML parsing; pairs with `requests`
Scrapy	Full crawl framework, pipelines, middlewares
Playwright	Cross-browser automation (Chromium, Firefox, WebKit); headless by default
Selenium	Browser automation via WebDriver (heavier setup, widely used)
httpx	HTTP client with both sync and async APIs
lxml	Fast XML/HTML parsing

Node.js

Library	Notes
Cheerio	jQuery-like HTML parsing and traversal on the server
Puppeteer	Chrome DevTools Protocol automation
Playwright	Cross-browser automation

Other

Tool	Notes
Colly	Go crawling framework
Mechanize	Ruby stateful browsing
wget / curl	Simple fetch and mirror

Practices

Respect robots.txt and site Terms of Service
Rate-limit requests and identify your bot with a User-Agent string that includes contact details
Prefer APIs and official feeds when available
Cache responses; use incremental crawls for large sites
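
The practices above can be sketched with the Python standard library alone. This is a minimal illustration, not a production crawler: `USER_AGENT`, `load_robots`, `RateLimiter`, and `polite_get` are hypothetical names chosen for this example, the contact address is a placeholder, and the cache is a plain in-memory dict rather than anything persistent.

```python
import time
import urllib.request
import urllib.robotparser
from typing import Optional

# Hypothetical bot identity; replace with your own name and contact details.
USER_AGENT = "example-bot/0.1 (+mailto:ops@example.com)"


def load_robots(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Build a robots.txt parser from text that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def polite_get(url: str, robots, limiter: RateLimiter,
               cache: Optional[dict] = None, timeout: float = 10.0) -> bytes:
    """Fetch a URL only if robots.txt allows it, rate-limited and cached."""
    if not robots.can_fetch(USER_AGENT, url):
        raise PermissionError(f"robots.txt disallows {url}")
    if cache is not None and url in cache:
        return cache[url]  # skip the network for responses already seen
    limiter.wait()
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    if cache is not None:
        cache[url] = body
    return body
```

In real use the parser would more likely be filled by calling `set_url()` and `read()` on `RobotFileParser` so that robots.txt is fetched from the target site, and the cache would be written to disk so that incremental crawls survive restarts.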
