Scrapers

Reference list of tools for extracting structured data from websites — from single-page fetch to large-scale crawl pipelines.

Python

Library         Notes
Beautiful Soup  HTML/XML parsing; pairs with requests
Scrapy          Full crawl framework, pipelines, middlewares
Playwright      Headless browser automation
Selenium        Browser automation (heavier, widely used)
httpx           Async HTTP client
lxml            Fast XML/HTML parsing
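
As a rough illustration of the single-page end of this list, the sketch below parses a small HTML snippet with Beautiful Soup and pulls out link text and targets. The inline HTML, the example.com URLs, and the helper name extract_links are assumptions made for the example; in practice the markup would typically come from an HTTP client such as requests.

```python
from bs4 import BeautifulSoup

# Inline HTML stands in for a fetched response body (assumption for the sketch).
HTML = """
<html><body>
  <ul id="tools">
    <li><a href="https://example.com/a">Tool A</a></li>
    <li><a href="https://example.com/b">Tool B</a></li>
    <li><a>No link target</a></li>
  </ul>
</body></html>
"""


def extract_links(html):
    """Return (text, href) pairs for every anchor that has an href attribute."""
    # "html.parser" is the stdlib backend; lxml could be swapped in for speed.
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


links = extract_links(HTML)
for text, href in links:
    print(f"{text}: {href}")
```

Passing href=True to find_all skips anchors without a target, so the lookup a["href"] cannot fail on malformed entries.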

Node.js

Library     Notes
Cheerio     jQuery-like server-side HTML
Puppeteer   Chrome DevTools Protocol automation
Playwright  Cross-browser automation

Other

Tool         Notes
Colly        Go crawling framework
Mechanize    Ruby stateful browsing
wget / curl  Simple fetch and mirror

Practices

  • Respect robots.txt and site Terms of Service
  • Rate-limit and identify your bot with a contact User-Agent
  • Prefer APIs and official feeds when available
  • Cache responses; use incremental crawls for large sites
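
The first two practices can be sketched with the Python standard library alone. This is a minimal illustration, not a complete crawler: the bot name, the contact address, the inline robots.txt rules, and the RateLimiter class are all hypothetical, and caching is left out.

```python
import time
from urllib import robotparser

# Hypothetical bot identity; use your own name and a reachable contact address.
USER_AGENT = "example-bot/0.1 (+mailto:bot-owner@example.com)"

# Inline rules stand in for a robots.txt file fetched from the target site.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]


def build_robots(lines):
    """Parse robots.txt rules from an iterable of lines."""
    parser = robotparser.RobotFileParser()
    parser.parse(lines)
    return parser


class RateLimiter:
    """Enforce a minimum interval, in seconds, between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


robots = build_robots(ROBOTS_LINES)
# Fall back to a 1-second delay when the site declares no Crawl-delay.
limiter = RateLimiter(robots.crawl_delay(USER_AGENT) or 1)

for url in ("https://example.com/public/page", "https://example.com/private/page"):
    if robots.can_fetch(USER_AGENT, url):
        print("allowed:", url)  # a real crawler would call limiter.wait() and then fetch
    else:
        print("blocked by robots.txt:", url)
```

In a real crawl, the same USER_AGENT string would also be sent as the User-Agent request header, and limiter.wait() would run before each request. The clock and sleep parameters exist only so the limiter can be exercised without real delays.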

Curated technical notes — open source on GitHub