Scrapers
Reference list of tools for extracting structured data from websites — from single-page fetch to large-scale crawl pipelines.
Source
- cassidoo/scrapers — original curated collection
Python
| Library | Notes |
|---|---|
| Beautiful Soup | HTML/XML parsing; pairs with requests |
| Scrapy | Full crawl framework, pipelines, middlewares |
| Playwright | Headless browser automation |
| Selenium | Browser automation (heavier, widely used) |
| httpx | HTTP client with both sync and async APIs |
| lxml | Fast XML/HTML parsing |
Node.js
| Library | Notes |
|---|---|
| Cheerio | jQuery-like server-side HTML parsing and traversal |
| Puppeteer | Chrome DevTools Protocol automation |
| Playwright | Cross-browser automation |
Other
| Tool | Notes |
|---|---|
| Colly | Go crawling framework |
| Mechanize | Ruby stateful browsing |
| wget / curl | Simple fetch and mirror |
Practices
- Respect `robots.txt` and site Terms of Service
- Rate-limit and identify your bot with a contact User-Agent
- Prefer APIs and official feeds when available
- Cache responses; use incremental crawls for large sites
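
The first two practices above can be sketched with Python's standard library alone. This is a minimal illustration, not a complete crawler: the `USER_AGENT` string, the `SAMPLE_ROBOTS` content, and the helper names `build_robots_parser` and `RateLimiter` are all assumptions made for the example. In a real crawl the `robots.txt` body would be fetched from the target site, and `Crawl-delay` is a non-standard directive that only some sites publish.

```python
import time
import urllib.robotparser

# Hypothetical bot identity; replace with your own name and contact address.
USER_AGENT = "example-bot/0.1 (+mailto:ops@example.com)"

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robots_parser(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous request

    def wait(self) -> None:
        """Sleep until at least min_interval has passed since the last call."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


if __name__ == "__main__":
    robots = build_robots_parser(SAMPLE_ROBOTS)
    # Use the site's Crawl-delay if it publishes one, else fall back to 1 second.
    limiter = RateLimiter(robots.crawl_delay(USER_AGENT) or 1.0)
    for url in ("https://example.com/public/page",
                "https://example.com/private/page"):
        if not robots.can_fetch(USER_AGENT, url):
            print("skipping (disallowed):", url)
            continue
        limiter.wait()
        # A real fetch would go here, sending USER_AGENT as the User-Agent header.
        print("would fetch:", url)
```

Calling `limiter.wait()` before every request keeps the delay logic in one place, so the same limiter can sit in front of whichever HTTP client or browser automation tool from the tables above is in use.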