Node.js Web Scraper: Node Osmosis
n6xb
10 years ago
Osmosis is a Node.js module for parsing HTML/XML and scraping web content.
Features
- Fast: uses libxml C bindings
- Lightweight: no dependencies like jQuery, cheerio, or jsdom
- Clean: promise-based interface, no more nested callbacks
- Flexible: supports both CSS and XPath selectors
- Predictable: same input, same output, same order
- Detailed logging for every step
- Precise and natural I/O flow: no setTimeout or process.nextTick
- Easy debugging with built-in stack size and memory usage reporting
- Memory leak free
Example: scrape all craigslist listings
    var osmosis = require('osmosis');

    osmosis
    .get('www.craigslist.org/about/sites')
    .find('h1 + div a')
    .set('location')
    .follow('@href')
    .find('header + div + div li > a')
    .set('category')
    .follow('@href')
    .find('p > a', '.totallink + a.button.next:first')
    .follow('@href')
    .set({
        'title':        'section > h2',
        'description':  '#postingbody',
        'subcategory':  'div.breadbox > span[4]',
        'date':         'time@datetime',
        'latitude':     '#map@data-latitude',
        'longitude':    '#map@data-longitude',
        'images[]':     'img@src'
    })
    .data(function(listing) {
        // do something with listing data
    });
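The example leaves the body of the `.data` callback open. As a rough sketch of what "do something with listing data" could look like, the helper below tidies one listing object before it is stored or printed. `normalizeListing` is a hypothetical name and is not part of Osmosis. It assumes that scraped values arrive as strings (or are missing) and that the image URLs end up under either `images` or `images[]`; neither assumption is confirmed by the text above.

```javascript
// Hypothetical helper, not part of Osmosis: tidy one scraped listing.
// Assumption: every scraped value is a string or is missing entirely.
function normalizeListing(listing) {
    // Collapse runs of whitespace; return null for missing values.
    var toText = function (value) {
        return typeof value === 'string'
            ? value.replace(/\s+/g, ' ').trim()
            : null;
    };

    // Parse coordinates; return null when the value is not numeric.
    var toNumber = function (value) {
        var n = parseFloat(value);
        return Number.isNaN(n) ? null : n;
    };

    // Assumption: the `time@datetime` value is a date string Date can parse.
    var parsed = new Date(listing.date);

    // Assumption: the array may be stored under 'images' or 'images[]'.
    var images = listing.images || listing['images[]'];

    return {
        location: toText(listing.location),
        category: toText(listing.category),
        subcategory: toText(listing.subcategory),
        title: toText(listing.title),
        description: toText(listing.description),
        date: Number.isNaN(parsed.getTime()) ? null : parsed.toISOString(),
        latitude: toNumber(listing.latitude),
        longitude: toNumber(listing.longitude),
        images: Array.isArray(images) ? images : []
    };
}

// Usage with a hand-written sample object (made-up values, not real data).
var sample = {
    title: '  Red   bike ',
    date: '2015-06-01T12:30:00Z',
    latitude: '37.7749',
    longitude: 'n/a',
    'images[]': ['a.jpg']
};
console.log(normalizeListing(sample));
```

Inside the scraper, the callback would then reduce to something like `.data(function(listing) { save(normalizeListing(listing)); })`, where `save` stands for whatever storage function you use. Returning `null` for missing fields, rather than throwing, keeps one malformed listing from stopping a long crawl.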