Spidr: A Web Crawler Written in Ruby

jopen, 12 years ago

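Spidr is a versatile web crawling library for Ruby. It can crawl a single site, multiple domains, or only certain links. Spidr is designed to be fast and easy to use.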
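To make the idea of crawling "one site, several domains, or only certain links" concrete, here is a minimal sketch of a scope check written with the Ruby standard library only. It is not Spidr's code or API: the `ScopeFilter` class, its keyword arguments, and the example hosts are all assumptions made up for illustration.

```ruby
require 'uri'

# Hypothetical helper (not part of Spidr): decides whether a URL is in scope.
class ScopeFilter
  def initialize(hosts: [], ignore_links: [], ignore_exts: [])
    @hosts = hosts               # allowed hosts; empty means "any host"
    @ignore_links = ignore_links # full links to skip
    @ignore_exts = ignore_exts   # file extensions to skip, without the dot
  end

  # A String rule must equal the value; a Regexp rule must match it.
  def match?(rules, value)
    rules.any? { |rule| rule.is_a?(Regexp) ? rule.match?(value) : rule == value }
  end

  def visit?(url)
    uri = URI.parse(url)
    return false unless %w[http https].include?(uri.scheme)
    return false unless @hosts.empty? || match?(@hosts, uri.host)
    return false if match?(@ignore_links, url)

    ext = File.extname(uri.path.to_s).delete_prefix('.')
    return false if match?(@ignore_exts, ext)

    true
  rescue URI::InvalidURIError
    false # unparseable links are treated as out of scope
  end
end

filter = ScopeFilter.new(
  hosts: ['example.com', /\.example\.org\z/],
  ignore_links: [%r{/private/}],
  ignore_exts: ['pdf']
)
puts filter.visit?('http://example.com/index.html')
puts filter.visit?('http://example.com/report.pdf')
```

Accepting both strings and regular expressions as rules keeps the single-host case simple while still allowing whole families of hosts or links to be matched.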

Features:

  • Follows:
    • a tags.
    • iframe tags.
    • frame tags.
    • Cookie protected links.
    • HTTP 300, 301, 302, 303 and 307 Redirects.
    • HTTP Basic Auth protected links.
  • Black-list or white-list URLs based upon:
    • URL scheme
    • Host name
    • Port number
    • Full link
    • URL extension
  • Provides call-backs for:
    • Every visited Page.
    • Every visited URL.
    • Every visited URL that matches a specified pattern.
    • Every URL that failed to be visited.
  • Provides action methods to:
    • Pause spidering.
    • Skip processing of pages.
    • Skip processing of links.
  • Restore the spidering queue and history from a previous session.
  • Custom User-Agent strings.
  • Custom proxy settings.
  • HTTPS support.
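The call-backs, the pause action, and the ability to restore a queue and history from an earlier session can be illustrated with a toy crawler that walks an in-memory link graph instead of the network. This is a sketch under stated assumptions, not Spidr's implementation: the `ToySpider` class, its method names, and the example URLs are hypothetical and chosen only to echo the feature list above.

```ruby
require 'set'

# Hypothetical toy crawler (not Spidr). Pages come from a Hash that maps
# each URL to the links found on it; a missing key simulates a failed visit.
class ToySpider
  attr_reader :queue, :history, :failures

  def initialize(pages, queue: [], history: [])
    @pages = pages
    @queue = queue.dup             # URLs still waiting to be visited
    @history = Set.new(history)    # URLs already attempted
    @failures = []
    @paused = false
    @url_callbacks = []
    @pattern_callbacks = []
    @failed_callbacks = []
  end

  def every_url(&block)
    @url_callbacks << block
  end

  def every_url_like(pattern, &block)
    @pattern_callbacks << [pattern, block]
  end

  def every_failed_url(&block)
    @failed_callbacks << block
  end

  def pause!
    @paused = true
  end

  def run(start = nil)
    @paused = false
    @queue << start if start
    until @queue.empty? || @paused
      url = @queue.shift
      next if @history.include?(url)

      @history << url
      links = @pages[url]
      if links.nil?
        @failures << url
        @failed_callbacks.each { |cb| cb.call(url) }
        next
      end
      @url_callbacks.each { |cb| cb.call(url) }
      @pattern_callbacks.each { |pattern, cb| cb.call(url) if pattern.match?(url) }
      @queue.concat(links.reject { |link| @history.include?(link) })
    end
    self
  end
end

pages = {
  'http://example.com/' => ['http://example.com/a', 'http://example.com/missing'],
  'http://example.com/a' => ['http://example.com/', 'http://example.com/b'],
  'http://example.com/b' => []
}

spider = ToySpider.new(pages)
spider.every_url { |url| spider.pause! if url.end_with?('/a') }
spider.run('http://example.com/')

# Later session: rebuild the crawler from the saved queue and history.
resumed = ToySpider.new(pages, queue: spider.queue, history: spider.history)
resumed.every_failed_url { |url| puts "failed: #{url}" }
resumed.run
```

Keeping the queue and history as plain collections is what makes the "restore" step cheap in this sketch: they can be handed to a new object, or written to disk and read back, without any other crawler state.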

Project homepage: http://www.open-open.com/lib/view/home/1349945908431