Spidr: A Web Crawler Written in Ruby

jopen, 12 years ago

Spidr is a versatile Ruby web spidering library. It can spider a single site, multiple domains, or specific links. Spidr is designed to be fast and easy to use.

- Follows:
  - a tags.
  - iframe tags.
  - frame tags.
  - Cookie protected links.
  - HTTP 300, 301, 302, 303 and 307 Redirects.
  - HTTP Basic Auth protected links.
- Black-list or white-list URLs based upon:
  - URL scheme
  - Host name
  - Port number
  - Full link
  - URL extension
- Provides call-backs for:
  - Every visited Page.
  - Every visited URL.
  - Every visited URL that matches a specified pattern.
  - Every URL that failed to be visited.
- Provides action methods to:
  - Pause spidering.
  - Skip processing of pages.
  - Skip processing of links.
- Restore the spidering queue and history from a previous session.
- Custom User-Agent strings.
- Custom proxy settings.
- HTTPS support.
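
To make a few of the features above concrete, here is a minimal sketch of how host white-listing, extension black-listing, per-URL call-backs, and the pause/skip actions might fit together. It uses only the Ruby standard library and crawls an in-memory hash instead of the network. The `ToySpider` class, its method names, and the example URLs are all assumptions made for illustration; this is not Spidr's implementation and should not be read as its API.

```ruby
require 'uri'
require 'set'

# Hypothetical toy spider over an in-memory "site" (illustration only).
class ToySpider
  attr_reader :queue, :history, :failures

  # pages: Hash mapping a URL string to the Array of URLs it links to.
  # hosts: optional white-list of host names (nil allows every host).
  # ignore_exts: black-list of URL extensions, without the leading dot.
  def initialize(pages, hosts: nil, ignore_exts: [])
    @pages       = pages
    @hosts       = hosts
    @ignore_exts = ignore_exts
    @queue       = []       # URLs waiting to be visited
    @history     = Set.new  # URLs already taken off the queue
    @failures    = []       # URLs that could not be "fetched"
    @paused      = false
    @on_url      = []
    @on_failed   = []
  end

  # Register a call-back for every visited URL.
  def every_url(&block)
    @on_url << block
    self
  end

  # Register a call-back for every URL that failed to be visited.
  def every_failed_url(&block)
    @on_failed << block
    self
  end

  # Action: stop after the current URL; calling run again resumes.
  def pause!
    @paused = true
  end

  # Action: abandon the URL currently being processed.
  def skip_link!
    throw :skip_link
  end

  # Apply the scheme, host, and extension rules to a URL.
  def allowed?(url)
    uri = URI.parse(url)
    return false unless %w[http https].include?(uri.scheme)
    return false if @hosts && !@hosts.include?(uri.host)
    ext = File.extname(uri.path.to_s).delete_prefix('.')
    !@ignore_exts.include?(ext)
  rescue URI::InvalidURIError
    false
  end

  def enqueue(url)
    return if @history.include?(url) || @queue.include?(url) || !allowed?(url)
    @queue << url
  end

  # Visit queued URLs breadth-first until the queue is empty or paused.
  def run(start = nil)
    @paused = false
    enqueue(start) if start
    until @queue.empty? || @paused
      url = @queue.shift
      @history << url
      catch(:skip_link) do
        @on_url.each { |callback| callback.call(url) }
        links = @pages[url]
        if links.nil?
          # A URL missing from the hash stands in for a failed request.
          @failures << url
          @on_failed.each { |callback| callback.call(url) }
        else
          links.each { |link| enqueue(link) }
        end
      end
    end
    self
  end
end

# Usage: spider a small fake site, restricted to one host, ignoring PNG files.
pages = {
  'http://example.com/'      => ['http://example.com/about', 'http://example.com/logo.png',
                                 'http://other.test/', 'http://example.com/missing'],
  'http://example.com/about' => ['http://example.com/']
}
spider = ToySpider.new(pages, hosts: ['example.com'], ignore_exts: ['png'])
spider.every_url { |url| puts "visiting #{url}" }
spider.every_failed_url { |url| puts "failed   #{url}" }
spider.run('http://example.com/')
```

Because the queue and history live on the object, a paused run can be resumed by calling `run` again without arguments. Persisting those two collections between processes would be one way to approach the "restore from a previous session" feature, though how Spidr itself does this is not covered here.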