PHP-Spider: A Configurable, Extensible PHP Web Spider
jopen
10 years ago
PHP-Spider is a configurable, extensible web spider written in PHP.
PHP-Spider Features
- supports two traversal algorithms: breadth-first and depth-first
- supports depth limiting and queue size limiting
- supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
- comes with a useful set of URI filters, such as domain limiting
- supports custom URI filters, both prefetch (applied to the URI) and postfetch (applied to the fetched resource content)
- supports custom request handling logic
- comes with a useful set of persistence handlers (memory and file; Redis soon to follow)
- supports custom persistence handlers
- collects statistics about the crawl for reporting
- dispatches useful events, allowing developers to add even more custom behavior
- supports a politeness policy
- will soon come with many default discoverers: RSS, Atom, RDF, etc.
- will soon support multiple queueing mechanisms (file, Memcache, Redis)
- will eventually support distributed spidering with a central queue
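
To make the first few features concrete, here is a minimal sketch of how traversal order, depth limiting, queue size limiting, and a prefetch domain filter can interact in a crawler. It is written in Python on purpose and is not PHP-Spider's API: the names `LINKS`, `crawl`, and `same_domain_filter` are hypothetical, the "web" is a small in-memory link graph instead of real HTTP requests, and the queue limit is assumed to cap the total number of URIs ever enqueued.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": each URI maps to the URIs it links to.
LINKS = {
    "http://example.com/": [
        "http://example.com/a",
        "http://example.com/b",
        "http://other.org/",
    ],
    "http://example.com/a": ["http://example.com/a1"],
    "http://example.com/b": ["http://example.com/b1"],
    "http://example.com/a1": [],
    "http://example.com/b1": [],
    "http://other.org/": ["http://other.org/x"],
}


def same_domain_filter(seed):
    """Build a prefetch filter that returns True for URIs to skip."""
    host = urlparse(seed).netloc
    return lambda uri: urlparse(uri).netloc != host


def crawl(seed, traversal="breadth-first", max_depth=2,
          max_queue_size=100, filters=()):
    """Return the URIs visited, in visiting order.

    Assumption: max_queue_size caps the total number of URIs ever
    enqueued, including the seed.
    """
    frontier = deque([(seed, 0)])
    seen = {seed}
    visited = []
    while frontier:
        # Breadth-first takes the oldest entry (FIFO queue);
        # depth-first takes the newest entry (LIFO stack).
        if traversal == "breadth-first":
            uri, depth = frontier.popleft()
        else:
            uri, depth = frontier.pop()
        visited.append(uri)
        if depth >= max_depth:
            continue  # depth limit: do not follow links any deeper
        for link in LINKS.get(uri, []):
            if link in seen or any(skip(link) for skip in filters):
                continue  # already queued, or rejected by a prefetch filter
            if len(seen) >= max_queue_size:
                break  # queue size limit reached
            seen.add(link)
            frontier.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    seed = "http://example.com/"
    only_seed_domain = [same_domain_filter(seed)]
    print(crawl(seed, "breadth-first", filters=only_seed_domain))
    print(crawl(seed, "depth-first", filters=only_seed_domain))
```

With the domain filter in place, the breadth-first run visits both pages at depth 1 before any page at depth 2, while the depth-first run follows one branch to the depth limit before returning to the other. Neither run ever queues the `other.org` URIs, because the filter rejects them before they are fetched.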