PHP-Spider: a configurable and extensible PHP web spider

jopen, 10 years ago

PHP-Spider is a configurable and extensible web spider written in PHP.

PHP-Spider Features

  • supports two traversal algorithms: breadth-first and depth-first
  • supports depth limiting and queue size limiting
  • supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
  • comes with a useful set of URI filters, such as domain limiting
  • supports custom URI filters, both prefetch (URI) and postfetch (Resource content)
  • supports custom request handling logic
  • comes with a useful set of persistence handlers (memory and file, with Redis soon to follow)
  • supports custom persistence handlers
  • collects statistics about the crawl for reporting
  • dispatches useful events, allowing developers to add even more custom behavior
  • supports a politeness policy
  • will soon come with many default discoverers: RSS, Atom, RDF, etc.
  • will soon support multiple queueing mechanisms (file, memcache, redis)
  • will eventually support distributed spidering with a central queue
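
The traversal, depth-limiting, queue-size-limiting, and prefetch-filter features listed above can be illustrated with a small language-neutral sketch. It is written in Python against an in-memory link graph and does not use PHP-Spider's actual classes or API: the function name `crawl`, its parameters, and the `LINKS` data are all hypothetical and exist only for illustration.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seed, links, mode="bfs", max_depth=2, max_queue=100, allowed_host=None):
    """Traverse an in-memory link graph and return URIs in visit order.

    Hypothetical helper for illustration; not part of PHP-Spider.
    """
    frontier = deque([(seed, 0)])
    seen = {seed}
    visited = []
    while frontier:
        # Breadth-first takes the oldest entry, depth-first the newest.
        uri, depth = frontier.popleft() if mode == "bfs" else frontier.pop()
        visited.append(uri)
        if depth >= max_depth:
            continue  # depth limit: do not expand this page's links
        for child in links.get(uri, []):
            if child in seen:
                continue
            # Prefetch filter: reject URIs outside the allowed host.
            if allowed_host and urlparse(child).hostname != allowed_host:
                continue
            if len(frontier) >= max_queue:
                break  # queue size limit: drop the remaining discoveries
            seen.add(child)
            frontier.append((child, depth + 1))
    return visited


# Made-up link graph standing in for fetched pages and their discovered URIs.
LINKS = {
    "http://example.com/": [
        "http://example.com/a",
        "http://example.com/b",
        "http://other.org/x",
    ],
    "http://example.com/a": ["http://example.com/a1"],
    "http://example.com/b": ["http://example.com/b1"],
}

bfs_order = crawl("http://example.com/", LINKS, mode="bfs", allowed_host="example.com")
dfs_order = crawl("http://example.com/", LINKS, mode="dfs", allowed_host="example.com")
```

In this sketch the only difference between the two traversal modes is which end of the frontier is consumed, so the depth limit, queue limit, and host filter apply unchanged to both. A real spider would replace the `links` lookup with an HTTP fetch followed by URI discovery.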

Project homepage: http://www.open-open.com/lib/view/home/1399025018796