Open-Source Search Engine Nutch 1.9 Released
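Nutch is an open-source search engine package written in Java. It provides all the tools and functionality needed to build a search engine. With Nutch you can build a search engine for your own intranet, and you can also build one that covers the entire web. Beyond this basic functionality, Nutch has a number of distinguishing features of its own, such as MapReduce, Hadoop, and plugins.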
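Overall, Nutch is divided into three main parts: crawling, indexing, and searching. The Web db holds the set of URLs that Nutch starts from; the Fetcher is the component that downloads web pages, which is what is usually called the crawler; the indexer builds the index, generating index files and storing them on the system; the searcher is the query component, which looks up a given term and returns the results.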
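The division of labor described above can be illustrated with a toy sketch. This is not Nutch code and does not use any Nutch API: the class `MiniPipeline`, its methods, and the in-memory "web" map are all hypothetical names chosen for illustration, and the fetcher reads from a map instead of the network.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of the crawl / index / search split; not Nutch's implementation.
public class MiniPipeline {

    // "Fetcher": returns the content of every seed URL that exists in the fake web.
    // A real crawler would issue HTTP requests here.
    static Map<String, String> fetch(List<String> seedUrls, Map<String, String> fakeWeb) {
        Map<String, String> pages = new LinkedHashMap<>();
        for (String url : seedUrls) {
            if (fakeWeb.containsKey(url)) {
                pages.put(url, fakeWeb.get(url));
            }
        }
        return pages;
    }

    // "Indexer": builds an inverted index mapping each lowercased term to the URLs containing it.
    static Map<String, Set<String>> index(Map<String, String> pages) {
        Map<String, Set<String>> inverted = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            for (String term : page.getValue().toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    inverted.computeIfAbsent(term, k -> new TreeSet<>()).add(page.getKey());
                }
            }
        }
        return inverted;
    }

    // "Searcher": looks up a single term and returns the matching URLs (empty if none).
    static Set<String> search(Map<String, Set<String>> inverted, String term) {
        return inverted.getOrDefault(term.toLowerCase(), Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        // Stand-in for the Web db: the initial set of URLs (made-up example addresses).
        List<String> webDb = Arrays.asList("http://a.example/", "http://b.example/");

        Map<String, String> fakeWeb = new HashMap<>();
        fakeWeb.put("http://a.example/", "Nutch is an open-source search engine");
        fakeWeb.put("http://b.example/", "Hadoop runs MapReduce jobs");

        Map<String, Set<String>> inverted = index(fetch(webDb, fakeWeb));
        System.out.println(search(inverted, "nutch"));
    }
}
```

The sketch keeps each stage as a separate function so that the output of one stage is the input of the next, mirroring the Web db, Fetcher, indexer, searcher sequence in the text. In Nutch itself these stages run as separate jobs over persistent data rather than over in-memory maps.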
Apache Nutch 1.9 was released recently. The main changes include:
Improvements
[NUTCH-1502] - Test for CrawlDatum state transitions
[NUTCH-1561] - improve usability of parse-metatags and index-metadata
[NUTCH-1676] - Add rudimentary SSL support to protocol-http
[NUTCH-1745] - Upgrade to ElasticSearch 1.1.0
[NUTCH-1747] - Use AtomicInteger as semaphore in Fetcher
[NUTCH-1757] - ParserChecker to take custom metadata as input
[NUTCH-1758] - IndexChecker to send document to IndexWriters
[NUTCH-1772] - Injector does not need merging if no pre-existing crawldb
[NUTCH-1782] - NodeWalker to return current node
[NUTCH-1787] - update and complete API doc overview page
[NUTCH-1794] - IndexingFilterChecker to optionally dumpText
[NUTCH-1799] - ANT Eclipse task discovers all plugin jars automatically
New Features
[NUTCH-207] - Bandwidth target for fetcher rather than a thread count
[NUTCH-1327] - QueryStringNormalizer
[NUTCH-1590] - [SECURITY] Frame injection vulnerability in published Javadoc