HanLP Chinese Word Segmentation Plugin for Solr

xcxc, 10 years ago


Built on HanLP. Supports Solr 5.x and is compatible with Lucene 5.x.

Quick Start

Place the two JARs, hanlp-portable.jar and hanlp-solr-plugin.jar, in ${webapp}/WEB-INF/lib.

Edit the Solr core's configuration file ${core}/conf/schema.xml:

<fieldType name="text_cn" class="solr.TextField">
    <analyzer type="index" enableIndexMode="true" class="com.hankcs.lucene.HanLPAnalyzer"/>
    <analyzer type="query" enableIndexMode="true" class="com.hankcs.lucene.HanLPAnalyzer"/>
</fieldType>

Usage

When rewriting queries, you can use the attributes carried by HanLPAnalyzer's segmentation results, such as part of speech. For example:

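import com.hankcs.lucene.HanLPAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

String text = "中华人民共和国很辽阔";
// Print every character followed by its index, to make the offsets below easy to read
for (int i = 0; i < text.length(); ++i)
{
    System.out.print(text.charAt(i) + "" + i + " ");
}
System.out.println();
Analyzer analyzer = new HanLPAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("field", text);
tokenStream.reset();
while (tokenStream.incrementToken())
{
    // Term text
    CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class);
    // Offsets
    OffsetAttribute offsetAtt = tokenStream.getAttribute(OffsetAttribute.class);
    // Position increment
    PositionIncrementAttribute positionAttr = tokenStream.getAttribute(PositionIncrementAttribute.class);
    // Part of speech
    TypeAttribute typeAttr = tokenStream.getAttribute(TypeAttribute.class);
    System.out.printf("[%d:%d %d] %s/%s\n", offsetAtt.startOffset(), offsetAtt.endOffset(), positionAttr.getPositionIncrement(), attribute, typeAttr.type());
}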
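To make the query-rewriting idea a little more concrete, here is a minimal sketch that keeps only the terms whose part-of-speech tag starts with a given prefix. It deliberately does not call HanLP or Lucene: the Token class, the keepByTagPrefix helper, and the sample tags are hypothetical stand-ins for the term and TypeAttribute values read in the loop above, not real analyzer output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PosQueryRewrite {
    /** Hypothetical holder for one token: its text and its part-of-speech tag. */
    static final class Token {
        final String term;
        final String type;

        Token(String term, String type) {
            this.term = term;
            this.type = type;
        }
    }

    /** Returns the terms whose tag starts with the given prefix, joined by spaces. */
    static String keepByTagPrefix(List<Token> tokens, String prefix) {
        List<String> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (t.type != null && t.type.startsWith(prefix)) {
                kept.add(t.term);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Sample values are assumed for illustration only.
        List<Token> tokens = Arrays.asList(
                new Token("中华人民共和国", "ns"),
                new Token("很", "d"),
                new Token("辽阔", "a"));
        // Keep noun-like terms (tags starting with "n") for the rewritten query.
        System.out.println(keepByTagPrefix(tokens, "n"));
    }
}
```

In a real rewrite step, the list would presumably be filled inside the incrementToken() loop from the term and type attributes, and the filtering rule would depend on the tag set your segmenter emits.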

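In other scenarios, you can construct a HanLPTokenizer from a custom segmenter (for example, one with named entity recognition enabled, a Traditional Chinese segmenter, or a CRF segmenter), such as: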

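// Build a tokenizer from a customized segmenter: Japanese name recognition plus index mode
HanLPTokenizer tokenizer = new HanLPTokenizer(HanLP.newSegment()
                                     .enableJapaneseNameRecognize(true)
                                     .enableIndexMode(true), null, false);
tokenizer.setReader(new StringReader("林志玲亮相网友:确定不是波多野结衣？"));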

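Advanced Configuration

The HanLP segmenter is configured mainly through hanlp.properties on the class path. See the documentation of the HanLP natural language processing package for more on the related settings, such as:

Stop words
User dictionaries
Part-of-speech tagging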
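Because the configuration is picked up from the class path, it can help to check which hanlp.properties, if any, is actually visible to your application. The sketch below uses only the JDK; the class name HanLPConfigCheck and its helper methods are hypothetical, and reading the file as UTF-8 is an assumption. It does not depend on any particular HanLP key name and simply lists whatever keys it finds.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class HanLPConfigCheck {
    /** Parses properties-format text; handy for trying out a fragment. */
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    /** Loads the named resource from the class path; returns empty properties if it is absent. */
    static Properties loadFromClassPath(String name) {
        Properties p = new Properties();
        try (InputStream in = HanLPConfigCheck.class.getClassLoader().getResourceAsStream(name)) {
            if (in != null) {
                // UTF-8 is assumed here so that non-ASCII paths survive.
                p.load(new InputStreamReader(in, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        Properties p = loadFromClassPath("hanlp.properties");
        if (p.isEmpty()) {
            System.out.println("hanlp.properties not found on the class path (or it is empty)");
        } else {
            for (String key : p.stringPropertyNames()) {
                System.out.println(key + " = " + p.getProperty(key));
            }
        }
    }
}
```

Run it with the same class path as the Solr web application (for example, including ${webapp}/WEB-INF/classes) so that it sees the same file the plugin would.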

Project home page: http://www.open-open.com/lib/view/home/1440338627749
