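Elasticsearch Analysis Series: Chinese Word Segmentation
Source: http://my.oschina.net/secisland/blog/617822?fromerr=qlrJk7Di
Elasticsearch ships with many built-in analyzers, but the default ones handle Chinese poorly, so a separate plugin is needed. The two most commonly used options are smartcn, which is based on ICTCLAS from the Chinese Academy of Sciences, and IKAnalyzer; both give reasonably good results. At the time of writing, IKAnalyzer does not yet support the latest Elasticsearch release, 2.2.0. smartcn, by contrast, is officially supported, provides an analyzer for Chinese or mixed Chinese-English text, and already works with 2.2.0. It does not support custom dictionaries, however, so it is best treated as something to try out first. The later part of this article shows how to make IKAnalyzer work with the latest version.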
smartcn
Install the analyzer: plugin install analysis-smartcn
Uninstall: plugin remove analysis-smartcn
Test:
Request: POST http://127.0.0.1:9200/_analyze/
{ "analyzer": "smartcn", "text": "联想是全球最大的笔记本厂商" }
Response:
{ "tokens": [ { "token": "联想", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 }, { "token": "是", "start_offset": 2, "end_offset": 3, "type": "word", "position": 1 }, { "token": "全球", "start_offset": 3, "end_offset": 5, "type": "word", "position": 2 }, { "token": "最", "start_offset": 5, "end_offset": 6, "type": "word", "position": 3 }, { "token": "大", "start_offset": 6, "end_offset": 7, "type": "word", "position": 4 }, { "token": "的", "start_offset": 7, "end_offset": 8, "type": "word", "position": 5 }, { "token": "笔记本", "start_offset": 8, "end_offset": 11, "type": "word", "position": 6 }, { "token": "厂商", "start_offset": 11, "end_offset": 13, "type": "word", "position": 7 } ] }
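For readers who prefer a script to a REST client, here is a minimal Python sketch of the same request. It assumes a node listening on http://127.0.0.1:9200 with the smartcn plugin installed. The function names build_analyze_request and analyze are made up for this illustration and are not part of any Elasticsearch client library.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://127.0.0.1:9200"):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text, base_url="http://127.0.0.1:9200"):
    """Send the request and return the token strings from the response."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return [entry["token"] for entry in payload["tokens"]]
```

With a node running, analyze("smartcn", "联想是全球最大的笔记本厂商") should return the token strings from the response above. Building the request separately from sending it makes the body easy to inspect without a live cluster.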
For comparison, look at the result of the standard analyzer: replace smartcn with standard in the request.
The response is then:
{ "tokens": [ { "token": "联", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 }, { "token": "想", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 }, { "token": "是", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 }, { "token": "全", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 }, { "token": "球", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 }, { "token": "最", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 }, { "token": "大", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 }, { "token": "的", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7 }, { "token": "笔", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 8 }, { "token": "记", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 9 }, { "token": "本", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 10 }, { "token": "厂", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 11 }, { "token": "商", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 12 } ] }
As the response shows, the standard analyzer is essentially unusable for Chinese: every single character becomes its own token.
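The difference between the two responses can also be put into numbers. The sketch below is only an illustration: the token strings are copied from the two responses above, while the helper names token_texts and single_char_ratio are invented here.

```python
def token_texts(response):
    """Return the token strings of an _analyze response, in order."""
    return [entry["token"] for entry in response["tokens"]]


def single_char_ratio(tokens):
    """Fraction of tokens that are exactly one character long."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if len(token) == 1) / len(tokens)


# Token strings copied from the smartcn and standard responses above;
# the other response fields are left out for brevity.
smartcn_response = {"tokens": [{"token": t} for t in
                               ["联想", "是", "全球", "最", "大", "的", "笔记本", "厂商"]]}
standard_response = {"tokens": [{"token": t} for t in "联想是全球最大的笔记本厂商"]}

smartcn_tokens = token_texts(smartcn_response)
standard_tokens = token_texts(standard_response)

print(len(smartcn_tokens), single_char_ratio(smartcn_tokens))    # 8 0.5
print(len(standard_tokens), single_char_ratio(standard_tokens))  # 13 1.0
```

smartcn yields 8 tokens, half of them multi-character words, while standard yields 13 tokens that are all single characters.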
This article is an original work by 赛克蓝德 (secisland). Please credit the author and source when reposting.
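IKAnalyzer support for version 2.2.0
The latest version on GitHub currently supports only Elasticsearch 2.1.1; the repository is at https://github.com/medcl/elasticsearch-analysis-ik. Since the latest Elasticsearch release is already 2.2.0, the plugin needs a small change before it will work.
1. Download the source code and extract it to any directory, then edit the pom.xml file in the elasticsearch-analysis-ik-master directory: find the <elasticsearch.version> line and change the version number to 2.2.0.
2. Build the code with mvn package.
3. When the build finishes, the file elasticsearch-analysis-ik-1.7.0.zip is generated under target\releases.
4. Extract that file into the Elasticsearch/plugins directory.
5. Add one line to the Elasticsearch configuration file: index.analysis.analyzer.ik.type : "ik"
6. Restart Elasticsearch.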
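Step 1 above can be done in any text editor, but a small script makes it repeatable. This is only a sketch. The pom.xml fragment below is a made-up sample, not the real file from the repository, and set_elasticsearch_version is a hypothetical helper name.

```python
import re


def set_elasticsearch_version(pom_text, version):
    """Replace the value inside <elasticsearch.version>...</elasticsearch.version>."""
    pattern = re.compile(
        r"(<elasticsearch\.version>)[^<]*(</elasticsearch\.version>)"
    )
    # A function replacement avoids backreference clashes with digits in `version`.
    new_text, count = pattern.subn(
        lambda match: match.group(1) + version + match.group(2), pom_text
    )
    if count == 0:
        raise ValueError("no <elasticsearch.version> element found")
    return new_text


# Hypothetical fragment for illustration; the real pom.xml has many more entries.
sample_pom = """<properties>
    <elasticsearch.version>2.1.1</elasticsearch.version>
</properties>"""

print(set_elasticsearch_version(sample_pom, "2.2.0"))
```

To apply it to the real file, read pom.xml as text, pass it through the function, and write the result back before running mvn package.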
Test: send the same request as above, but with the analyzer set to ik.
Response:
{ "tokens": [ { "token": "联想", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 }, { "token": "全球", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 1 }, { "token": "最大", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 2 }, { "token": "笔记本", "start_offset": 8, "end_offset": 11, "type": "CN_WORD", "position": 3 }, { "token": "笔记", "start_offset": 8, "end_offset": 10, "type": "CN_WORD", "position": 4 }, { "token": "笔", "start_offset": 8, "end_offset": 9, "type": "CN_WORD", "position": 5 }, { "token": "记", "start_offset": 9, "end_offset": 10, "type": "CN_CHAR", "position": 6 }, { "token": "本厂", "start_offset": 10, "end_offset": 12, "type": "CN_WORD", "position": 7 }, { "token": "厂商", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 8 } ] }
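As the output shows, the two analyzers segment the same sentence differently: ik drops 是 and 的, keeps 最大 as one word, and emits overlapping tokens such as 笔记本, 笔记, and 本厂, whereas smartcn produces a single non-overlapping sequence.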
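One way to see how ik differs from smartcn is to check which tokens cover the same characters. The following sketch uses the start_offset and end_offset values copied from the ik response above. The helper name overlapping_pairs is an assumption made for this example.

```python
def overlapping_pairs(tokens):
    """Return pairs of token strings whose character ranges overlap."""
    pairs = []
    for index, first in enumerate(tokens):
        for second in tokens[index + 1:]:
            # Half-open ranges [start, end) overlap when each starts before the other ends.
            if (first["start_offset"] < second["end_offset"]
                    and second["start_offset"] < first["end_offset"]):
                pairs.append((first["token"], second["token"]))
    return pairs


# Tokens and offsets copied from the ik response above.
ik_tokens = [
    {"token": "联想", "start_offset": 0, "end_offset": 2},
    {"token": "全球", "start_offset": 3, "end_offset": 5},
    {"token": "最大", "start_offset": 5, "end_offset": 7},
    {"token": "笔记本", "start_offset": 8, "end_offset": 11},
    {"token": "笔记", "start_offset": 8, "end_offset": 10},
    {"token": "笔", "start_offset": 8, "end_offset": 9},
    {"token": "记", "start_offset": 9, "end_offset": 10},
    {"token": "本厂", "start_offset": 10, "end_offset": 12},
    {"token": "厂商", "start_offset": 11, "end_offset": 13},
]

for first, second in overlapping_pairs(ik_tokens):
    print(first, "overlaps", second)
```

Running the same check on the smartcn tokens finds no overlaps, because each character there belongs to exactly one token.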
To extend the dictionary, add the words you need to mydict.dic under config\ik\custom and then restart Elasticsearch. Note that the file must be encoded as UTF-8 without a BOM.
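Some editors silently add a BOM when saving, so the encoding requirement is easy to get wrong. The sketch below shows one way to append words while guaranteeing BOM-free UTF-8. It assumes the dictionary holds one word per line, and add_words is a hypothetical helper, not part of the ik plugin.

```python
import codecs
from pathlib import Path


def add_words(dict_path, words):
    """Append words to a dictionary file, one per line, as UTF-8 without a BOM."""
    path = Path(dict_path)
    raw = path.read_bytes() if path.exists() else b""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]  # drop a BOM left by another editor
    lines = raw.decode("utf-8").splitlines()
    existing = set(lines)
    for word in words:
        word = word.strip()
        if word and word not in existing:  # skip blanks and duplicates
            lines.append(word)
            existing.add(word)
    # Encoding as "utf-8" (not "utf-8-sig") writes no BOM.
    path.write_bytes("".join(line + "\n" for line in lines).encode("utf-8"))
```

For instance, add_words("config/ik/custom/mydict.dic", ["赛克蓝德"]) would add the word used in the next test; adjust the path to your own installation, and restart Elasticsearch afterwards as described above.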
For example, after adding the word 赛克蓝德, run the query again:
Request: POST http://127.0.0.1:9200/_analyze/
Parameters:
{ "analyzer": "ik", "text": "赛克蓝德是一家数据安全公司" }
Response:
{ "tokens": [ { "token": "赛克蓝德", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0 }, { "token": "克", "start_offset": 1, "end_offset": 2, "type": "CN_WORD", "position": 1 }, { "token": "蓝", "start_offset": 2, "end_offset": 3, "type": "CN_WORD", "position": 2 }, { "token": "德", "start_offset": 3, "end_offset": 4, "type": "CN_CHAR", "position": 3 }, { "token": "一家", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 4 }, { "token": "一", "start_offset": 5, "end_offset": 6, "type": "TYPE_CNUM", "position": 5 }, { "token": "家", "start_offset": 6, "end_offset": 7, "type": "COUNT", "position": 6 }, { "token": "数据", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 7 }, { "token": "安全", "start_offset": 9, "end_offset": 11, "type": "CN_WORD", "position": 8 }, { "token": "公司", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 9 } ] }
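The result shows that 赛克蓝德 is now recognized as a single word.
赛克蓝德 (secisland) will continue to analyze the features of the latest Elasticsearch releases step by step, so stay tuned. You are also welcome to follow the secisland official account.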