Analyzer (Text Analysis)

Introduction

- Analysis is the process of converting full text into a series of terms (tokens); it is also called tokenization.
- Analysis is performed by an Analyzer. You can use Elasticsearch's built-in analyzers, or define custom ones as needed.
- Terms are not only produced at index time: at search time, the query string must be analyzed with the same analyzer so that query terms match the indexed terms.
Anatomy of an Analyzer

- An analyzer is the component dedicated to text analysis; it consists of three parts:
  - Character Filters: preprocess the raw text, e.g. stripping HTML markup
  - Tokenizer: splits the text into terms according to its rules
  - Token Filters: post-process the terms, e.g. lowercasing, removing terms, adding synonyms
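The three-stage pipeline above can be sketched in miniature. This is an illustrative Python stand-in, not the real implementation: a regex HTML stripper plays the character filter, a word-boundary split plays the tokenizer, and lowercasing is the only token filter.

```python
import html
import re

def char_filter(text):
    # Character filter: strip HTML tags, then unescape entities like &amp;
    return html.unescape(re.sub(r"<[^>]+>", "", text))

def tokenizer(text):
    # Tokenizer: split into word-character runs (rough stand-in for "standard")
    return re.findall(r"\w+", text)

def token_filters(tokens):
    # Token filter: lowercase each term
    return [t.lower() for t in tokens]

def analyze(text):
    # The full pipeline: char filters -> tokenizer -> token filters
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<b>Quick&amp;Brown</b> Foxes"))
# ['quick', 'brown', 'foxes']
```

The ordering matters: character filters always run before the tokenizer, and token filters always run after it, which is exactly how Elasticsearch chains them in a custom analyzer definition.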
The _analyze API

- Test by specifying an analyzer directly:

POST /_analyze
{
  "analyzer": "standard",
  "text": "Mastering is last"
}

- Test against a specific field of an index:

# inspect the tokens produced for the field
POST books/_analyze
{
  "field": "title",
  "text": "Mastering is ElasticSearch"
}

- Test an ad-hoc combination of tokenizer and token filters:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Mastering Elasticsearch"
}
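The _analyze API responds with a list of token objects. A minimal sketch of pulling the terms out of such a response; the JSON below is hand-written in the shape Elasticsearch returns (no live cluster is assumed, so the values are illustrative):

```python
import json

# Sample response in the shape the _analyze API returns for the
# standard tokenizer + lowercase filter request above (illustrative,
# not captured from a live cluster).
response = json.loads("""
{
  "tokens": [
    {"token": "mastering", "start_offset": 0, "end_offset": 9,
     "type": "<ALPHANUM>", "position": 0},
    {"token": "elasticsearch", "start_offset": 10, "end_offset": 23,
     "type": "<ALPHANUM>", "position": 1}
  ]
}
""")

# Each entry carries the term text, character offsets, type, and position.
terms = [t["token"] for t in response["tokens"]]
print(terms)
# ['mastering', 'elasticsearch']
```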
Examples of common analyzers

- standard: the default analyzer; splits text on word boundaries and lowercases the terms

POST _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
- Simple Analyzer: splits on any non-letter character (the separators are discarded) and lowercases the terms

POST _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

- Whitespace Analyzer: splits on whitespace only; terms are not lowercased

POST _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
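The difference between the two requests above is easy to see side by side. A toy Python sketch of the two behaviors (approximations for illustration, not the real analyzer code):

```python
import re

SENTENCE = "2 running Quick brown-foxes leap over lazy dogs"

def simple_analyze(text):
    # Simple analyzer: split on anything that is not a letter, lowercase.
    # Digits and hyphens act as separators and are discarded.
    return [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]

def whitespace_analyze(text):
    # Whitespace analyzer: split on whitespace only; case is preserved.
    return text.split()

print(simple_analyze(SENTENCE))
# ['running', 'quick', 'brown', 'foxes', 'leap', 'over', 'lazy', 'dogs']
print(whitespace_analyze(SENTENCE))
# ['2', 'running', 'Quick', 'brown-foxes', 'leap', 'over', 'lazy', 'dogs']
```

Note that simple drops the "2" entirely and splits "brown-foxes" in two, while whitespace keeps both intact and preserves "Quick" with its capital letter.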
- Keyword Analyzer: no tokenization; the whole input is emitted as a single term
- Pattern Analyzer: splits on a regular expression, by default \W+ (one or more non-word characters)
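The pattern analyzer's default behavior can be approximated with Python's `re` module; a sketch of the default `\W+` split plus the lowercasing the analyzer applies:

```python
import re

def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    # Approximate the pattern analyzer: split on the regex (matches are
    # treated as separators), drop empty strings, lowercase by default.
    tokens = [t for t in re.split(pattern, text) if t]
    return [t.lower() for t in tokens] if lowercase else tokens

print(pattern_analyze("brown-foxes leap over LAZY dogs"))
# ['brown', 'foxes', 'leap', 'over', 'lazy', 'dogs']
```

Because `\W+` treats every non-word character as a separator, "brown-foxes" is split at the hyphen, just as with the simple analyzer; a custom pattern can change that.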
- Language analyzers: built-in analyzers for 30+ languages

# english
GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
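What makes the english analyzer different from standard is language-specific processing: stopword removal and stemming, so "running" and "foxes" index as "run" and "fox". A deliberately naive toy below illustrates the idea only; the real analyzer uses a proper Porter-style stemmer and a full English stopword list, neither of which this sketch reproduces:

```python
# Tiny illustrative stopword list (the real one is much longer).
STOPWORDS = {"in", "the", "over"}

def naive_english_analyze(text):
    # Toy illustration of the english analyzer's idea:
    # lowercase, drop stopwords, strip a few common suffixes.
    # NOT a real stemmer -- real stemming uses Porter-style rules.
    out = []
    for tok in text.lower().split():
        tok = tok.strip(".,")
        if tok in STOPWORDS:
            continue
        for suffix in ("ning", "ing", "es", "s"):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        out.append(tok)
    return out

print(naive_english_analyze("running Quick foxes leap over lazy dogs"))
# ['run', 'quick', 'fox', 'leap', 'lazy', 'dog']
```

The payoff at search time is recall: a query for "fox" matches documents that contained "foxes", because both sides reduce to the same stem.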
- Chinese tokenization with ICU requires installing the analysis-icu plugin (the plugin version must match the Elasticsearch version)

POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实在理"
}

- IK: a widely used third-party Chinese analyzer plugin