Analyzers
Introduction
- Analysis is the process of converting full text into a series of terms (tokens), also known as tokenization.
- Analysis is performed by an Analyzer; you can use Elasticsearch's built-in analyzers or define a custom one as needed.
- Besides converting terms when documents are indexed, the same analyzer must also be applied to the query string at search time.
Components of an Analyzer
- An analyzer is the component dedicated to tokenization; it consists of three parts:
  - Character Filter: preprocesses the raw text, e.g. stripping HTML tags
  - Tokenizer: splits the text into terms according to rules
  - Token Filter: post-processes the terms: lowercasing, removing terms, adding synonyms
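The three stages above can be sketched as a plain-Python pipeline. This is a minimal illustration of the flow, not Elasticsearch's actual implementation; the HTML-stripping regex and the filter choices are simplified assumptions:

```python
import re

def char_filter(text):
    # Character Filter: strip HTML tags from the raw text
    return re.sub(r"<[^>]+>", " ", text)

def tokenizer(text):
    # Tokenizer: split the text into words
    return re.findall(r"\w+", text)

def token_filters(tokens):
    # Token Filter: lowercase each term (real filters can also drop stopwords, add synonyms)
    return [t.lower() for t in tokens]

def analyze(text):
    # Run the three stages in order: char filter -> tokenizer -> token filters
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<b>Mastering</b> Elasticsearch"))  # ['mastering', 'elasticsearch']
```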
The _analyze API
- Test by specifying an analyzer directly
POST /_analyze
{
  "analyze":"standard",
  "text":"Mastering is last"
}
- Test against a field of a specific index
# view all the resulting tokens
POST books/_analyze
{
  "field":"title",
  "text":"Mastering is ElasticSearch"
}
- Test a custom combination of a tokenizer and token filters
POST _analyze
{
  "tokenizer":"standard", 
  "filter":["lowercase"],
  "text":"Mastering Elasticsearch"
}
Examples of common analyzers
- standard – the default analyzer; splits the text on words and lowercases the terms
POST _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
- Simple Analyzer – splits on non-letter characters (symbols are filtered out) and lowercases; all non-letter characters are removed
POST _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
- Whitespace Analyzer – splits on whitespace only; terms keep their original case
POST _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
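The difference between the simple and whitespace analyzers can be mimicked in a few lines of Python. This is a rough approximation of the built-in behavior, not the real implementation:

```python
import re

text = "2 running Quick brown-foxes leap"

# simple: split on anything that is not a letter, then lowercase ("2" is dropped)
simple_tokens = [t.lower() for t in re.findall(r"[a-zA-Z]+", text)]
print(simple_tokens)      # ['running', 'quick', 'brown', 'foxes', 'leap']

# whitespace: split on whitespace only; case and punctuation are kept
whitespace_tokens = text.split()
print(whitespace_tokens)  # ['2', 'running', 'Quick', 'brown-foxes', 'leap']
```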
- Keyword Analyzer – performs no tokenization; the input is emitted as a single term
- Pattern Analyzer – splits on a regular expression, \W+ (non-word characters) by default
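The default \W+ behavior can be illustrated with Python's re module; re.split on the same pattern approximates what the Pattern Analyzer does (the lowercasing step mirrors the analyzer's default and is an assumption of this sketch):

```python
import re

text = "brown-foxes leap, over LAZY dogs"

# split on runs of non-word characters (\W+), drop empty strings, then lowercase
tokens = [t.lower() for t in re.split(r"\W+", text) if t]
print(tokens)  # ['brown', 'foxes', 'leap', 'over', 'lazy', 'dogs']
```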
- Language analyzers – built-in analyzers for 30+ common languages
#english
GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
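On top of tokenization and lowercasing, a language analyzer typically removes stopwords and stems terms (e.g. "foxes" becomes "fox"). A deliberately naive sketch of that idea follows; the stopword list and suffix rules are toy assumptions, whereas the real english analyzer uses a proper stemming algorithm:

```python
import re

# toy stopword list (assumption; the real english analyzer has its own list)
STOPWORDS = {"in", "the", "over", "is", "a", "an"}

def crude_stem(token):
    # toy stemmer: strip a few common English suffixes
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def english_like_analyze(text):
    tokens = [t.lower() for t in re.findall(r"[a-zA-Z0-9]+", text)]
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(english_like_analyze("2 running Quick brown-foxes leap over lazy dogs"))
```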
- Chinese tokenization with ICU requires installing the analysis-icu plugin (the plugin version must match the Elasticsearch version)
POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实在理”"
}
- IK – another widely used Chinese tokenization plugin (analysis-ik), which provides the ik_smart and ik_max_word analyzers