Analyzers¶
Introduction¶
- Analysis is the process of converting full text into a series of terms (tokens), also known as tokenization.
- Analysis is performed by an Analyzer; you can use Elasticsearch's built-in analyzers or define custom ones as needed.
- Besides converting text into terms at index time, match queries run the query string through the same analyzer.
Components of an Analyzer¶
- An analyzer is the component dedicated to text analysis; it is made up of three parts, applied in order:
Character Filters – preprocess the raw text, e.g. strip HTML tags
Tokenizer – split the text into individual terms according to rules
Token Filters – post-process the terms: lowercase, remove stopwords, add synonyms, etc.
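The three stages above can be sketched as a tiny pipeline. This is a conceptual approximation in Python, not how Elasticsearch implements it; the names `char_filter`, `tokenize`, and `token_filter` are made up for illustration:

```python
import re

def char_filter(text):
    # character filter: strip HTML tags from the raw text
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text):
    # tokenizer: split the filtered text into terms on non-word characters
    return [t for t in re.split(r"\W+", text) if t]

def token_filter(tokens):
    # token filter: lowercase every term
    return [t.lower() for t in tokens]

def analyze(text):
    # an analyzer chains the three stages in order
    return token_filter(tokenize(char_filter(text)))

print(analyze("<b>Mastering</b> Elasticsearch"))  # ['mastering', 'elasticsearch']
```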
The _analyze API¶
- Test by specifying the analyzer directly
POST /_analyze
{
"analyzer":"standard",
"text":"Mastering is last"
}
- Test against a specific field of an index
# inspect the tokens produced for the field
POST books/_analyze
{
"field":"title",
"text":"Mastering is ElasticSearch"
}
- Test a custom combination of tokenizer and token filters
POST _analyze
{
"tokenizer":"standard",
"filter":["lowercase"],
"text":"Mastering Elasticsearch"
}
Common Analyzers¶
- standard – the default analyzer: splits text into words and lowercases the terms
POST _analyze
{
"analyzer": "standard",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
- Simple Analyzer – splits on non-letter characters (the separators are discarded) and lowercases; everything that is not a letter, including digits, is removed
POST _analyze
{
"analyzer": "simple",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
- Whitespace Analyzer – splits on whitespace only; no lowercasing, and punctuation is kept
POST _analyze
{
"analyzer": "whitespace",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
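The difference between these analyzers is easiest to see side by side. Below is a rough Python approximation (the real analyzers use Unicode text segmentation; these regexes only cover plain ASCII text):

```python
import re

TEXT = "2 running Quick brown-foxes leap over lazy dogs"

# standard (approx.): split on non-alphanumeric characters, then lowercase
standard = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", TEXT)]

# simple: split on anything that is not a letter, then lowercase (digits dropped)
simple = [t.lower() for t in re.findall(r"[A-Za-z]+", TEXT)]

# whitespace: split on whitespace only, original case preserved
whitespace = TEXT.split()

print(standard)    # starts with '2', all lowercase, 'brown-foxes' becomes two terms
print(simple)      # no '2'; 'brown-foxes' still becomes two terms
print(whitespace)  # 'Quick' keeps its case, 'brown-foxes' stays one token
```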
- Keyword Analyzer – no tokenization: the entire input is emitted as a single term
- Pattern Analyzer – splits by regular expression; the default pattern is \W+ (split on non-word characters)
- Language analyzers – built-in analyzers for 30+ common languages
#english
GET _analyze
{
"analyzer": "english",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
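Beyond lowercasing, language analyzers apply stopword removal and stemming, so "foxes" and "fox" match the same term. A toy suffix-stripping sketch (the real english analyzer uses the Porter stemmer and a full stopword list; this stopword set and stemmer are deliberately simplified):

```python
# tiny subset of the english stopword list, for illustration only
STOPWORDS = {"in", "the", "is", "a", "an"}

def toy_stem(token):
    # crude suffix stripping; the real analyzer uses the Porter algorithm
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def toy_english(text):
    # lowercase, drop stopwords, then stem each remaining term
    tokens = [t.lower() for t in text.split()]
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]

print(toy_english("lazy dogs in the summer"))  # ['lazy', 'dog', 'summer']
```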
- Chinese analysis with ICU requires installing the analysis-icu plugin (the plugin version must match the Elasticsearch version)
POST _analyze
{
"analyzer": "icu_analyzer",
"text": "他说的确实在理”"
}
- IK – a popular community Chinese analyzer plugin (analysis-ik) that supports custom dictionaries; as with analysis-icu, the plugin version must match Elasticsearch
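Chinese has no whitespace between words, so both ICU and IK segment text using dictionaries and disambiguation rules. A toy forward-maximum-matching sketch shows why this is hard (the word list here is made up for the example; real plugins ship large dictionaries and smarter disambiguation):

```python
# toy dictionary; real analyzers ship dictionaries with hundreds of thousands of entries
WORDS = {"他", "说", "的", "确实", "在理", "的确", "实在", "理"}
MAX_LEN = 2  # longest word in the toy dictionary

def fmm(text):
    # forward maximum matching: at each position, greedily take the
    # longest dictionary word; fall back to a single character
    out, i = [], 0
    while i < len(text):
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            word = text[i : i + length]
            if length == 1 or word in WORDS:
                out.append(word)
                i += length
                break
    return out

print(fmm("他说的确实在理"))  # ['他', '说', '的确', '实在', '理']
# Greedy matching picks the wrong reading here; the intended
# segmentation is 他 / 说 / 的 / 确实 / 在理.
```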