Document normalization (normalization)
Normalization rewrites tokens into a standard form (lowercasing, stemming, stop-word removal and the like), which improves recall.
Example code
#normalization
GET _analyze
{
  "text": "Mr. Ma is an excellent teacher",
  "analyzer": "english"
}
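For comparison, the standard analyzer lowercases but does no stemming or stop-word removal. The english request above yields roughly [mr, ma, excel, teacher] (stop words dropped, "excellent" stemmed), while the request below keeps every word:

GET _analyze
{
  "text": "Mr. Ma is an excellent teacher",
  "analyzer": "standard"
}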
Character filter (character filter)
Pre-processing applied before tokenization; strips characters that carry no value.
HTML strip character filter (html_strip)
Official reference:
HTML strip character filter | Elasticsearch Guide [8.11] | Elastic
Example code
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
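The filter strips the tags, leaving roughly the plain text I'm so happy! as the single keyword token. To apply html_strip at index time, wire it into an analyzer; a minimal sketch (my_html_index and my_html_analyzer are illustrative names):

PUT my_html_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["html_strip"]
        }
      }
    }
  }
}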
Mapping character filter (MappingCharFilter)
Official reference:
Mapping character filter | Elasticsearch Guide [8.11] | Elastic
Example code
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "滚 => *",
            "垃 => *",
            "圾 => *"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "你就是个垃圾!滚"
}
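Each mapped character is replaced, so the request emits the single token 你就是个**!*. Mappings may also target multi-character strings; a sketch modeled on the official docs' emoticon example (index and filter names are made up):

PUT my_emoticon_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons_filter": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "analyzer": {
        "emoticon_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["emoticons_filter"]
        }
      }
    }
  }
}

GET my_emoticon_index/_analyze
{
  "analyzer": "emoticon_analyzer",
  "text": "I feel :) today"
}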
Pattern replace character filter (pattern_replace)
Official reference:
Pattern replace character filter | Elasticsearch Guide [8.11] | Elastic
Example code
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d{3})\\d{4}(\\d{4})",
          "replacement": "$1****$2"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "您的手机号是17611001200"
}
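Note that the backslashes in the pattern must be escaped in JSON (\\d), and the $1/$2 references in the replacement keep the captured prefix and suffix. Assuming the configuration above, the expected result is a single token:

# Expected token: "您的手机号是176****1200"
# (group 1 keeps the first three digits, group 2 the last four; the middle four are masked)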
Token filter (token filter)
Handles stop words, tense normalization, case conversion, synonym expansion, filler words, and so on. For example: has => have, him => he, apples => apple.
Example code
# stop words
PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": ["me", "you"]
        }
      }
    }
  }
}
GET test_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["Teacher me and you in the china"]
}
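With "me" and "you" configured as stop words, the expected tokens are [teacher, and, in, the, china]. Synonym expansion, mentioned above, is also a token filter; a minimal sketch (synonym_index and my_synonym are illustrative names):

PUT synonym_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": ["good, excellent"]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonym"]
        }
      }
    }
  }
}

GET synonym_index/_analyze
{
  "analyzer": "synonym_analyzer",
  "text": "an excellent teacher"
}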
Tokenizer (tokenizer)
Splits the text stream into individual tokens.
Official reference:
Tokenizer reference | Elasticsearch Guide [8.11] | Elastic
Common tokenizers and analyzers
- standard analyzer: the default analyzer; handles Chinese poorly, splitting it character by character.
- pattern tokenizer: splits text into terms wherever a regular expression matches a separator.
- simple_pattern tokenizer: matches the terms themselves with a regular expression; faster than the pattern tokenizer.
- whitespace analyzer: splits on whitespace.
- IK analyzer: a Chinese analyzer (GitHub - medcl/elasticsearch-analysis-ik: The IK Analysis plugin integrates Lucene IK analyzer into elasticsearch, support customized dictionary.)
Example code
# tokenizer
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "小孩儿不能吃糖"
}
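IK ships two analyzers: ik_max_word produces the finest-grained, exhaustive split (commonly used at index time), while ik_smart produces a coarser split (commonly used at search time). Rerunning the same text makes the difference visible:

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "小孩儿不能吃糖"
}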
Custom analyzer
- char_filter: built-in or custom character filters.
- filter: built-in or custom token filters.
- tokenizer: a built-in or custom tokenizer.
Example code
PUT custom_analysis
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "| => or"
          ]
        },
        "html_strip_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      },
      "filter": {
        "my_stopword": {
          "type": "stop",
          "stopwords": ["is", "in", "the", "a", "at", "for"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[ ,.!?]"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter", "html_strip_char_filter"],
          "filter": ["my_stopword", "lowercase"],
          "tokenizer": "my_tokenizer"
        }
      }
    }
  }
}

GET custom_analysis/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["What is ,as.df ssin ? & | is ! in the a at for "]
}
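Tracing the request through the chain: the mapping char filter rewrites & to and and | to or, the pattern tokenizer splits on spaces and the punctuation [,.!?], my_stopword drops the configured words, and lowercase normalizes what remains. The expected output is therefore roughly:

# Expected tokens: ["what", "as", "df", "ssin", "and", "or"]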