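What about the pattern tokenizer? I wrote a regular expression that splits the string after its 4th, 7th, 8th, and 10th characters (backslashes are doubled here for JSON):
(?<=(^\\w{4}))|(?<=^\\w{4}(\\w{3}))|(?<=^\\w{4}\\w{3}(\\w{1}))|(?<=^\\w{4}\\w{3}\\w{1}(\\w{2}))
Then I created an analyzer that uses it: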
PUT /myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "codeanalyzer": {
          "type": "pattern",
          "pattern": "(?<=(^\\w{4}))|(?<=^\\w{4}(\\w{3}))|(?<=^\\w{4}\\w{3}(\\w{1}))|(?<=^\\w{4}\\w{3}\\w{1}(\\w{2}))"
        }
      }
    }
  }
}
POST /myindex/_analyze?analyzer=codeanalyzer&text=ABCD1E2F34
The result is the tokenized data:
{
  "tokens": [
    {
      "token": "abcd",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "1e2",
      "start_offset": 4,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "f",
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "34",
      "start_offset": 8,
      "end_offset": 10,
      "type": "word",
      "position": 3
    }
  ]
}
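If you want to check the regex without an Elasticsearch node, a rough local sketch in plain Java might look like the one below. It is not the analyzer itself: it only assumes that splitting on the same pattern and then lowercasing (the pattern analyzer lowercases by default) approximates what the analyzer does. The class name PatternSplitSketch and the method tokenize are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Elasticsearch: it only mimics
// "split on the pattern, then lowercase" to sanity-check the regex.
class PatternSplitSketch {
    // Same regex as in the analyzer settings; the doubled backslashes
    // are Java string escapes, just as they are JSON escapes above.
    static final Pattern SPLIT = Pattern.compile(
        "(?<=(^\\w{4}))|(?<=^\\w{4}(\\w{3}))|(?<=^\\w{4}\\w{3}(\\w{1}))|(?<=^\\w{4}\\w{3}\\w{1}(\\w{2}))");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Each lookbehind matches a zero-width position, so split()
        // cuts the input there without consuming any characters.
        for (String part : SPLIT.split(text)) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("ABCD1E2F34")); // prints [abcd, 1e2, f, 34]
    }
}
```

Because every alternative is anchored with ^ inside the lookbehind, the cut points are fixed offsets from the start of the string, so anything after the 10th character would stay in one trailing token.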
You can also check the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html