
elasticsearch - Elasticsearch 6.8 match_phrase search with N-gram tokenizer does not work well


I am using Elasticsearch's N-gram tokenizer together with match_phrase for fuzzy matching.
My index and test data are as follows:

DELETE /m8
PUT m8
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 3,
          "custom_token_chars": "_."
        }
      }
    },
    "max_ngram_diff": 10
  },
  "mappings": {
    "table": {
      "properties": {
        "dataSourceId": {
          "type": "long"
        },
        "dataSourceType": {
          "type": "integer"
        },
        "dbName": {
          "type": "text",
          "analyzer": "my_analyzer",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}


PUT /m8/table/1
{
  "dataSourceId": 1,
  "dataSourceType": 2,
  "dbName": "rm.rf"
}

PUT /m8/table/2
{
  "dataSourceId": 1,
  "dataSourceType": 2,
  "dbName": "rm_rf"
}

PUT /m8/table/3
{
  "dataSourceId": 1,
  "dataSourceType": 2,
  "dbName": "rmrf"
}
Check with _analyze:

POST m8/_analyze
{
  "tokenizer": "my_tokenizer",
  "text": "rm.rf"
}
_analyze result:
{
  "tokens" : [
    { "token" : "r",   "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0 },
    { "token" : "rm",  "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 1 },
    { "token" : "rm.", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 2 },
    { "token" : "m",   "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 3 },
    { "token" : "m.",  "start_offset" : 1, "end_offset" : 3, "type" : "word", "position" : 4 },
    { "token" : "m.r", "start_offset" : 1, "end_offset" : 4, "type" : "word", "position" : 5 },
    { "token" : ".",   "start_offset" : 2, "end_offset" : 3, "type" : "word", "position" : 6 },
    { "token" : ".r",  "start_offset" : 2, "end_offset" : 4, "type" : "word", "position" : 7 },
    { "token" : ".rf", "start_offset" : 2, "end_offset" : 5, "type" : "word", "position" : 8 },
    { "token" : "r",   "start_offset" : 3, "end_offset" : 4, "type" : "word", "position" : 9 },
    { "token" : "rf",  "start_offset" : 3, "end_offset" : 5, "type" : "word", "position" : 10 },
    { "token" : "f",   "start_offset" : 4, "end_offset" : 5, "type" : "word", "position" : 11 }
  ]
}
When I search for "rm", nothing is found:

GET /m8/table/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "dbName": "rm"
          }
        }
      ]
    }
  }
}
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}
But ".rf" can be found:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.7260926,
    "hits" : [
      {
        "_index" : "m8",
        "_type" : "table",
        "_id" : "1",
        "_score" : 1.7260926,
        "_source" : {
          "dataSourceId" : 1,
          "dataSourceType" : 2,
          "dbName" : "rm.rf"
        }
      }
    ]
  }
}
My question:
Why is "rm" not found, even though _analyze shows that it is produced as a token?

Best Answer

  • my_analyzer is also used at search time:
    "properties": {
      "dbName": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "my_analyzer" // <==== if you don't provide a search_analyzer, the analyzer you defined above is used at search time as well
      }
    }
  • A match_phrase query matches a phrase by taking the positions of the analyzed text into account. For example, searching for "Kal ho" matches documents whose analyzed text has "Kal" at position X and "ho" at position X + 1.
  • When you search for "rm" (#1), the query text is analyzed with my_analyzer, which turns it into n-grams, and the phrase search then runs on top of those n-grams. Hence the results are not what you would expect (see the sketch right after this list).
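To make the position argument concrete, here is a sketch of what the query side looks like (the token list below is inferred from the my_tokenizer settings above, not copied from an actual run):

POST m8/_analyze
{
  "analyzer": "my_analyzer",
  "text": "rm"
}

This should produce the n-grams "r" (position 0), "rm" (position 1) and "m" (position 2). The match_phrase query therefore looks for "r", "rm" and "m" at consecutive positions, but in the indexed "rm.rf" above "m" sits at position 3 (position 2 is "rm."), so no document matches.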

  • Solutions:
  • Use the standard analyzer with a simple match query:
    GET /m8/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "dbName": {
                  "query": "rm",
                  "analyzer": "standard" // <=========
                }
              }
            }
          ]
        }
      }
    }
    Or define a search_analyzer in the mapping and use a match query (not match_phrase):
    "properties": {
      "dbName": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "standard" // <==========
      }
    }

  • Follow-up question: why do you want to use a match_phrase query with an n-gram tokenizer?

    Regarding "elasticsearch - Elasticsearch 6.8 match_phrase search with N-gram tokenizer does not work well", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/64277914/
