
Elasticsearch - Analyzer creates the correct tokens but the query does not match

Reposted. Author: 行者123. Updated: 2023-11-29 02:48:30

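I am trying to get Elasticsearch to ignore hyphens. I don't want it to split the two sides of a hyphen into separate words. It seems simple, but I am banging my head against the wall.

I want the string "Roland JD-Xi" to produce the following terms: [ roland jd-xi, roland, jd-xi, jdxi, roland jdxi ]

I have not been able to achieve this easily. Most people will just type "jdxi", so my initial thought was to simply remove the hyphen. So I am using the following definition: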

  "name": {
    "type": "string",
    "analyzer": "language",
    "include_in_all": true,
    "boost": 5,
    "fields": {
      "my_standard": {
        "type": "string",
        "analyzer": "my_standard"
      },
      "my_prefix": {
        "type": "string",
        "analyzer": "my_text_prefix",
        "search_analyzer": "my_standard"
      },
      "my_suffix": {
        "type": "string",
        "analyzer": "my_text_suffix",
        "search_analyzer": "my_standard"
      }
    }
  }

The relevant analyzers and filters are defined as follows:

{
  "number_of_replicas": 0,
  "number_of_shards": 1,
  "analysis": {
    "analyzer": {
      "std": {
        "tokenizer": "standard",
        "char_filter": "html_strip",
        "filter": ["standard", "elision", "asciifolding", "lowercase", "stop", "length", "strip_hyphens"]
      ...
      "my_text_prefix": {
        "tokenizer": "whitespace",
        "char_filter": "my_filter",
        "filter": ["standard", "elision", "asciifolding", "lowercase", "stop", "edge_ngram_front"]
      },
      "my_text_suffix": {
        "tokenizer": "whitespace",
        "char_filter": "my_filter",
        "filter": ["standard", "elision", "asciifolding", "lowercase", "stop", "edge_ngram_back"]
      },
      "my_standard": {
        "type": "custom",
        "tokenizer": "whitespace",
        "char_filter": "my_filter",
        "filter": ["standard", "elision", "asciifolding", "lowercase"]
      }
    },
    "char_filter": {
      "my_filter": {
        "type": "mapping",
        "mappings": ["- => ", ". => "]
      }
    },
    "filter": {
      "edge_ngram_front": {
        "type": "edgeNGram",
        "min_gram": 1,
        "max_gram": 20,
        "side": "front"
      },
      "edge_ngram_back": {
        "type": "edgeNGram",
        "min_gram": 1,
        "max_gram": 20,
        "side": "back"
      },
      "strip_spaces": {
        "type": "pattern_replace",
        "pattern": "\\s",
        "replacement": ""
      },
      "strip_dots": {
        "type": "pattern_replace",
        "pattern": "\\.",
        "replacement": ""
      },
      "strip_hyphens": {
        "type": "pattern_replace",
        "pattern": "-",
        "replacement": ""
      },
      "stop": {
        "type": "stop",
        "stopwords": "_none_"
      },
      "length": {
        "type": "length",
        "min": 1
      }
    }
  }
}

I have been able to verify (via _analyze) that the string "Roland JD-Xi" is tokenized as [ roland, jdxi ].
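To make that tokenization concrete, here is a rough Python sketch of what the `my_standard` analyzer does to this input. It is a simplified stand-in, not Elasticsearch itself: the function name `simulate_my_standard` is made up for illustration, and the `standard`, `elision` and `asciifolding` filters are left out on the assumption that they do not change this particular plain-ASCII input.

```python
def simulate_my_standard(text):
    # char_filter "my_filter": delete hyphens and dots before tokenizing
    for ch in ("-", "."):
        text = text.replace(ch, "")
    # "whitespace" tokenizer: split on runs of whitespace only
    tokens = text.split()
    # "lowercase" token filter
    return [t.lower() for t in tokens]


tokens = simulate_my_standard("Roland JD-Xi")
print(tokens)  # ['roland', 'jdxi']
```

Because the hyphen is removed before the whitespace tokenizer runs, "JD-Xi" survives as the single token `jdxi` instead of being split in two.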

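It is not exactly what I want, but it is close enough, because it should match "jdxi".

But here is my problem. If I run a simple "index/_search?q=jdxi", it does not return the document. However, if I run "index/_search?q=roland+jdxi", it does return the document.

So at least I know the hyphen is being removed. But if the tokens "roland" and "jdxi" are being created, why doesn't "index/_search?q=jdxi" match the document?

  1. Is my problem in the indexing process or in the query process?
  2. How do I fix it?
  3. Can anyone explain how to get the desired tokens [ roland jd-xi, roland, jd-xi, jdxi, roland jdxi ]?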

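Best answer

I reproduced your case on ES 6, and searching with index/_search?q=jdxi returned the document.

The problem is probably that when you search index/_search?q=jdxi without specifying a field, the query essentially runs against _all, which contains the name field (this is basically the same as index/_search?q=name:jdxi). Since that field is not analyzed with your my_standard analyzer, you get no results.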
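The mismatch can be reproduced outside Elasticsearch with a small Python sketch. It assumes that `_all` is analyzed with the default `standard` analyzer, approximated here by splitting on any non-alphanumeric character and lowercasing; the helper names are hypothetical. It also relies on the query string's default operator being OR, which would explain why `roland jdxi` still finds the document while `jdxi` alone does not.

```python
import re


def approx_standard_analyzer(text):
    # Rough approximation of the standard analyzer for this input:
    # split on anything that is not an ASCII letter or digit, then lowercase.
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]


def matches_any(query, indexed_terms):
    # Default OR operator: one matching query term is enough.
    return any(term in indexed_terms for term in approx_standard_analyzer(query))


indexed = set(approx_standard_analyzer("Roland JD-Xi"))
print(sorted(indexed))                      # ['jd', 'roland', 'xi']
print(matches_any("jdxi", indexed))         # False
print(matches_any("roland jdxi", indexed))  # True
```

Under these assumptions the hyphen splits "JD-Xi" into `jd` and `xi` at index time, so the term `jdxi` never exists in the field being searched, and the two-word query succeeds only because of `roland`.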

What you should do is search against the my_standard subfield, i.e. index/_search?q=name.my_standard:jdxi, and I am fairly sure you will get the document.
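As a sketch of that request, the snippet below builds the URI search path from the answer together with an equivalent request body that uses a `match` query. The index name `index` is the placeholder used in the question, the helper function is hypothetical, and nothing is actually sent to a server.

```python
import json


def build_subfield_search(index, field, text):
    # URI search form, as suggested in the answer
    path = f"{index}/_search?q={field}:{text}"
    # Request-body form: a match query analyzes `text` with the
    # analyzer configured for `field` (my_standard for name.my_standard)
    body = {"query": {"match": {field: text}}}
    return path, body


path, body = build_subfield_search("index", "name.my_standard", "jdxi")
print(path)              # index/_search?q=name.my_standard:jdxi
print(json.dumps(body))  # {"query": {"match": {"name.my_standard": "jdxi"}}}
```

Either form targets the subfield explicitly, so the query text and the indexed text go through the same hyphen-stripping analyzer.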

Regarding "Elasticsearch - Analyzer creates the correct tokens but the query does not match", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49405148/
