
elasticsearch - Dots in a field are not used to break words for the analyzer


I have the following index document mapping (simplified):

{
  "documents": {
    "mappings": {
      "document": {
        "properties": {
          "filename": {
            "type": "string",
            "fields": {
              "lower_case_sort": {
                "type": "string",
                "analyzer": "case_insensitive_sort"
              },
              "raw": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          }
        }
      }
    }
  }
}

I put two documents into this index:

{
  "_index": "documents",
  "_type": "document",
  "_id": "777",
  "_source": {
    "filename": "text.txt"
  }
}

...

{
  "_index": "documents",
  "_type": "document",
  "_id": "888",
  "_source": {
    "filename": "text 123.txt"
  }
}

Making a query_string or simple_query_string query for "text", I would expect to get both documents back. They should match, because the filenames are "text.txt" and "text 123.txt".

http://localhost:9200/defiant/_search?q=text
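For reference, the same search written out as an explicit query_string request body looks like this (a sketch; the documents index name is taken from the mapping above):

# search for "text" across all fields, equivalent to the ?q=text request above
curl -XGET 'http://localhost:9200/documents/_search' -d '
{
  "query": {
    "query_string": { "query": "text" }
  }
}'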

However, I only find the document named "text 123.txt" - "text.txt" is only found if I search for "text.*", "text.txt" or "text.???" - I have to include the dot in the filename.

Here is my explain result for document id 777 (text.txt):

curl -XGET 'http://localhost:9200/documents/document/777/_explain' -d '{"query": {"query_string" : {"query" : "text"}}}'

-->

{
  "_index": "documents",
  "_type": "document",
  "_id": "777",
  "matched": false,
  "explanation": {
    "value": 0.0,
    "description": "Failure to meet condition(s) of required/prohibited clause(s)",
    "details": [{
      "value": 0.0,
      "description": "no match on required clause (_all:text)",
      "details": [{
        "value": 0.0,
        "description": "no matching term",
        "details": []
      }]
    }, {
      "value": 0.0,
      "description": "match on required clause, product of:",
      "details": [{
        "value": 0.0,
        "description": "# clause",
        "details": []
      }, {
        "value": 0.47650534,
        "description": "_type:document, product of:",
        "details": [{
          "value": 1.0,
          "description": "boost",
          "details": []
        }, {
          "value": 0.47650534,
          "description": "queryNorm",
          "details": []
        }]
      }]
    }]
  }
}

Did I mess up the mapping? I would have expected the "." to be treated as a term separator when the document gets indexed...

Edit: settings of case_insensitive_sort

{
  "documents": {
    "settings": {
      "index": {
        "creation_date": "1473169458336",
        "analysis": {
          "analyzer": {
            "case_insensitive_sort": {
              "filter": [
                "lowercase"
              ],
              "tokenizer": "keyword"
            }
          }
        }
      }
    }
  }
}

Best Answer

This is the expected behaviour with the standard analyzer (the default analyzer), since it uses the standard tokenizer, and according to the algorithm it follows, the dot is not treated as a delimiter.

You can verify this with the help of the analyze API:

curl -XGET 'localhost:9200/_analyze' -d '
{
  "analyzer" : "standard",
  "text" : "test.txt"
}'

It produces only a single token:

{
  "tokens": [
    {
      "token": "test.txt",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

You can use a pattern_replace char filter to replace the dot with a whitespace:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "replace_dot"
          ]
        }
      },
      "char_filter": {
        "replace_dot": {
          "type": "pattern_replace",
          "pattern": "\\.",
          "replacement": " "
        }
      }
    }
  }
}
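These settings only define the analyzer; it also has to be attached to the filename field. Since the analyzer of an existing analyzed string field generally cannot be changed in place, one way is to create a new index with both the settings and the mapping. A minimal sketch (the index name documents_v2 and the reduced single-field mapping are assumptions, not from the question):

# create a new index whose filename field uses my_analyzer (index name is hypothetical)
curl -XPUT 'http://localhost:9200/documents_v2' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["replace_dot"]
        }
      },
      "char_filter": {
        "replace_dot": {
          "type": "pattern_replace",
          "pattern": "\\.",
          "replacement": " "
        }
      }
    }
  },
  "mappings": {
    "document": {
      "properties": {
        "filename": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}'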

You will have to reindex your documents, and then you will get the desired results. The analyze API is very handy for checking how documents are stored in the inverted index.
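As a quick check (a sketch, assuming the new index above was created as documents_v2), the analyze API should now return two tokens, "test" and "txt", instead of the single "test.txt" token:

# analyze a sample filename with the custom analyzer (index name is hypothetical)
curl -XGET 'localhost:9200/documents_v2/_analyze' -d '
{
  "analyzer" : "my_analyzer",
  "text" : "test.txt"
}'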

Update

You have to specify the name of the field you want to search. The request below looks for the text in the _all field, which uses the standard analyzer by default:

http://localhost:9200/defiant/_search?q=text

I think the query below should give you the desired results:

curl -XGET 'http://localhost:9200/twitter/_search?q=filename:text'

Regarding "elasticsearch - Dots in a field are not used to break words for the analyzer", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40135688/
