elasticsearch - simple_query_string with special characters such as ( and =

Reposted. Author: 行者123. Updated: 2023-12-02 23:22:54

Here is my index:

PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": "true"
        }
      },
      "analyzer": {
        "include_special_character": {
          "type": "custom",
          "filter": [
            "lowercase",
            "my_ascii_folding"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  }
}
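As a quick mental model of this analyzer, here is a simplified Python sketch (an approximation, not the real implementation): the whitespace tokenizer splits only on whitespace, and the lowercase filter lowercases each token. asciifolding only matters for accented input, so it is omitted.

```python
def include_special_character_analyze(text):
    """Simplified model of the analyzer above: whitespace tokenizer
    plus lowercase filter. asciifolding is a no-op for plain ASCII
    input, so it is left out of this sketch."""
    return [token.lower() for token in text.split()]

# Special characters such as '=', '(' and ';' survive intact,
# because whitespace is the only token boundary.
print(include_special_character_analyze("formula =IF(SUM(3;4;5))"))
# ['formula', '=if(sum(3;4;5))']
```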

Here is my mapping:

PUT /my_index/_mapping/formulas
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "include_special_character"
    }
  }
}

My sample data:
POST /_bulk
{"index":{"_index":"my_index","_type":"formulas"}}
{"content":"formula =IF(SUM(3;4;5))"}
{"index":{"_index":"my_index","_type":"formulas"}}
{"content":"some if words: dif difuse"}

With this query I only want the record containing the formula ("formula =IF(SUM(3;4;5))") to be returned, but it returns both:

GET /my_index/_search
{
  "query": {
    "simple_query_string": {
      "query": "if(",
      "analyzer": "include_special_character",
      "fields": ["_all"]
    }
  }
}

And this query does not return the record with the formula at all:

GET /my_index/_search
{
  "query": {
    "simple_query_string": {
      "query": "=if(",
      "analyzer": "include_special_character",
      "fields": ["_all"]
    }
  }
}

How can I fix both queries so they return the expected results?

Thanks

Best Answer

First off, thank you for including all the requests needed to reproduce the dataset locally. It made finding an answer much easier.

There are a few fairly interesting things going on here. The first thing I want to point out is what actually happens to your query when the _all field is involved, because there is some subtle behavior that can easily cause confusion.

I will lean on the _analyze endpoint to help show what is happening here.

First, here is a request that shows how the input is analyzed against the "content" field:

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "formula =IF(SUM(3;4;5))"
  ],
  "field": "content"
}

Result:
{
"tokens": [
{
"token": "formula",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "=if(sum(3;4;5))",
"start_offset": 8,
"end_offset": 23,
"type": "word",
"position": 1
}
]
}

So far, so good; this is probably what you expected to see. If you want to dig even deeper into what is happening, add the following to the analyze request:

"explain": true

Now, if you remove the "analyzer" value from that analyze request, the output stays the same. That is because we were only overriding the analyzer choice with the analyzer that is already configured: without it, we fall back to the field being queried and the analyzer specified in its mapping.

To prove this, I will analyze against a field that has no mapping on the index you provided, specifying the analyzer in one request and omitting it in the other.

In:

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "formula =IF(SUM(3;4;5))"
  ],
  "field": "test"
}

Out:
{
"tokens": [
{
"token": "formula",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "=if(sum(3;4;5))",
"start_offset": 8,
"end_offset": 23,
"type": "word",
"position": 1
}
]
}

Now with no analyzer specified.

In:

GET my_index/_analyze
{
  "text": [
    "formula =IF(SUM(3;4;5))"
  ],
  "field": "test"
}

Out:
{
"tokens": [
{
"token": "formula",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "if",
"start_offset": 9,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "sum",
"start_offset": 12,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "3;4;5",
"start_offset": 16,
"end_offset": 21,
"type": "<NUM>",
"position": 3
}
]
}

In this second example, it falls back to the default analyzer and interprets the input accordingly, because there is no mapping for the "test" field.

Now for some background on the "_all" field and why you are getting unexpected results. Per the documentation, you should treat the "_all" field as a special field that, unless explicitly disabled, always behaves like a "text" field:

The _all field is just a text field, and accepts the same parameters that other string fields accept, including analyzer, term_vectors, index_options, and store.



For completeness, here is how the other document is analyzed at index time.

In:

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "some if words: dif difuse"
  ],
  "field": "content"
}

Out:
{
"tokens": [
{
"token": "some",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "if",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 1
},
{
"token": "words:",
"start_offset": 8,
"end_offset": 14,
"type": "word",
"position": 2
},
{
"token": "dif",
"start_offset": 15,
"end_offset": 18,
"type": "word",
"position": 3
},
{
"token": "difuse",
"start_offset": 19,
"end_offset": 25,
"type": "word",
"position": 4
}
]
}

Now, with that background on why analyzers behave the way they do for existing fields, treat the "_all" field as one that is effectively mapped as text. It appears that when you query "_all", the analyzer you specify is simply ignored, which defeats the setup above. Hopefully the following result is no longer a surprise.

In:

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "=if("
  ],
  "field": "_all"
}

Out:
{
"tokens": [
{
"token": "if",
"start_offset": 1,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
}
]
}

In the example above, no matter which analyzer I specify, the "_all" field is treated as a mapped text field and uses the analyzer associated with it.

So when you search the "_all" field, you get a match because both the indexed terms and the query terms went through the default analyzer rather than the one you specified, leaving the token "if" present both in the document's "_all" field and in the query text.
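This token overlap can be sketched in Python. The function below is only a crude stand-in for Elasticsearch's default (standard) analyzer: it splits on runs of non-alphanumeric characters and lowercases, whereas the real analyzer uses Unicode segmentation rules (for instance, it keeps "3;4;5" as one number token). The point about the shared "if" token is the same either way.

```python
import re

def standard_like_tokens(text):
    """Crude approximation of the default (standard) analyzer:
    split on non-alphanumeric runs, then lowercase. The real
    analyzer follows Unicode word-boundary rules, but the symbols
    '=' and '(' are stripped in both cases."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

doc1 = set(standard_like_tokens("formula =IF(SUM(3;4;5))"))
doc2 = set(standard_like_tokens("some if words: dif difuse"))
query = set(standard_like_tokens("if("))

# "=" and "(" are stripped from the query, so "if" survives in the
# query and in both documents -- every document matches.
print(query & doc1)  # {'if'}
print(query & doc2)  # {'if'}
```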

The most interesting part to me is that "=if(" returns no matches at all. I would normally have assumed it would be equivalent to "if" or "if(" in this scenario, since everything other than "if" gets thrown away by the default analyzer. In the case where you did not get the hit you expected, I believe it comes down to how the query string is parsed when the "=" character is present. I tried to research what exactly the equals sign does here, but beyond it being part of the Lucene syntax I did not find good documentation. I don't think understanding that equals sign is essential to your problem, but it is definitely something I am curious about, if anyone can shed light on it.

When I stepped away from "simple_query_string", I did manage to see results from both of the following queries...

With the equals sign:

GET /my_index/_search
{
  "query": {
    "match": {
      "_all": "=if("
    }
  }
}

Without the equals sign:

GET /my_index/_search
{
  "query": {
    "match": {
      "_all": "if("
    }
  }
}

So now, after all of that exploration, here are some thoughts on potential ways to solve your problem.

Here are the tokens of the document we want the match to return...

In:
GET my_index/formulas/AV9GIDTggkgblFY6zpKT/_termvectors?fields=content

Out:
{
"_index": "my_index",
"_type": "formulas",
"_id": "AV9GIDTggkgblFY6zpKT",
"_version": 1,
"found": true,
"took": 0,
"term_vectors": {
"content": {
"field_statistics": {
"sum_doc_freq": 7,
"doc_count": 2,
"sum_ttf": 7
},
"terms": {
"=if(sum(3;4;5))": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 8,
"end_offset": 23
}
]
},
"formula": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 7
}
]
}
}
}
}
}

Because of the above, if we change your query from "_all" to "content", you can only get a hit on the document we are interested in by using one of the two tokens in the response above: searching for "=if(sum(3;4;5))" or for "formula" would get a hit. While that is more accurate, I don't think it accomplishes your goal.

Another approach I might have considered, depending on your requirements, is a keyword mapping. However, that would be even more restrictive than your example, since each "content" field would then have exactly one token: its entire value. I believe the best fit for your problem requires adding an ngram tokenizer to your mapping.

Here is the series of requests I would use to solve this.

Index settings:

PUT /my_index2
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": "true"
        }
      },
      "analyzer": {
        "include_special_character_gram": {
          "type": "custom",
          "filter": [
            "lowercase",
            "my_ascii_folding"
          ],
          "tokenizer": "ngram_tokenizer"
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      }
    }
  }
}
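To get a feel for what this tokenizer emits, here is a small Python sketch of an ngram tokenizer with min_gram=2 and max_gram=5 (an approximation for illustration, not the real implementation). Since token_chars covers letters, digits, punctuation and symbols, whitespace is the only break, so grams are generated per whitespace-separated run; the analyzer's lowercase filter is folded in as well.

```python
def ngram_tokenize(text, min_gram=2, max_gram=5):
    """Approximate the ngram_tokenizer above: each run between
    whitespace is sliced into grams of length min_gram..max_gram,
    offset by offset. Lowercasing mimics the lowercase filter."""
    grams = []
    for run in text.lower().split():
        for start in range(len(run)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(run):
                    grams.append(run[start:start + size])
    return grams

# The query text "=if(" yields the same tokens shown later in the
# answer's _analyze output.
print(ngram_tokenize("=if("))  # ['=i', '=if', '=if(', 'if', 'if(', 'f(']
```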

Mapping:

PUT /my_index2/_mapping/formulas
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "include_special_character_gram"
    }
  }
}

Add the documents:
POST /_bulk
{"index":{"_index":"my_index2","_type":"formulas"}}
{"content":"formula =IF(SUM(3;4;5))"}
{"index":{"_index":"my_index2","_type":"formulas"}}
{"content":"some if words: dif difuse"}

Term vectors for the first document:
GET my_index2/formulas/AV9GZ3sSgkgblFY6zpK2/_termvectors?fields=content

Out:
{
"_index": "my_index2",
"_type": "formulas",
"_id": "AV9GZ3sSgkgblFY6zpK2",
"_version": 1,
"found": true,
"took": 0,
"term_vectors": {
"content": {
"field_statistics": {
"sum_doc_freq": 102,
"doc_count": 2,
"sum_ttf": 106
},
"terms": {
"(3": {
"term_freq": 1,
"tokens": [
{
"position": 46,
"start_offset": 15,
"end_offset": 17
}
]
},
"(3;": {
"term_freq": 1,
"tokens": [
{
"position": 47,
"start_offset": 15,
"end_offset": 18
}
]
},
... Omitting the rest because of max response lengths.
}
}
}

Now let's wrap this example up... Here is the match query I used earlier that returned both of your entries; it continues to do the same here.

In:

GET /my_index2/_search
{
  "query": {
    "match": {
      "content": "=if("
    }
  }
}

Out:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 2.9511943,
"hits": [
{
"_index": "my_index2",
"_type": "formulas",
"_id": "AV9GZ3sSgkgblFY6zpK2",
"_score": 2.9511943,
"_source": {
"content": "formula =IF(SUM(3;4;5))"
}
},
{
"_index": "my_index2",
"_type": "formulas",
"_id": "AV9GZ3sSgkgblFY6zpK3",
"_score": 0.30116585,
"_source": {
"content": "some if words: dif difuse"
}
}
]
}
}

So we see the same results, but why is that? In the query above, the same ngram analyzer is now applied to the input text as well, which means the two documents still share matching tokens!

In:

GET my_index2/_analyze
{
  "analyzer": "include_special_character_gram",
  "text": [
    "=if("
  ],
  "field": "t"
}

Out:
{
"tokens": [
{
"token": "=i",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "=if",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "=if(",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "if",
"start_offset": 1,
"end_offset": 3,
"type": "word",
"position": 3
},
{
"token": "if(",
"start_offset": 1,
"end_offset": 4,
"type": "word",
"position": 4
},
{
"token": "f(",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 5
}
]
}

If you run the request above, you will see the tokens generated from the query text. The key ingredient is to instead specify "keyword" as the query's analyzer, so that the entire query value becomes a single token that can match one of the indexed term vectors for the field.

In:

GET my_index2/_analyze
{
  "analyzer": "keyword",
  "text": [
    "=if("
  ]
}

Out:
{
"tokens": [
{
"token": "=if(",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
}
]
}

Let's see if that works...

In:

GET /my_index2/_search
{
  "query": {
    "match": {
      "content": {
        "query": "=if(",
        "analyzer": "keyword"
      }
    }
  }
}

Out:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.56074005,
"hits": [
{
"_index": "my_index2",
"_type": "formulas",
"_id": "AV9GZ3sSgkgblFY6zpK2",
"_score": 0.56074005,
"_source": {
"content": "formula =IF(SUM(3;4;5))"
}
}
]
}
}

So, based on the above, you can see how this works when we explicitly specify the keyword analyzer as the search analyzer against the ngram-analyzed field. Here is an update we can apply to the mapping to simplify our requests... (Note that you will need to delete and recreate the existing index, or reindex, for the analyzer change to take effect.)

PUT /my_index2/_mapping/formulas
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "include_special_character_gram",
      "search_analyzer": "keyword"
    }
  }
}
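The effect of this mapping can be sketched in Python, using the same approximate tokenizer as before (an illustration, not the real Lucene matching): documents are indexed as ngram tokens, the query stays a single token, and a match requires that whole token to appear among the indexed terms. One caveat worth labeling: the real keyword analyzer does not lowercase, so the query must already be lowercase (or you would define a custom keyword-tokenizer-plus-lowercase analyzer).

```python
def ngram_tokens(text, min_gram=2, max_gram=5):
    """Approximate index-time analysis: lowercase plus
    whitespace-bounded ngrams, mimicking the
    include_special_character_gram analyzer."""
    out = []
    for run in text.lower().split():
        for start in range(len(run)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(run):
                    out.append(run[start:start + size])
    return out

def keyword_match(doc_text, query_text):
    """Approximate search-time behavior with search_analyzer=keyword:
    the query is one unmodified token, and it matches only if it is
    among the document's indexed ngrams. (The keyword analyzer does
    not lowercase, so an uppercase query would miss.)"""
    return query_text in set(ngram_tokens(doc_text))

print(keyword_match("formula =IF(SUM(3;4;5))", "=if("))    # True
print(keyword_match("some if words: dif difuse", "=if("))  # False
```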

Now let's come back to the match query I originally used to show both documents being returned.

In:

GET /my_index2/_search
{
  "query": {
    "match": {
      "content": "=if("
    }
  }
}

Out:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.56074005,
"hits": [
{
"_index": "my_index2",
"_type": "formulas",
"_id": "AV9GZ3sSgkgblFY6zpK2",
"_score": 0.56074005,
"_source": {
"content": "formula =IF(SUM(3;4;5))"
}
}
]
}
}

Edit - as a simple_query_string query

In:

GET /my_index2/_search
{
  "query": {
    "simple_query_string": {
      "query": "=if\\(",
      "fields": ["content"]
    }
  }
}

Out:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.56074005,
"hits": [
{
"_index": "my_index2",
"_type": "formulas",
"_id": "AV9GZ3sSgkgblFY6zpK2",
"_score": 0.56074005,
"_source": {
"content": "formula =IF(SUM(3;4;5))"
}
}
]
}
}

There you have it. If you choose this route, you can obviously tweak the ngram sizes. This answer is lengthy enough already, so I won't attempt the other approaches that could also solve this, but I think having one working solution helps. What matters here is understanding what happens behind the scenes with the _all field and the interpretation of query strings.

Hope this helps, and thanks for the interesting question!

For "elasticsearch - simple_query_string with special characters such as ( and =", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46877483/
