elasticsearch - simple_query_string with special characters such as ( and =

Reposted. Author: 行者123. Updated: 2023-12-02 23:22:54

Here is my index:

PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": "true"
        }
      },
      "analyzer": {
        "include_special_character": {
          "type": "custom",
          "filter": [
            "lowercase",
            "my_ascii_folding"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  }
}
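As a quick mental model of this analyzer, here is a simplified Python sketch (an approximation, not the real implementation): the whitespace tokenizer splits only on whitespace, and the lowercase filter lowercases each token. asciifolding only matters for accented input, so it is omitted.

```python
def include_special_character_analyze(text):
    """Simplified model of the analyzer above: whitespace tokenizer
    plus lowercase filter. asciifolding is a no-op for plain ASCII
    input, so it is left out of this sketch."""
    return [token.lower() for token in text.split()]

# Special characters such as '=', '(' and ';' survive intact,
# because whitespace is the only token boundary.
print(include_special_character_analyze("formula =IF(SUM(3;4;5))"))
# ['formula', '=if(sum(3;4;5))']
```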

Here is my mapping:

PUT /my_index/_mapping/formulas
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "include_special_character"
    }
  }
}

My sample data:
POST /_bulk
{"index":{"_index":"my_index","_type":"formulas"}}
{"content":"formula =IF(SUM(3;4;5))"}
{"index":{"_index":"my_index","_type":"formulas"}}
{"content":"some if words: dif difuse"}

With this query I only want the record containing the formula ("formula =IF(SUM(3;4;5))") to be returned, but it returns both:

GET /my_index/_search
{
  "query": {
    "simple_query_string": {
      "query": "if(",
      "analyzer": "include_special_character",
      "fields": ["_all"]
    }
  }
}

And this query does not return the record with the formula at all:

GET /my_index/_search
{
  "query": {
    "simple_query_string": {
      "query": "=if(",
      "analyzer": "include_special_character",
      "fields": ["_all"]
    }
  }
}

How can I fix both queries so they return the expected results?

Thanks

Best Answer

First off, thank you for including all the requests needed to reproduce the dataset locally. It made finding an answer much easier.

There are a few fairly interesting things going on here. The first thing I want to point out is what actually happens to your query when the _all field is involved, because there is some subtle behavior that can easily cause confusion.

I will lean on the _analyze endpoint to help show what is happening here.

First, here is a request that shows how the input is analyzed against the "content" field:

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "formula =IF(SUM(3;4;5))"
  ],
  "field": "content"
}

Result:
{
"tokens": [
{
"token": "formula",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "=if(sum(3;4;5))",
"start_offset": 8,
"end_offset": 23,
"type": "word",
"position": 1
}
]
}

So far, so good; this is probably what you expected to see. If you want to dig even deeper into what is happening, add the following to the analyze request:

"explain": true

Now, if you remove the "analyzer" value from that analyze request, the output stays the same. That is because we were only overriding the analyzer choice with the analyzer that is already configured: without it, we fall back to the field being queried and the analyzer specified in its mapping.

To prove this, I will analyze against a field that has no mapping on the index you provided, specifying the analyzer in one request and omitting it in the other.

In:

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "formula =IF(SUM(3;4;5))"
  ],
  "field": "test"
}

Out:
{
"tokens": [
{
"token": "formula",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "=if(sum(3;4;5))",
"start_offset": 8,
"end_offset": 23,
"type": "word",
"position": 1
}
]
}

Now with no analyzer specified.

In:

GET my_index/_analyze
{
  "text": [
    "formula =IF(SUM(3;4;5))"
  ],
  "field": "test"
}

Out:
{
"tokens": [
{
"token": "formula",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "if",
"start_offset": 9,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "sum",
"start_offset": 12,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "3;4;5",
"start_offset": 16,
"end_offset": 21,
"type": "<NUM>",
"position": 3
}
]
}

In this second example, it falls back to the default analyzer and interprets the input accordingly, because there is no mapping for the "test" field.

Now for some background on the "_all" field and why you are getting unexpected results. Per the documentation, you should treat the "_all" field as a special field that, unless explicitly disabled, always behaves like a "text" field:

The _all field is just a text field, and accepts the same parameters that other string fields accept, including analyzer, term_vectors, index_options, and store.



For completeness, here is how the other document is analyzed at index time.

In:

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "some if words: dif difuse"
  ],
  "field": "content"
}

Out:
{
"tokens": [
{
"token": "some",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "if",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 1
},
{
"token": "words:",
"start_offset": 8,
"end_offset": 14,
"type": "word",
"position": 2
},
{
"token": "dif",
"start_offset": 15,
"end_offset": 18,
"type": "word",
"position": 3
},
{
"token": "difuse",
"start_offset": 19,
"end_offset": 25,
"type": "word",
"position": 4
}
]
}

Now, with that background on why analyzers behave the way they do for existing fields, treat the "_all" field as one that is effectively mapped as text. It appears that when you query "_all", the analyzer you specify is simply ignored, which defeats the setup above. Hopefully the following result is no longer a surprise.

In:

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "=if("
  ],
  "field": "_all"
}

Out:
{
"tokens": [
{
"token": "if",
"start_offset": 1,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
}
]
}

In the example above, no matter which analyzer I specify, the "_all" field is treated as a mapped text field and uses the analyzer associated with it.

So when you search the "_all" field, you get a match because both the indexed terms and the query terms went through the default analyzer rather than the one you specified, leaving the token "if" present both in the document's "_all" field and in the query text.
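This token overlap can be sketched in Python. The function below is only a crude stand-in for Elasticsearch's default (standard) analyzer: it splits on runs of non-alphanumeric characters and lowercases, whereas the real analyzer uses Unicode segmentation rules (for instance, it keeps "3;4;5" as one number token). The point about the shared "if" token is the same either way.

```python
import re

def standard_like_tokens(text):
    """Crude approximation of the default (standard) analyzer:
    split on non-alphanumeric runs, then lowercase. The real
    analyzer follows Unicode word-boundary rules, but the symbols
    '=' and '(' are stripped in both cases."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

doc1 = set(standard_like_tokens("formula =IF(SUM(3;4;5))"))
doc2 = set(standard_like_tokens("some if words: dif difuse"))
query = set(standard_like_tokens("if("))

# "=" and "(" are stripped from the query, so "if" survives in the
# query and in both documents -- every document matches.
print(query & doc1)  # {'if'}
print(query & doc2)  # {'if'}
```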

The most interesting part to me is that "=if(" returns no matches at all. I would normally have assumed it would be equivalent to "if" or "if(" in this scenario, since everything other than "if" gets thrown away by the default analyzer. In the case where you did not get the hit you expected, I believe it comes down to how the query string is parsed when the "=" character is present. I tried to research what exactly the equals sign does here, but beyond it being part of the Lucene syntax I did not find good documentation. I don't think understanding that equals sign is essential to your problem, but it is definitely something I am curious about, if anyone can shed light on it.

When I stepped away from "simple_query_string", I did manage to see results from both of the following queries...

With the equals sign:

GET /my_index/_search
{
  "query": {
    "match": {
      "_all": "=if("
    }
  }
}

Without the equals sign:

GET /my_index/_search
{
  "query": {
    "match": {
      "_all": "if("
    }
  }
}

So now, after all of that exploration, here are some thoughts on potential ways to solve your problem.

Here are the tokens of the document we want the match to return...

In:
GET my_index/formulas/AV9GIDTggkgblFY6zpKT/_termvectors?fields=content

Out:
{
"_index": "my_index",
"_type": "formulas",
"_id": "AV9GIDTggkgblFY6zpKT",
"_version": 1,
"found": true,
"took": 0,
"term_vectors": {
"content": {
"field_statistics": {
"sum_doc_freq": 7,
"doc_count": 2,
"sum_ttf": 7
},
"terms": {
"=if(sum(3;4;5))": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 8,
"end_offset": 23
}
]
},
"formula": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 7
}
]
}
}
}
}
}

Because of the above, if we change your query from "_all" to "content", you can only get a hit on the document we are interested in by using one of the two tokens in the response above: searching for "=if(sum(3;4;5))" or for "formula" would get a hit. While that is more accurate, I don't think it accomplishes your goal.

Another approach I might have considered, depending on your requirements, is a keyword mapping. However, that would be even more restrictive than your example, since each "content" field would then have exactly one token: its entire value. I believe the best fit for your problem requires adding an ngram tokenizer to your mapping.

Here is the series of requests I would use to solve this.

Index settings:

PUT /my_index2
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": "true"
        }
      },
      "analyzer": {
        "include_special_character_gram": {
          "type": "custom",
          "filter": [
            "lowercase",
            "my_ascii_folding"
          ],
          "tokenizer": "ngram_tokenizer"
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      }
    }
  }
}
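To get a feel for what this tokenizer emits, here is a small Python sketch of an ngram tokenizer with min_gram=2 and max_gram=5 (an approximation for illustration, not the real implementation). Since token_chars covers letters, digits, punctuation and symbols, whitespace is the only break, so grams are generated per whitespace-separated run; the analyzer's lowercase filter is folded in as well.

```python
def ngram_tokenize(text, min_gram=2, max_gram=5):
    """Approximate the ngram_tokenizer above: each run between
    whitespace is sliced into grams of length min_gram..max_gram,
    offset by offset. Lowercasing mimics the lowercase filter."""
    grams = []
    for run in text.lower().split():
        for start in range(len(run)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(run):
                    grams.append(run[start:start + size])
    return grams

# The query text "=if(" yields the same tokens shown later in the
# answer's _analyze output.
print(ngram_tokenize("=if("))  # ['=i', '=if', '=if(', 'if', 'if(', 'f(']
```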

Mapping:

PUT /my_index2/_mapping/formulas
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "include_special_character_gram"
    }
  }
}

Add the documents:
POST /_bulk
{"index":{"_index":"my_index2","_type":"formulas"}}
{"content":"formula =IF(SUM(3;4;5))"}
{"index":{"_index":"my_index2","_type":"formulas"}}
{"content":"some if words: dif difuse"}

Term vectors for the first document:
GET my_index2/formulas/AV9GZ3sSgkgblFY6zpK2/_termvectors?fields=content

Out:
{
"_index": "my_index2",
"_type": "formulas",
"_id": "AV9GZ3sSgkgblFY6zpK2",
"_version": 1,
"found": true,
"took": 0,
"term_vectors": {
"content": {
"field_statistics": {
"sum_doc_freq": 102,
"doc_count": 2,
"sum_ttf": 106
},
"terms": {
"(3": {
"term_freq": 1,
"tokens": [
{
"position": 46,
"start_offset": 15,
"end_offset": 17
}
]
},
"(3;": {
"term_freq": 1,
"tokens": [
{
"position": 47,
"start_offset": 15,
"end_offset": 18
}
]
},
... Omitting the rest because of max response lengths.
}
}
}

Now let's wrap this example up... Here is the match query I used earlier that returned both of your entries; it continues to do the same here.

In:

GET /my_index2/_search
{
  "query": {
    "match": {
      "content": "=if("
    }
  }
}

Out:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 2.9511943,
"hits": [
{
"_index": "my_index2",
"_type": "formulas",
"_id": "AV9GZ3sSgkgblFY6zpK2",
"_score": 2.9511943,
"_source": {
"content": "formula =IF(SUM(3;4;5))"
}
},
{
"_index": "my_index2",
"_type": "formulas",
"_id": "AV9GZ3sSgkgblFY6zpK3",
"_score": 0.30116585,
"_source": {
"content": "some if words: dif difuse"
}
}
]
}
}

So we see the same results, but why is that? In the query above, the same ngram analyzer is now applied to the input text as well, which means the two documents still share matching tokens!

In:

GET my_index2/_analyze
{
  "analyzer": "include_special_character_gram",
  "text": [
    "=if("
  ],
  "field": "t"
}

Out:
{
"tokens": [
{
"token": "=i",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "=if",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "=if(",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "if",
"start_offset": 1,
"end_offset": 3,
"type": "word",
"position": 3
},
{
"token": "if(",
"start_offset": 1,
"end_offset": 4,
"type": "word",
"position": 4
},
{
"token": "f(",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 5
}
]
}

If you run the request above, you will see the tokens generated from the query text. The key ingredient is to instead specify "keyword" as the query's analyzer, so that the entire query value becomes a single token that can match one of the indexed term vectors for the field.

In:

GET my_index2/_analyze
{
  "analyzer": "keyword",
  "text": [
    "=if("
  ]
}

Out:
{
"tokens": [
{
"token": "=if(",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
}
]
}

Let's see if that works...

In:

GET /my_index2/_search
{
  "query": {
    "match": {
      "content": {
        "query": "=if(",
        "analyzer": "keyword"
      }
    }
  }
}

Out:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.56074005,
"hits": [
{
"_index": "my_index2",
"_type": "formulas",
"_id": "AV9GZ3sSgkgblFY6zpK2",
"_score": 0.56074005,
"_source": {
"content": "formula =IF(SUM(3;4;5))"
}
}
]
}
}

So, based on the above, you can see how this works when we explicitly specify the keyword analyzer as the search analyzer against the ngram-analyzed field. Here is an update we can apply to the mapping to simplify our requests... (Note that you will need to delete and recreate the existing index, or reindex, for the analyzer change to take effect.)

PUT /my_index2/_mapping/formulas
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "include_special_character_gram",
      "search_analyzer": "keyword"
    }
  }
}
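The effect of this mapping can be sketched in Python, using the same approximate tokenizer as before (an illustration, not the real Lucene matching): documents are indexed as ngram tokens, the query stays a single token, and a match requires that whole token to appear among the indexed terms. One caveat worth labeling: the real keyword analyzer does not lowercase, so the query must already be lowercase (or you would define a custom keyword-tokenizer-plus-lowercase analyzer).

```python
def ngram_tokens(text, min_gram=2, max_gram=5):
    """Approximate index-time analysis: lowercase plus
    whitespace-bounded ngrams, mimicking the
    include_special_character_gram analyzer."""
    out = []
    for run in text.lower().split():
        for start in range(len(run)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(run):
                    out.append(run[start:start + size])
    return out

def keyword_match(doc_text, query_text):
    """Approximate search-time behavior with search_analyzer=keyword:
    the query is one unmodified token, and it matches only if it is
    among the document's indexed ngrams. (The keyword analyzer does
    not lowercase, so an uppercase query would miss.)"""
    return query_text in set(ngram_tokens(doc_text))

print(keyword_match("formula =IF(SUM(3;4;5))", "=if("))    # True
print(keyword_match("some if words: dif difuse", "=if("))  # False
```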

Now let's come back to the match query I originally used to show both documents being returned.

In:

GET /my_index2/_search
{
  "query": {
    "match": {
      "content": "=if("
    }
  }
}

Out:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.56074005,
"hits": [
{
"_index": "my_index2",
"_type": "formulas",
"_id": "AV9GZ3sSgkgblFY6zpK2",
"_score": 0.56074005,
"_source": {
"content": "formula =IF(SUM(3;4;5))"
}
}
]
}
}

Edit - as a simple_query_string query

In:

GET /my_index2/_search
{
  "query": {
    "simple_query_string": {
      "query": "=if\\(",
      "fields": ["content"]
    }
  }
}

Out:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.56074005,
"hits": [
{
"_index": "my_index2",
"_type": "formulas",
"_id": "AV9GZ3sSgkgblFY6zpK2",
"_score": 0.56074005,
"_source": {
"content": "formula =IF(SUM(3;4;5))"
}
}
]
}
}

There you have it. If you choose this route, you can obviously tweak the ngram sizes. This answer is lengthy enough already, so I won't attempt the other approaches that could also solve this, but I think having one working solution helps. What matters here is understanding what happens behind the scenes with the _all field and the interpretation of query strings.

Hope this helps, and thanks for the interesting question!

For "elasticsearch - simple_query_string with special characters such as ( and =", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46877483/
