elasticsearch - 如何将标准 token 生成器与preserve

elasticsearch - 如何将标准 token 生成器与preserve_original一起使用？

转载作者：行者123 更新时间：2023-12-03 01:35:37

我创建了2个自定义分析器，如下所示，但两者均无法按我的要求工作。
这是我想要的倒排索引
例如;对于单词reb-tn2000xxxl我需要
我的反向索引中的reb，tn2000xxl和reb-tn2000xxxl。

{  
   "analysis":{  
      "filter":{  
         "my_word_delimiter":{  
            "split_on_numerics":"true",
            "generate_word_parts":"true",
            "preserve_original":"true",
            "generate_number_parts":"true",
            "catenate_all":"true",
            "split_on_case_change":"true",
            "type":"word_delimiter"
         }
      },
      "analyzer":{  
         "my_analyzer":{  
            "filter":[  
               "standard",
               "lowercase",
               "my_word_delimiter"
            ],
            "type":"custom",
            "tokenizer":"whitespace"
         },
         "standard_caseinsensitive":{  
            "filter":[  
               "standard",
               "lowercase"
            ],
            "type":"custom",
            "tokenizer":"keyword"
         },
         "my_delimiter":{  
            "filter":[  
               "lowercase",
               "my_word_delimiter"
            ],
            "type":"custom",
            "tokenizer":"standard"
         }
      }
   }
}

如果我使用实现 my_analyzer token 生成器的 whitespace，如果我使用curl检查结果将如下所示

  curl -XGET "index/_analyze?analyzer=my_analyzer&pretty=true" -d "reb-tn2000xxxl"
{
  "tokens" : [ {
    "token" : "reb-tn2000xxxl",
    "start_offset" : 0,
    "end_offset" : 14,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "reb",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "rebtn2000xxxl",
    "start_offset" : 0,
    "end_offset" : 14,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "tn",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2000",
    "start_offset" : 6,
    "end_offset" : 10,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "xxxl",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "word",
    "position" : 3
  } ]
}

所以在这里我缺少 tn2000xxxl拆分，如果我使用 standard标记器而不是 whitespace可以获得，但是问题是一旦我使用 my_delimiter自定义分析器之类的标准就可以了。我在倒排索引中没有原始值。看来 standard tokinezer和 preserve_original过滤器一起无法使用。我在某处读到，因为标准 token 生成器在应用过滤器之前已经在原始对象上拆分了，所以这就是为什么原始对象不再是相同的原因。但是如何实现此任务以防止像标准 token 生成器一样拆分原始文档？

curl -XGET "index/_analyze?analyzer=my_delimiter&pretty=true" -d "reb-tn2000xxxl"
{  
   "tokens":[  
      {  
         "token":"reb",
         "start_offset":0,
         "end_offset":3,
         "type":"<ALPHANUM>",
         "position":0
      },
      {  
         "token":"tn2000xxxl",
         "start_offset":4,
         "end_offset":14,
         "type":"<ALPHANUM>",
         "position":1
      },
      {  
         "token":"tn",
         "start_offset":4,
         "end_offset":6,
         "type":"<ALPHANUM>",
         "position":1
      },
      {  
         "token":"tn2000xxxl",
         "start_offset":4,
         "end_offset":14,
         "type":"<ALPHANUM>",
         "position":1
      },
      {  
         "token":"2000",
         "start_offset":6,
         "end_offset":10,
         "type":"<ALPHANUM>",
         "position":2
      },
      {  
         "token":"xxxl",
         "start_offset":10,
         "end_offset":14,
         "type":"<ALPHANUM>",
         "position":3
      }
   ]
}

最佳答案

在Elasticsearch中，您可以在映射上具有多个字段。您描述的行为实际上很常见。您可以使用text分析器和standard字段来分析您的主要keyword字段。这是文档中使用多字段的映射示例。 https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": { 
              "type":  "keyword"
            }
          }
        }
      }
    }
  }
}