elasticsearch - preserve_original: keeping the original token in Elasticsearch


Reposted by: 行者123. Updated: 2023-12-02 22:27:29

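I have a token filter and an analyzer, defined as shown below. However, I can't get the original token preserved. For example, if I run _analyze on the word saint-louis, I get back only saintlouis, whereas I expected both saintlouis and saint-louis, because I set preserve_original to true. I am using ES 6.3.2 with Lucene 7.3.1.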

"analysis": {
  "filter": {
    "hyphenFilter": {
      "pattern": "-",
      "type": "pattern_replace",
      "preserve_original": "true",
      "replacement": ""
    }
  },
  "analyzer": {
    "whitespace_lowercase": {
      "filter": [
        "lowercase",
        "asciifolding",
        "hyphenFilter"
      ],
      "type": "custom",
      "tokenizer": "whitespace"
    }
  }
}

Best answer

It looks like preserve_original is not supported on the pattern_replace token filter, at least not in the version I am using.
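To make the difference concrete, here is a toy model in Python. It is not Elasticsearch or Lucene code, only a sketch of the behaviour described above: a plain replace filter emits exactly one token per input token, so the hyphenated form is lost. The function pattern_replace_keep_original is hypothetical and only illustrates what the question expected to happen.

```python
import re

def pattern_replace(tokens, pattern, replacement):
    """Toy 1:1 replace filter: every input token yields exactly one output token."""
    return [re.sub(pattern, replacement, tok) for tok in tokens]

def pattern_replace_keep_original(tokens, pattern, replacement):
    """Hypothetical variant: emit the original token, plus the rewritten one if it differs."""
    out = []
    for tok in tokens:
        replaced = re.sub(pattern, replacement, tok)
        out.append(tok)
        if replaced != tok:
            out.append(replaced)
    return out

# What a whitespace tokenizer followed by lowercase would pass along.
tokens = ["saint-louis"]

print(pattern_replace(tokens, "-", ""))                # ['saintlouis']
print(pattern_replace_keep_original(tokens, "-", ""))  # ['saint-louis', 'saintlouis']
```

The first call mirrors the single saintlouis token reported in the question; the second shows the two-token result that was expected.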

I used the following workaround:

Index definition

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "whitespace",
                    "type": "custom",
                    "filter": [
                        "lowercase",
                        "hyphen_filter"
                    ]
                }
            },
            "filter": {
                "hyphen_filter": {
                    "type": "word_delimiter",
                    "preserve_original": "true",
                    "catenate_words": "true"
                }
            }
        }
    }
}

For example, this tokenizes a word like anti-spam into antispam (hyphen removed), anti-spam (original preserved), anti, and spam.

Analyze API call to inspect the generated tokens

POST /_analyze

{ "text": "anti-spam", "analyzer": "my_analyzer" }
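The same call can be scripted. The sketch below only builds the HTTP request with Python's standard library and does not send it. Because my_analyzer is defined in an index's settings, the request would normally target that index (/<index>/_analyze) rather than the bare endpoint. The base URL http://localhost:9200 and the index name my_index are assumptions for illustration, not values from the original post.

```python
import json
import urllib.request

def build_analyze_request(base_url, index, text, analyzer):
    """Build (but do not send) a POST request to an index-scoped _analyze endpoint."""
    url = f"{base_url.rstrip('/')}/{index}/_analyze"
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Assumed values: a local node and a hypothetical index name.
req = build_analyze_request(
    "http://localhost:9200", "my_index", "anti-spam", "my_analyzer"
)
print(req.get_method(), req.full_url)

# Sending it requires a running cluster:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```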

Output of the Analyze API, i.e. the generated tokens

{
    "tokens": [
        {
            "token": "anti-spam",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "anti",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "antispam",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "spam",
            "start_offset": 5,
            "end_offset": 9,
            "type": "word",
            "position": 1
        }
    ]
}
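One way to read this output is to group the tokens by their position value. The following Python sketch does that on a copy of the response above; the helper name tokens_by_position is made up for this example.

```python
import json
from collections import defaultdict

# Copy of the Analyze API response shown above.
response = json.loads("""
{"tokens": [
  {"token": "anti-spam", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
  {"token": "anti",      "start_offset": 0, "end_offset": 4, "type": "word", "position": 0},
  {"token": "antispam",  "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
  {"token": "spam",      "start_offset": 5, "end_offset": 9, "type": "word", "position": 1}
]}
""")

def tokens_by_position(analyze_response):
    """Map each token position to the list of token strings emitted at it."""
    grouped = defaultdict(list)
    for tok in analyze_response["tokens"]:
        grouped[tok["position"]].append(tok["token"])
    return dict(grouped)

print(tokens_by_position(response))
# {0: ['anti-spam', 'anti', 'antispam'], 1: ['spam']}
```

The original anti-spam, the catenated antispam, and the first part anti all sit at position 0, while spam follows at position 1.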

For this topic (elasticsearch - preserve_original: keeping the original token in Elasticsearch), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60441801/
