
elasticsearch - How do I correctly handle multi-word synonym expansion with Elasticsearch?

Reposted. Author: 行者123. Updated: 2023-12-02 23:53:25

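I have the following synonym expansion:

suco => suco, refresco, bebida de soja

What I want is for searches to be tokenized like this:

A search for "suco de laranja" should be tokenized as ["suco", "laranja", "refresco", "bebida de soja"].

Instead, I get ["suco", "laranja", "refresco", "bebida", "soja"].

Note that "de" is a stopword. I want it to be dropped from queries, so that "bebida de laranja" becomes ["bebida", "laranja"]. But I do not want it to be dropped when the synonym itself is tokenized, so "bebida de soja" should stay a single token, "bebida de soja".

My settings: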
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_br": {
          "type": "synonym",
          "synonyms": [
            "suco => suco, refresco, bebida de soja"
          ]
        },
        "brazilian_stop": {
          "type": "stop",
          "stopwords": "_brazilian_"
        }
      },
      "analyzer": {
        "synonyms": {
          "filter": [
            "synonym_br",
            "lowercase",
            "brazilian_stop",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}

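Best answer

I suggest the following two changes. The first relates directly to your question; the second is a recommendation.

  • Instead of expanding one word into several synonyms, do the opposite: map all of the synonyms to a single word. That is, change "suco => suco, refresco, bebida de soja" to "suco, refresco, bebida de soja => suco".
  • Change the order of the filters in the synonyms analyzer: put lowercase before synonym_br. This ensures that letter case does not affect the synonym_br token filter.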

So the final settings will be:
    {
      "settings": {
        "analysis": {
          "filter": {
            "synonym_br": {
              "type": "synonym",
              "synonyms": [
                "suco, refresco, bebida de soja => suco"
              ]
            },
            "brazilian_stop": {
              "type": "stop",
              "stopwords": "_brazilian_"
            }
          },
          "analyzer": {
            "synonyms": {
              "filter": [
                "lowercase",
                "synonym_br",
                "brazilian_stop",
                "asciifolding"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        }
      }
    }
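
One way to check these settings against a real cluster is Elasticsearch's _analyze API, which returns the tokens an analyzer produces for a given text. The following is only a sketch using the Python standard library. The cluster address (http://localhost:9200, no authentication) and the index name bebidas are assumptions made for illustration, not part of the original answer. The request builders run offline; only send() and main() need a running cluster.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local cluster without authentication
INDEX = "bebidas"                 # hypothetical index name, chosen for this sketch

# The final settings from the answer, written as a Python dict.
SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "synonym_br": {
                    "type": "synonym",
                    "synonyms": ["suco, refresco, bebida de soja => suco"],
                },
                "brazilian_stop": {"type": "stop", "stopwords": "_brazilian_"},
            },
            "analyzer": {
                "synonyms": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "synonym_br",
                        "brazilian_stop",
                        "asciifolding",
                    ],
                }
            },
        }
    }
}


def build_request(method, path, body):
    """Build (but do not send) a JSON request against the cluster."""
    return urllib.request.Request(
        url=f"{ES_URL}{path}",
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json"},
    )


def create_index_request():
    """PUT /<index> with the analysis settings."""
    return build_request("PUT", f"/{INDEX}", SETTINGS)


def analyze_request(text):
    """POST /<index>/_analyze using the custom 'synonyms' analyzer."""
    return build_request(
        "POST", f"/{INDEX}/_analyze", {"analyzer": "synonyms", "text": text}
    )


def send(request):
    """Send a request and decode the JSON response (needs a running cluster)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def main():
    """Create the index, then print the tokens for a few sample inputs."""
    send(create_index_request())
    for text in ("bebida de soja", "de soja", "suco de laranja"):
        tokens = [t["token"] for t in send(analyze_request(text))["tokens"]]
        print(text, "->", tokens)
```

Calling main() against a running cluster creates the index and prints the token list for each sample text, which can then be compared with the traces in this answer.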

How does this work?

For the input bebida de soja, the filters are applied in the following order:
    Filter            Result tokens
    ====================================
    lowercase         bebida, de, soja
    synonym_br        suco               <------- all the tokens above (including their positions) exactly match a synonym
    brazilian_stop    suco
    asciifolding      suco

Now let's see brazilian_stop at work. For that we need an input that does not match a synonym but does contain de, for example de soja:
    Filter            Result tokens
    ====================================
    lowercase         de, soja
    synonym_br        de, soja           <------- none of the tokens (alone or combined, including their positions) matches any synonym
    brazilian_stop    soja               <------- de is removed because it is a stopword
    asciifolding      soja
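
The two traces above can be reproduced with a toy model of the filter chain. This is plain Python, not Elasticsearch code, and it rests on several simplifying assumptions: whitespace splitting stands in for the standard tokenizer, a handful of words stands in for the built-in _brazilian_ stopword list, the synonym step is modelled as a greedy longest-phrase match, and token positions are ignored. It is meant only to show why the order lowercase, synonym_br, brazilian_stop lets the multi-word synonym match before de is removed.

```python
import unicodedata

# Assumption: illustrative subset, not the real _brazilian_ stopword list.
STOPWORDS = {"de", "da", "do", "e", "a", "o"}

# Contraction rule from the answer: every phrase on the left maps to "suco".
SYNONYM_RULES = {
    ("suco",): "suco",
    ("refresco",): "suco",
    ("bebida", "de", "soja"): "suco",
}


def lowercase(tokens):
    return [t.lower() for t in tokens]


def synonym_br(tokens):
    """Replace the longest matching phrase at each position with its target."""
    out, i = [], 0
    max_len = max(len(phrase) for phrase in SYNONYM_RULES)
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in SYNONYM_RULES:
                out.append(SYNONYM_RULES[phrase])
                i += n
                break
        else:
            # No rule matched here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def brazilian_stop(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def asciifolding(tokens):
    """Drop combining marks after NFKD decomposition (rough stand-in)."""
    return [
        "".join(
            c for c in unicodedata.normalize("NFKD", t)
            if not unicodedata.combining(c)
        )
        for t in tokens
    ]


def analyze(text):
    tokens = text.split()  # stand-in for the standard tokenizer
    for token_filter in (lowercase, synonym_br, brazilian_stop, asciifolding):
        tokens = token_filter(tokens)
    return tokens


print(analyze("bebida de soja"))   # ['suco']
print(analyze("de soja"))          # ['soja']
print(analyze("Suco de Laranja"))  # ['suco', 'laranja']
```

In this model, moving brazilian_stop ahead of synonym_br would strip de first, so the three-word phrase would never match; that is the ordering concern the answer addresses.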

The original question, "elasticsearch - How do I correctly handle multi-word synonym expansion with Elasticsearch?", is on Stack Overflow: https://stackoverflow.com/questions/55944061/
