elasticsearch - How to strip apostrophes?

Reposted. Author: 行者123. Last updated: 2023-12-03 01:48:58

As defined here:

The apostrophe token filter strips all characters after an apostrophe, including the apostrophe itself.
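To make that definition concrete, here is a minimal Python sketch of what the filter is described as doing to each token it receives. It is an illustration only, not Lucene's or Elasticsearch's actual code; the function name `strip_after_apostrophe` is hypothetical, and treating the typographic apostrophe (U+2019) the same as the ASCII one is an assumption.

```python
# Illustrative emulation only; not Lucene's or Elasticsearch's actual code.
# Assumption: both the ASCII and the typographic apostrophe are treated alike.
APOSTROPHES = ("'", "\u2019")


def strip_after_apostrophe(token: str) -> str:
    """Return the token truncated at its first apostrophe, if it has one."""
    cut = len(token)
    for ch in APOSTROPHES:
        idx = token.find(ch)
        if idx != -1:
            cut = min(cut, idx)
    return token[:cut]


# The filter works on tokens that a tokenizer has already produced.
tokens = ["apple", "banana'orange", "kiwi"]
stripped = [strip_after_apostrophe(t) for t in tokens]
print(stripped)  # ['apple', 'banana', 'kiwi']
```

Note that the sketch operates on individual tokens, so it only has an effect if the filter is actually part of the analysis chain that runs.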



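I am trying to strip apostrophes and the characters that follow them. When there is only a single apostrophe, the filter does not strip anything at all. Likewise, when there are multiple consecutive apostrophes, the word is split at that point, but nothing after the apostrophes is stripped. Clearly I must be missing something.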

Input with a single apostrophe:
POST localhost:9200/_analyze?
{
"filter": ["apostrophe"],
"text": "apple banana'orange kiwi"
}

Output:
{
"tokens": [
{
"token": "apple",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "banana'orange",
"start_offset": 6,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "kiwi",
"start_offset": 20,
"end_offset": 24,
"type": "<ALPHANUM>",
"position": 2
}
]
}

Input with multiple consecutive apostrophes:
{
"filter": ["apostrophe"],
"text": "apple banana''orange kiwi"
}

Output:
{
"tokens": [
{
"token": "apple",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "banana",
"start_offset": 6,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "orange",
"start_offset": 14,
"end_offset": 20,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "kiwi",
"start_offset": 21,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 3
}
]
}

Best answer

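If you specify only a token filter, it will not work: the standard analyzer kicks in and tokenizes your input, and the apostrophe token filter is ignored. Adding the explain parameter gives you more insight into what is going on: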

curl -XPOST 'localhost:9200/_analyze?pretty&filter=apostrophe&explain' -d "apple banana'orange kiwi"
{
"detail" : {
"custom_analyzer" : false,
"analyzer" : {
"name" : "standard",
"tokens" : [ {
"token" : "apple",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0,
"bytes" : "[61 70 70 6c 65]",
"positionLength" : 1
}, {
"token" : "banana'orange",
"start_offset" : 6,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 1,
"bytes" : "[62 61 6e 61 6e 61 27 6f 72 61 6e 67 65]",
"positionLength" : 1
}, {
"token" : "kiwi",
"start_offset" : 20,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 2,
"bytes" : "[6b 69 77 69]",
"positionLength" : 1
} ]
}
}
}

As you can see, the request above only uses the standard analyzer.

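To fix this, you simply need to specify at least a tokenizer. If you use the standard tokenizer, it works as expected. You can see that you now have a custom analyzer made up of the standard tokenizer and the apostrophe token filter, and the filter now does its job correctly.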
curl -XPOST 'localhost:9200/_analyze?pretty&tokenizer=standard&filter=apostrophe&explain' -d "apple banana'orange kiwi"
{
"detail" : {
"custom_analyzer" : true,
"charfilters" : [ ],
"tokenizer" : {
"name" : "standard",
"tokens" : [ {
"token" : "apple",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0,
"bytes" : "[61 70 70 6c 65]",
"positionLength" : 1
}, {
"token" : "banana'orange",
"start_offset" : 6,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 1,
"bytes" : "[62 61 6e 61 6e 61 27 6f 72 61 6e 67 65]",
"positionLength" : 1
}, {
"token" : "kiwi",
"start_offset" : 20,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 2,
"bytes" : "[6b 69 77 69]",
"positionLength" : 1
} ]
},
"tokenfilters" : [ {
"name" : "apostrophe",
"tokens" : [ {
"token" : "apple",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0,
"bytes" : "[61 70 70 6c 65]",
"positionLength" : 1
}, {
"token" : "banana",
"start_offset" : 6,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 1,
"bytes" : "[62 61 6e 61 6e 61]",
"positionLength" : 1
}, {
"token" : "kiwi",
"start_offset" : 20,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 2,
"bytes" : "[6b 69 77 69]",
"positionLength" : 1
} ]
} ]
}
}
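The same fix can be applied to the JSON body used in the question by adding a `tokenizer` entry next to `filter`. The Python sketch below only builds and prints such a body; the helper name `build_analyze_body` is hypothetical, and it assumes an Elasticsearch version whose `_analyze` API accepts `tokenizer`, `filter`, `text` and `explain` in the request body, as the question's own request suggests.

```python
import json


def build_analyze_body(text: str, explain: bool = False) -> dict:
    """Build an _analyze request body that names a tokenizer explicitly.

    Hypothetical helper for illustration. Naming a tokenizer is what turns
    the request into a custom analysis chain, so the filter is applied.
    """
    body = {
        "tokenizer": "standard",   # without this, the default analyzer is used
        "filter": ["apostrophe"],
        "text": text,
    }
    if explain:
        body["explain"] = True     # ask for per-stage token details
    return body


body = build_analyze_body("apple banana'orange kiwi", explain=True)
print(json.dumps(body, indent=2))
```

The printed JSON could then be sent as the body of a POST to `localhost:9200/_analyze`, the endpoint used throughout this thread.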

Regarding "elasticsearch - How to strip apostrophes?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42004448/
