
elasticsearch - Elasticsearch tokenization for international languages

Reposted. Author: 行者123. Updated: 2023-12-03 02:05:29

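I wanted to understand how Elasticsearch tokenizes languages other than English, so I tried the _analyze API it provides. But I can't make sense of the output at all. For example: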

GET myindex/_analyze?analyzer=hindi&text="में कहता हूँ और तुम सुनना "

The text above contains 6 words, so I expected at most 6 tokens (assuming none of them are stop words), but the output looks like this:
 {
"tokens": [
{
"token": "2350",
"start_offset": 3,
"end_offset": 7,
"type": "<NUM>",
"position": 1
},
{
"token": "2375",
"start_offset": 10,
"end_offset": 14,
"type": "<NUM>",
"position": 2
},
{
"token": "2306",
"start_offset": 17,
"end_offset": 21,
"type": "<NUM>",
"position": 3
},
{
"token": "2325",
"start_offset": 25,
"end_offset": 29,
"type": "<NUM>",
"position": 4
},
{
"token": "2361",
"start_offset": 32,
"end_offset": 36,
"type": "<NUM>",
"position": 5
},
{
"token": "2340",
"start_offset": 39,
"end_offset": 43,
"type": "<NUM>",
"position": 6
},
{
"token": "2366",
"start_offset": 46,
"end_offset": 50,
"type": "<NUM>",
"position": 7
},
{
"token": "2361",
"start_offset": 54,
"end_offset": 58,
"type": "<NUM>",
"position": 8
},
{
"token": "2370",
"start_offset": 61,
"end_offset": 65,
"type": "<NUM>",
"position": 9
},
{
"token": "2305",
"start_offset": 68,
"end_offset": 72,
"type": "<NUM>",
"position": 10
},
{
"token": "2324",
"start_offset": 76,
"end_offset": 80,
"type": "<NUM>",
"position": 11
},
{
"token": "2352",
"start_offset": 83,
"end_offset": 87,
"type": "<NUM>",
"position": 12
},
{
"token": "2340",
"start_offset": 91,
"end_offset": 95,
"type": "<NUM>",
"position": 13
},
{
"token": "2369",
"start_offset": 98,
"end_offset": 102,
"type": "<NUM>",
"position": 14
},
{
"token": "2350",
"start_offset": 105,
"end_offset": 109,
"type": "<NUM>",
"position": 15
},
{
"token": "2360",
"start_offset": 113,
"end_offset": 117,
"type": "<NUM>",
"position": 16
},
{
"token": "2369",
"start_offset": 120,
"end_offset": 124,
"type": "<NUM>",
"position": 17
},
{
"token": "2344",
"start_offset": 127,
"end_offset": 131,
"type": "<NUM>",
"position": 18
},
{
"token": "2344",
"start_offset": 134,
"end_offset": 138,
"type": "<NUM>",
"position": 19
},
{
"token": "2366",
"start_offset": 141,
"end_offset": 145,
"type": "<NUM>",
"position": 20
}
]
}

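So instead of six tokens, Elasticsearch produced 20, and every one of them has the type <NUM> (I don't know what that means).
I'm really confused about why this happens. Can someone explain what is going on? Am I doing something wrong, or is there something I don't understand?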

Best answer

How are you calling the Elasticsearch API? Your client may be mangling the Hindi characters before they reach the server.
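The numbers in the question's output are consistent with that explanation: read as decimal Unicode code points, all 20 token values fall in the Devanagari block and spell out the original sentence one character at a time. The sketch below checks this. It assumes the client rewrote each character as an HTML numeric character reference such as `&#2350;`; the question does not say which client was used, so treat that part as a guess. The offsets fit as well: in `"&#2350;&#2375;…` the digits `2350` start at offset 3 and `2375` at offset 10, matching the first two `start_offset` values.

```python
import html

# Token values copied from the question's _analyze output, in order.
tokens = [2350, 2375, 2306, 2325, 2361, 2340, 2366, 2361, 2370, 2305,
          2324, 2352, 2340, 2369, 2350, 2360, 2369, 2344, 2344, 2366]

# Interpret each number as a decimal Unicode code point.
decoded = "".join(chr(cp) for cp in tokens)
print(decoded)

# Check that every value lies inside the Devanagari block (U+0900..U+097F).
in_devanagari = all(0x0900 <= cp <= 0x097F for cp in tokens)
print(in_devanagari)

# Assumption: the client sent the text as HTML numeric character references,
# preceded by the opening double quote from the question's URL.
as_sent = "".join(f"&#{cp};" for cp in tokens[:3])
quoted = '"' + as_sent
print(quoted.find("2350"), quoted.find("2375"))  # digit offsets in the sent text

# Unescaping the references recovers the first word of the sentence.
print(html.unescape(as_sent))
```

If that is what happened, the analyzer only ever saw ASCII digits and punctuation, which would explain why each token is a number of type `<NUM>`.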

With curl on Linux it works fine for me (at least, Hindi characters show up in the result):

curl -XPOST 'http://localhost:9200/myindex/_analyze?analyzer=hindi&pretty' -d 'में कहता हूँ और तुम सुनना '
{
"tokens" : [ {
"token" : "कह",
"start_offset" : 4,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "हुं",
"start_offset" : 9,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "तुम",
"start_offset" : 16,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 5
}, {
"token" : "सुन",
"start_offset" : 20,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 6
} ]
}
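For a client other than curl, the same idea is to put the text in the request body as explicit UTF-8 bytes instead of relying on the client to encode a URL parameter. The following is a minimal sketch, not the answer's method: `build_analyze_request` is a hypothetical helper, the host and index are taken from the curl example, and the JSON body with `analyzer` and `text` fields is an assumption about the Elasticsearch version in use (the curl call above passes the analyzer as a URL parameter with a raw text body instead).

```python
import json
import urllib.request


def build_analyze_request(text, index="myindex", analyzer="hindi",
                          base_url="http://localhost:9200"):
    """Build (but do not send) a POST request for the _analyze endpoint.

    Hypothetical helper: the JSON body shape is an assumption about the
    Elasticsearch version; adjust it to match your cluster.
    """
    # ensure_ascii=False keeps the Devanagari characters as-is in the payload.
    payload = json.dumps({"analyzer": analyzer, "text": text},
                         ensure_ascii=False)
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=payload.encode("utf-8"),  # explicit UTF-8 bytes, no entity escaping
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


req = build_analyze_request("में कहता हूँ और तुम सुनना")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually send it (requires a running Elasticsearch node):
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Printing the body before sending makes it easy to see whether the characters are still Devanagari or have already been turned into something like `&#2350;`.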

Source: this question, "elasticsearch - Elasticsearch tokenization for international languages", comes from Stack Overflow: https://stackoverflow.com/questions/27204925/
