
elasticsearch - Confused about `search_as_you_type` ngram subfields

Reposted. Author: 行者123. Updated: 2023-12-03 01:10:50

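I am trying to add search-as-you-type functionality to a field named email_address in Elasticsearch. My understanding from the docs is that if I create a search_as_you_type field, it should automatically create ngram subfields optimized for finding partial matches.
However, it does not seem to work the way I expect, and I do not seem to be getting the benefit I expected from this special field type.
First, I created an index with the following: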

$ curl -s -H 'Content-Type: application/json' -XPUT http://localhost:9200/mytestindex -d '
{
"mappings": {
"properties": {
"email_address": {"type": "search_as_you_type"}
}
}
}
'
When I request the mapping of the newly created email field, I see the following:
$ curl -s -H 'Content-Type: application/json' http://localhost:9200/mytestindex/_mapping/field/email_address | json_pp
{
"mytestindex" : {
"mappings" : {
"email_address" : {
"full_name" : "email_address",
"mapping" : {
"email_address" : {
"max_shingle_size" : 3,
"type" : "search_as_you_type"
}
}
}
}
}
}
Finally, I populated some sample data:
$ curl -s -H 'Content-Type: application/json' http://localhost:9200/mytestindex/_doc -d '
{"email_address": "sam@example.com"}'

$ curl -s -H 'Content-Type: application/json' http://localhost:9200/mytestindex/_doc -d '
{"email_address": "sally@example.com"}'

$ curl -s -H 'Content-Type: application/json' http://localhost:9200/mytestindex/_doc -d '
{"email_address": "jane@example.com"}'

$ curl -s -H 'Content-Type: application/json' http://localhost:9200/mytestindex/_doc -d '
{"email_address": "samantha@example.com"}'
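The official docs recommend a multi_match search of type bool_prefix over these fields: email_address, email_address._2gram, and email_address._3gram. Curious about the subfields, I tested a search that includes only the subfields, but could not get any results: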
$ curl -s -H 'Content-Type: application/json' http://localhost:9200/mytestindex/_search -d '
{
"query": {
"multi_match": {
"query": "sa",
"type": "bool_prefix",
"fields": [
"email_address._2gram",
"email_address._3gram"
]
}
}
}
' | json_pp

{
"hits" : {
"hits" : [],
"max_score" : null,
"total" : {
"value" : 0,
"relation" : "eq"
}
},
"took" : 4,
"_shards" : {
"skipped" : 0,
"successful" : 1,
"total" : 1,
"failed" : 0
},
"timed_out" : false
}
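I tried partial queries of various lengths (s, sa, sam, etc.), but I never got any results.
When I run the same search but include only the email_address field itself, I get all the results I expect: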
$ curl -s -H 'Content-Type: application/json' http://localhost:9200/mytestindex/_search -d '
{
"query": {
"multi_match": {
"query": "sa",
"type": "bool_prefix",
"fields": [
"email_address"
]
}
}
}
' | json_pp
{
"timed_out" : false,
"hits" : {
"max_score" : 1,
"total" : {
"relation" : "eq",
"value" : 3
},
"hits" : [
{
"_index" : "mytestindex",
"_id" : "gEbkCXUBC6_J-EeLAygM",
"_score" : 1,
"_type" : "_doc",
"_source" : {
"email_address" : "sam@example.com"
}
},
{
"_index" : "mytestindex",
"_source" : {
"email_address" : "sally@example.com"
},
"_score" : 1,
"_type" : "_doc",
"_id" : "gUbkCXUBC6_J-EeLWigu"
},
{
"_index" : "mytestindex",
"_id" : "jUb5CXUBC6_J-EeL1ij1",
"_type" : "_doc",
"_score" : 1,
"_source" : {
"email_address" : "samantha@example.com"
}
}
]
},
"took" : 2,
"_shards" : {
"failed" : 0,
"skipped" : 0,
"successful" : 1,
"total" : 1
}
}
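As a result, I do not understand what benefit the _2gram and _3gram subfields provide. Have I set something up incorrectly? Or am I confused about the actual purpose of these fields?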

Best answer

The search_as_you_type field type is a text-like field that is optimized to provide support for queries that serve an as-you-type completion use case.


Adding a working example with index data, mapping, search query, and search result.
Index mapping:
{
"mappings": {
"properties": {
"title": {
"type": "search_as_you_type"
}
}
}
}
Index data:
{"title": "how shingles are actually used"}
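Analyze API
The default tokenizer in Elasticsearch is the standard tokenizer, which uses grammar-based tokenization.
The individual tokens generated for the text are: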
{
"tokens": [
{
"token": "how",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "shingles",
"start_offset": 4,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "are",
"start_offset": 13,
"end_offset": 16,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "actually",
"start_offset": 17,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "used",
"start_offset": 26,
"end_offset": 30,
"type": "<ALPHANUM>",
"position": 4
}
]
}
Generating 3-word shingles:
POST /_analyze

{
"tokenizer": "standard",
"filter": [
{
"type": "shingle",
"min_shingle_size": 3,
"max_shingle_size": 3,
"output_unigrams":false
}
],
"text": "how shingles are actually used"
}
The tokens generated are:
{
"tokens": [
{
"token": "how shingles are",
"start_offset": 0,
"end_offset": 16,
"type": "shingle",
"position": 0
},
{
"token": "shingles are actually",
"start_offset": 4,
"end_offset": 25,
"type": "shingle",
"position": 1
},
{
"token": "are actually used",
"start_offset": 13,
"end_offset": 30,
"type": "shingle",
"position": 2
}
]
}
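The shingle step above can be sketched in a few lines of Python. This is only an illustration of the idea, not Elasticsearch's implementation: the helper name `shingles` is hypothetical, and plain whitespace splitting stands in for the standard tokenizer (for this sentence it happens to yield the same five tokens as the _analyze output).

```python
def shingles(tokens, size):
    """Return every run of `size` consecutive tokens, joined by a space."""
    # If there are fewer tokens than `size`, the range is empty and so is the result.
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Assumption: whitespace splitting approximates the standard tokenizer here.
tokens = "how shingles are actually used".split()

three_grams = shingles(tokens, 3)
for shingle in three_grams:
    print(shingle)
```

Each printed line corresponds to one "token" entry in the _analyze response above. Note that a text with fewer than three tokens produces no 3-word shingle at all.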
Search query:

title._3gram - Wraps the analyzer of my_field with a shingle token filter of shingle size 3

{
"query": {
"multi_match": {
"query": "shingles are actually",
"type": "bool_prefix",
"fields": [
"title._3gram"
]
}
}
}
Search result:
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"title": "how shingles are actually used"
}
}
]
In your case, given "text": "samantha@example.com", the individual tokens generated are samantha and example.com. When 2-word shingles are created, the token generated is:
{
"tokens": [
{
"token": "samantha example.com",
"start_offset": 0,
"end_offset": 20,
"type": "shingle",
"position": 0
}
]
}
Therefore, when you search for sa, it does not match, because no token corresponding to it is generated.
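To make the mismatch concrete, here is a small Python sketch. It assumes, as in the _analyze request shown earlier (min and max shingle size equal, output_unigrams false), that a shingle subfield only ever holds shingles of exactly its configured size. The token lists are hard-coded from this answer rather than computed, and the helper `shingles` is hypothetical.

```python
def shingles(tokens, size):
    """Return every run of `size` consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Tokens reported above for "samantha@example.com" (hard-coded, not computed).
indexed_tokens = ["samantha", "example.com"]
# The query text "sa" is a single token.
query_tokens = ["sa"]

for size in (2, 3):
    indexed = shingles(indexed_tokens, size)
    queried = shingles(query_tokens, size)
    print(f"_{size}gram indexed={indexed} query={queried}")
```

Under these assumptions the one-word query yields no shingle of size 2 or 3, so there is nothing to look up in the subfields, and a two-token address cannot produce any 3-word shingle either.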
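When you run the multi_match query on the email_address field itself, it matches because of "type": "bool_prefix", which treats the final query term as a prefix. See the Elasticsearch documentation on the Match bool prefix query to learn more.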
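A rough Python model of that behavior, for illustration only: every query term except the last is treated as an exact term clause, the last term as a prefix clause, and a document matches if any clause matches. The function `bool_prefix_matches` is hypothetical, and the per-document token lists are assumptions that follow the tokenization described in this answer.

```python
def bool_prefix_matches(query_tokens, doc_tokens):
    """Approximate match_bool_prefix: exact terms plus a trailing prefix, OR-ed together."""
    if not query_tokens:
        return False
    *terms, last = query_tokens
    exact_hit = any(term in doc_tokens for term in terms)
    prefix_hit = any(token.startswith(last) for token in doc_tokens)
    return exact_hit or prefix_hit


# Assumed root-field tokens for the four sample documents from the question.
docs = {
    "sam@example.com": ["sam", "example.com"],
    "sally@example.com": ["sally", "example.com"],
    "jane@example.com": ["jane", "example.com"],
    "samantha@example.com": ["samantha", "example.com"],
}

hits = [address for address, tokens in docs.items() if bool_prefix_matches(["sa"], tokens)]
print(hits)
```

In this model the query sa selects the sam, sally, and samantha addresses and skips jane, which lines up with the three hits the question reports for the email_address field.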

If you want to query with sa and get all the results, you can use the Completion suggester, or look into the UAX URL Email tokenizer.
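One way the UAX URL Email tokenizer might be wired in is sketched below as a Python dict that prints an index-creation body. This is an assumption about how to apply the suggestion, not something the answer spells out: the analyzer name `email_analyzer` is made up, and the resulting tokens should be checked with the _analyze API before relying on it.

```python
import json

# Hypothetical index body: a custom analyzer built on the uax_url_email
# tokenizer, which keeps an email address together as a single token.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "email_analyzer": {  # name chosen for this example only
                    "type": "custom",
                    "tokenizer": "uax_url_email",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "email_address": {
                "type": "search_as_you_type",
                "analyzer": "email_analyzer",
            }
        }
    },
}

# Print the JSON that could be sent with -XPUT when creating the index.
print(json.dumps(index_body, indent=2))
```

With the whole address kept as one token, the shingle subfields would presumably still be empty for a single address, so the prefix search on the root email_address field is what does the work.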

Regarding "elasticsearch - Confused about `search_as_you_type` ngram subfields", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64270635/
