
elasticsearch - Elasticsearch mapping with numeric tokens

Reposted. Author: 行者123. Updated: 2023-12-02 23:44:09

I have the mapping below, and it works correctly:

{
"settings": {
"index": {
"number_of_shards": "5",
"number_of_replicas": "0",
"analysis": {
"filter": {
"stemmer_plural_portugues": {
"name": "minimal_portuguese",
"stopwords" : ["http", "https", "ftp", "www"],
"type": "stemmer"
},


"synonym_filter": {
"type": "synonym",
"lenient": true,
"synonyms_path": "analysis/synonym.txt",
"updateable" : true

},


"shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3
}

},

"analyzer": {
"analyzer_customizado": {
"filter": [
"lowercase",
"stemmer_plural_portugues",
"asciifolding",
"synonym_filter",
"shingle_filter"

],
"tokenizer": "lowercase"
}
}

}
}
},
"mappings": {
"properties": {

"id": {
"type": "long"
},
"data": {
"type": "date"
},
"quebrado": {
"type": "byte"

},
"pgrk": {
"type": "integer"
},
"url_length": {
"type": "integer"
},
"title": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"description": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"url": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
}
}
}
}
I index the following document:
{
"title": "rocket 1960",
"description": "space",
"url": "www.nasa.com"
}
If I run the following query with the AND operator, it finds the document as expected, because every search term is present in the document.
{
"from": 0,
"size": 10,

"query": {


"multi_match": {
"query": "space nasa rocket",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}

}
}
But if I also include "1960" in the search, as in the query below, nothing is returned:
{
"from": 0,
"size": 10,

"query": {


"multi_match": {
"query": "1960 space nasa rocket",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}

}
}
I found that my "lowercase" tokenizer does not generate tokens for numbers. So I changed the tokenizer to "standard", and the numeric token 1960 was generated.
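The tokenizer behaviour described here can be checked with the _analyze API. The request body below is a minimal sketch, meant to be sent as POST /_analyze; the sample text is the title of the document indexed above.

```json
{
  "tokenizer": "lowercase",
  "text": "rocket 1960"
}
```

The lowercase tokenizer splits on every non-letter character, so the response should contain only the token rocket. Replacing "lowercase" with "standard" in the same request should add the token 1960.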
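But the query still returns no results, because the URL field containing the link www.nasa.com no longer produces the tokens "www nasa com"; the only token produced is the whole link www.nasa.com.
The query only works when the full URL www.nasa.com is entered, as shown below: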
{
"from": 0,
"size": 10,

"query": {


"multi_match": {
"query": "1960 space www.nasa.com rocket",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}

}
}
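The same kind of _analyze check (again a sketch, sent as POST /_analyze) illustrates why only the full URL matches once the tokenizer is "standard":

```json
{
  "tokenizer": "standard",
  "text": "www.nasa.com"
}
```

Consistent with the behaviour described above, the standard tokenizer is expected to return www.nasa.com as a single token, since it does not break on a dot between letters.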
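If I define a separate analyzer with the "lowercase" tokenizer just for the URL field, the link www.nasa.com is again split into the separate tokens "www nasa com".
But then the query below finds nothing, because the URL field's analyzer differs from the one used by the title and description fields. The query only works with the OR operator, and I need the AND operator.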
{
"from": 0,
"size": 10,

"query": {


"multi_match": {
"query": "1960 space nasa rocket",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}

}
}
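I cannot use ngrams in my mapping, because I use the phrase suggester, and with ngrams the suggestions are built from hundreds of tokens, which makes them inaccurate.
Does anyone know a solution that lets my mapping generate numeric tokens in the title and description fields, while the URL field keeps splitting site links into multiple tokens "www nasa com" instead of keeping the whole link "www.nasa.com", and that lets my query search all fields at once with the AND operator?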

Best answer

If I put it in the search also "1960" as the query below does not return anything


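In the index mapping below I removed synonym_filter. After removing it, indexing the sample document, and running the same search query as in your question, I get the desired result.
Index mapping: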
 {
"settings": {
"index": {
"number_of_shards": "5",
"number_of_replicas": "0",
"analysis": {
"filter": {
"stemmer_plural_portugues": {
"name": "minimal_portuguese",
"stopwords": [
"http",
"https",
"ftp",
"www"
],
"type": "stemmer"
},
"shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3
}
},
"analyzer": {
"analyzer_customizado": {
"filter": [
"lowercase",
"stemmer_plural_portugues",
"asciifolding",
"shingle_filter"
],
"tokenizer": "lowercase"
}
}
}
}
},
"mappings": {
"properties": {
"id": {
"type": "long"
},
"data": {
"type": "date"
},
"quebrado": {
"type": "byte"
},
"pgrk": {
"type": "integer"
},
"url_length": {
"type": "integer"
},
"title": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"description": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"url": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
}
}
}
}
Search query:
    {
"from": 0,
"size": 10,
"query": {
"multi_match": {
"query": "1960 space nasa rocket",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}
}
}
Search result:
"hits": [
{
"_index": "my-index",
"_type": "_doc",
"_id": "1",
"_score": 0.9370217,
"_source": {
"title": "rocket 1960",
"description": "space",
"url": "www.nasa.com"
}
}
]
As @Gibbs mentioned, I think there is a problem with synonym_filter, so it would help if you shared synonym.txt; otherwise, the search query runs fine.
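Independently of the synonym filter, the tokenizer requirement from the question (numbers kept, URLs split on dots) can be explored with _analyze. The sketch below is an assumption, not part of the mapping above: it swaps in the built-in char_group tokenizer, configured to split on whitespace, punctuation and symbols, followed by the lowercase filter.

```json
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": ["whitespace", "punctuation", "symbol"]
  },
  "filter": ["lowercase"],
  "text": "rocket 1960 www.nasa.com"
}
```

Sent as POST /_analyze, this should produce the tokens rocket, 1960, www, nasa and com. Because a single analyzer built this way could serve title, description and url alike, cross_fields with the AND operator would still treat the three fields as one group. Whether it suits the phrase suggester would need testing.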
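Update 1: (including synonym_filter)
If you want to keep the synonym token filter, leave the index mapping the same as yours with one change: set "updateable" to false in the filter definition, i.e.:
 "synonym_filter": {
"type": "synonym",
"lenient": true,
"synonyms_path": "analysis/synonym.txt",
"updateable": false
},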

You set your synonym filter to "updateable", presumably because you want to change synonyms without having to close and reopen the index but instead use the reload API. Updatable synonyms restrict the analyzer they are used in to be only used at search time.


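For a full explanation of this, you can refer to this ES discussion.
Using the same search query as above (after changing the mapping), you will get the desired result.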

However, if you still want to set "updateable" : true, you can refer to the official documentation of the Reload search analyzers API.
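If "updateable": true is kept, the filter may only appear in an analyzer that is used at search time. A minimal sketch of that arrangement follows. It assumes the index-time analyzer analyzer_customizado is defined without synonym_filter; the name analyzer_busca is hypothetical, the stemmer and shingle filters are left out for brevity, and only the title field is shown.

```json
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym_filter": {
            "type": "synonym",
            "lenient": true,
            "synonyms_path": "analysis/synonym.txt",
            "updateable": true
          }
        },
        "analyzer": {
          "analyzer_customizado": {
            "tokenizer": "lowercase",
            "filter": ["lowercase", "asciifolding"]
          },
          "analyzer_busca": {
            "tokenizer": "lowercase",
            "filter": ["lowercase", "asciifolding", "synonym_filter"]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "analyzer_customizado",
        "search_analyzer": "analyzer_busca"
      }
    }
  }
}
```

After editing analysis/synonym.txt, a request to POST /my-index/_reload_search_analyzers (my-index being the index name shown in the search result) should make the search analyzer pick up the new synonyms without closing and reopening the index.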

Regarding elasticsearch - Elasticsearch mapping with numeric tokens, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/62599759/
