
elasticsearch - Elasticsearch context suggester stop words

Repost. Author: 行者123. Updated: 2023-12-03 00:23:58

Is there a way to analyze a field that is passed to the context suggester?
For example, if I have this in my mapping:

mappings: {
  myitem: {
    title: {type: 'string'},
    content: {type: 'string'},
    user: {type: 'string', index: 'not_analyzed'},
    suggest_field: {
      type: 'completion',
      payloads: false,
      context: {
        user: {
          type: 'category',
          path: 'user'
        },
      }
    }
  }
}

and I index this document:
POST /myindex/myitem/1
{
  title: "The Post Title",
  content: ...,
  user: 123,
  suggest_field: {
    input: "The Post Title",
    context: {
      user: 123
    }
  }
}

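I'd like to analyze the input first: split it into separate words, run it through a lowercase filter and then a stop word filter, so that the context suggester actually gets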
suggest_field: {
  input: ["post", "title"],
  context: {
    user: 123
  }
}

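I know I can pass an array to the suggest field, but I'd like to avoid lowercasing the text, splitting it, and running a stop word filter in my application before passing it to ES. If possible, I'd rather ES did this for me. I did try adding an index_analyzer to the field mapping, but that didn't seem to achieve anything.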

Or, is there another way to get autocomplete suggestions for words?

Best answer

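Okay, so this is pretty involved, but I think it does what you want, more or less. I'm not going to explain the whole thing, because that would take quite a bit of time. However, I will say that I started with this blog post and added a stop token filter. The "title" field has sub-fields (what used to be called a multi_field) that use different analyzers, or none. The query contains a couple of terms aggregations. Also notice that the aggregation results are filtered by the match query, so only results relevant to the text query are returned.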

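Here is the index setup (spend some time looking through it; if you have specific questions I'll try to answer them, but I'd encourage you to go through the blog post first):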

DELETE /test_index

PUT /test_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        },
        "stop_filter": {
          "type": "stop"
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "stop_filter",
            "nGram_filter"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "stop_filter"
          ]
        },
        "stopword_only_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "asciifolding",
            "stop_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "string",
          "index_analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            },
            "stopword_only": {
              "type": "string",
              "analyzer": "stopword_only_analyzer"
            }
          }
        }
      }
    }
  }
}
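
To make the roles of the index and search analyzers on "title" a bit more concrete, here is a rough Python model of the two chains. It is only a sketch, not Elasticsearch code: the stop word set is a small illustrative subset that I am assuming, asciifolding is left out, and the function names (whitespace_analyzer, ngram_analyzer, matches) are made up for this example.

```python
# Rough model of the two analyzers used on "title"; not Elasticsearch code.
STOPWORDS = {"the", "and", "a", "an", "of"}  # illustrative subset (assumption)


def whitespace_analyzer(text):
    """whitespace tokenizer -> lowercase -> stop filter (asciifolding omitted)."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def ngram_analyzer(text, min_gram=2, max_gram=20):
    """Same chain, then every substring of length min_gram..max_gram per token."""
    grams = []
    for token in whitespace_analyzer(text):
        for size in range(min_gram, min(max_gram, len(token)) + 1):
            for start in range(len(token) - size + 1):
                grams.append(token[start:start + size])
    return grams


def matches(query, title):
    """Model of a match query with operator "or": any search term is an indexed gram."""
    indexed = set(ngram_analyzer(title))
    return any(term in indexed for term in whitespace_analyzer(query))


titles = ["The Lion King", "Beauty and the Beast", "Alladin",
          "The Little Mermaid", "Lady and the Tramp"]
hits = [t for t in titles if matches("mer king", t)]
print(hits)  # ['The Lion King', 'The Little Mermaid']
```

The point of the split is that n-grams are only generated at index time. The search text is just lowercased and stripped of stop words, so a partial word like "mer" is compared as-is against the stored grams of "mermaid".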

Then I added some docs:
PUT /test_index/_bulk
{"index": {"_index":"test_index", "_type":"doc", "_id":1}}
{"title": "The Lion King"}
{"index": {"_index":"test_index", "_type":"doc", "_id":2}}
{"title": "Beauty and the Beast"}
{"index": {"_index":"test_index", "_type":"doc", "_id":3}}
{"title": "Alladin"}
{"index": {"_index":"test_index", "_type":"doc", "_id":4}}
{"title": "The Little Mermaid"}
{"index": {"_index":"test_index", "_type":"doc", "_id":5}}
{"title": "Lady and the Tramp"}

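Now I can search for docs by word prefix (or by whole word, capitalized or not), and use aggregations to return both the intact titles of the matching docs and the intact (non-lowercased) words, minus the stop words: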
POST /test_index/_search?search_type=count
{
  "query": {
    "match": {
      "title": {
        "query": "mer king",
        "operator": "or"
      }
    }
  },
  "aggs": {
    "word_tokens": {
      "terms": { "field": "title.stopword_only" }
    },
    "intact_titles": {
      "terms": { "field": "title.raw" }
    }
  }
}
...
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "intact_titles": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "The Lion King",
          "doc_count": 1
        },
        {
          "key": "The Little Mermaid",
          "doc_count": 1
        }
      ]
    },
    "word_tokens": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "The",
          "doc_count": 2
        },
        {
          "key": "King",
          "doc_count": 1
        },
        {
          "key": "Lion",
          "doc_count": 1
        },
        {
          "key": "Little",
          "doc_count": 1
        },
        {
          "key": "Mermaid",
          "doc_count": 1
        }
      ]
    }
  }
}
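
The word_tokens buckets contain every token from the matching titles, not only the words the user was typing. If you want word-level suggestions, one option is to narrow the buckets down on the client. Below is a minimal Python sketch of that idea; suggestions_from_buckets is a hypothetical helper, not part of any Elasticsearch client, and it assumes you only want words that start with one of the typed fragments.

```python
def suggestions_from_buckets(buckets, query):
    """Keep bucket keys that start with any whitespace-separated query fragment."""
    prefixes = [p.lower() for p in query.split()]
    return [b["key"] for b in buckets
            if any(b["key"].lower().startswith(p) for p in prefixes)]


# The word_tokens buckets copied from the response above.
word_tokens = [
    {"key": "The", "doc_count": 2},
    {"key": "King", "doc_count": 1},
    {"key": "Lion", "doc_count": 1},
    {"key": "Little", "doc_count": 1},
    {"key": "Mermaid", "doc_count": 1},
]

print(suggestions_from_buckets(word_tokens, "mer king"))  # ['King', 'Mermaid']
```

Note that the n-gram index matches fragments anywhere inside a word, while this filter keeps prefix matches only. Swap startswith for a substring check if you want the two to agree.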

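Notice that "The" is returned. This seems to be because the default _english_ stop words only contain the lowercase "the", and stopword_only_analyzer has no lowercase filter, so the capitalized token is not recognized as a stop word. I didn't immediately find a way around this.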
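
The effect is easy to reproduce outside Elasticsearch. The Python sketch below models a stop filter with and without case-insensitive matching; the stop word set is an illustrative subset and the function is my own stand-in, not the real filter. The real stop token filter does document an ignore_case option, so adding "ignore_case": true to stop_filter in the settings looks like a plausible fix, but I have not tested it against this index.

```python
ENGLISH_STOPWORDS = {"the", "and"}  # illustrative subset (assumption)


def stop_filter(tokens, stopwords=ENGLISH_STOPWORDS, ignore_case=False):
    """Drop stop words; the comparison is case-sensitive unless ignore_case is set."""
    kept = []
    for token in tokens:
        candidate = token.lower() if ignore_case else token
        if candidate not in stopwords:
            kept.append(token)  # the original casing is preserved in the output
    return kept


tokens = "The Lion King".split()
print(stop_filter(tokens))                    # ['The', 'Lion', 'King']
print(stop_filter(tokens, ignore_case=True))  # ['Lion', 'King']
```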

Here is the code I used:

http://sense.qbox.io/gist/2fbb8a16b2cd35370f5d5944aa9ea7381544be79

Let me know if this helps you solve your problem.

Regarding elasticsearch - Elasticsearch context suggester stop words, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/27946812/
