
elasticsearch - Prioritize certain fields in ES search results


I am using elasticsearch-6.4.3. I created an index named flight-location_methods with the following settings:

settings index: {
  analysis: {
    "filter": {
      "autocomplete_filter": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 20
      }
    },
    "analyzer": {
      "autocomplete": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "autocomplete_filter"]
      }
    }
  }
}

mapping do
  indexes :airport_code, type: "text", analyzer: "autocomplete", search_analyzer: "standard"
  indexes :airport_name, type: "text", analyzer: "autocomplete", search_analyzer: "standard"
  indexes :city_name, type: "text", analyzer: "autocomplete", search_analyzer: "standard"
  indexes :country_name, type: "text", analyzer: "autocomplete", search_analyzer: "standard"
end

The snippets above come from the Ruby code that represents the settings and mapping I created for the index.

When I execute this query:
GET /flight-location_methods/_search
{
  "from": 0,
  "size": 1000,
  "query": {
    "function_score": {
      "functions": [
        {
          "filter": {
            "match": {
              "city_name": "new yo"
            }
          },
          "weight": 50
        },
        {
          "filter": {
            "match": {
              "country_name": "new yo"
            }
          },
          "weight": 50
        }
      ],
      "max_boost": 200,
      "score_mode": "max",
      "boost_mode": "multiply",
      "min_score": 10
    }
  }
}

I get these results:
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "tcoj1G0Bdo5Q9AduxCKi",
  "_score": 50,
  "_source": {
    "airport_name": "Ouvea",
    "airport_code": "UVE",
    "city_name": "Ouvea",
    "country_name": "New Caledonia"
  }
},
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "zMoj1G0Bdo5Q9AduxCKi",
  "_score": 50,
  "_source": {
    "airport_name": "Palmerston North",
    "airport_code": "PMR",
    "city_name": "Palmerston North",
    "country_name": "New Zealand"
  }
},
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "1Moj1G0Bdo5Q9AduxCKi",
  "_score": 50,
  "_source": {
    "airport_name": "Westport",
    "airport_code": "WSZ",
    "city_name": "Westport",
    "country_name": "New Zealand"
  }
},
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "1coj1G0Bdo5Q9AduxCKi",
  "_score": 50,
  "_source": {
    "airport_name": "Whangarei",
    "airport_code": "WRE",
    "city_name": "Whangarei",
    "country_name": "New Zealand"
  }
},
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "Rsoj1G0Bdo5Q9AduxCOi",
  "_score": 50,
  "_source": {
    "airport_name": "Municipal",
    "airport_code": "RNH",
    "city_name": "New Richmond",
    "country_name": "United States"
  }
},
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "fsoj1G0Bdo5Q9AduxCOi",
  "_score": 50,
  "_source": {
    "airport_name": "New London",
    "airport_code": "GON",
    "city_name": "New London",
    "country_name": "United States"
  }
},
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "gMoj1G0Bdo5Q9AduxCOi",
  "_score": 50,
  "_source": {
    "airport_name": "New Ulm",
    "airport_code": "ULM",
    "city_name": "New Ulm",
    "country_name": "United States"
  }
},
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "5coj1G0Bdo5Q9AduxCSi",
  "_score": 50,
  "_source": {
    "airport_name": "Cape Newenham",
    "airport_code": "EHM",
    "city_name": "Cape Newenham",
    "country_name": "United States"
  }
},
{
  "_index": "flight-location_methods",
  "_type": "_doc",
  "_id": "Ycoj1G0Bdo5Q9AduxCWi",
  "_score": 50,
  "_source": {
    "airport_name": "East 60th Street H/P",
    "airport_code": "JRE",
    "city_name": "New York",
    "country_name": "United States"
  }
}

As you can see, New York should be at the top, but it is not.

Also, I cannot use the AND operator, because if the search text contains multiple words I want any of those words to match in any of the fields. However, if all of the search text is found in a single field, that document should get higher priority.

Best Answer

Let's first discuss the Elasticsearch tokenizer and the tokenization process:

A tokenizer receives a stream of characters and breaks it up into individual tokens (usually individual words). — ES docs



Now let's describe how the autocomplete analyzer works:
  • The standard tokenizer splits the incoming text into tokens, using the standard Elasticsearch tokenizer (for simplicity, let's say these are words).
  • The lowercase filter lowercases all characters.
  • The edge_ngram filter then breaks each word into tokens (grams).

  • Here is where the magic starts: I think your gram range of 1 to 20 is too broad. There may be words with more than 10 characters, but those long grams are irrelevant for our purposes. Likewise, grams consisting of a single character are not useful to us. I would change it as follows (a quick _analyze check follows after this list):
       "filter": {
    "autocomplete_filter": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 5
    }
    }
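
As a sanity check (a sketch only, assuming the index has been recreated with the 2-5 gram filter above and still exposes the autocomplete analyzer under the index name from the question), the _analyze API shows exactly which grams get indexed:

GET /flight-location_methods/_analyze
{
  "analyzer": "autocomplete",
  "text": "New York"
}

For "New York" this should return roughly the grams ne, new, yo, yor and york, which is why a partial input such as "new yo" can still match the city on the ngram field.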

Our index will then contain many word fragments, 2 to 5 characters long. Now that we know what we will be searching against, we can create the mapping and write the query:
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 5
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "airport_name": {
          "type": "text",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "autocomplete"
            }
          }
        },
        "airport_code": {
          "type": "keyword",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "autocomplete"
            }
          }
        },
        "city_name": {
          "type": "keyword",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "autocomplete"
            }
          }
        },
        "country_name": {
          "type": "keyword",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "autocomplete"
            }
          }
        }
      }
    }
  }
}

I defined each field with both an ngram sub-field and a regular field, to keep the ability to run aggregations. For example, it is nice to be able to find cities that have more than one airport (a sketch of such an aggregation follows below).
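
A minimal sketch of such an aggregation, assuming the keyword mapping above and the same index name (the aggregation name cities_with_multiple_airports is just illustrative):

GET /flight-location_methods/_search
{
  "size": 0,
  "aggs": {
    "cities_with_multiple_airports": {
      "terms": {
        "field": "city_name",
        "min_doc_count": 2,
        "size": 20
      }
    }
  }
}

Because city_name is a plain keyword field here, the terms aggregation can bucket documents by exact city, and min_doc_count: 2 keeps only cities served by at least two airports.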

Now we can run a simple query to get New York:
{
  "size": 20,
  "query": {
    "query_string": {
      "default_field": "city_name.ngram",
      "query": "new yo",
      "default_operator": "AND"
    }
  }
}

Response:
{
  "took": 15,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 13.896059,
    "hits": [
      {
        "_index": "test-index",
        "_type": "_doc",
        "_id": "BtBD2W0BCDulLSY6pKM8",
        "_score": 13.896059,
        "_source": {
          "airport_name": "Flushing",
          "airport_code": "FLU",
          "city_name": "New York",
          "country_name": "United States"
        }
      }
    ]
  }
}

Alternatively, use boosting to build a boosted query. This will also be more efficient when querying a large list of data.

Your query should look like this:
{
  "query": {
    "function_score": {
      "query": {
        "query_string": {
          "query": "new yo",
          "analyzer": "autocomplete"
        }
      },
      "functions": [
        {
          "filter": {
            "terms": {
              "city_name.ngram": [
                "new",
                "yo"
              ]
            }
          },
          "weight": 2
        },
        {
          "filter": {
            "terms": {
              "country_name.ngram": [
                "new",
                "yo"
              ]
            }
          },
          "weight": 2
        }
      ],
      "max_boost": 30,
      "min_score": 5,
      "score_mode": "max",
      "boost_mode": "multiply"
    }
  }
}

In this query, New York will come first because the query part filters out all irrelevant documents, and the city_name.ngram field score is multiplied by 2; since both tokens match in that field, it gets the highest score. The min_score at the bottom of the query also filters out documents that are not relevant enough. You can read about the current Elasticsearch relevance algorithm here.
By the way, I would not put both filters into functions with the same weight. You should decide which field is more important; that makes your search clearer.
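
A sketch of that last point (the weights 3 and 2 are arbitrary, chosen only to illustrate preferring a city match over a country match; everything else is taken from the query above):

GET /flight-location_methods/_search
{
  "query": {
    "function_score": {
      "query": {
        "query_string": { "query": "new yo", "analyzer": "autocomplete" }
      },
      "functions": [
        { "filter": { "terms": { "city_name.ngram": ["new", "yo"] } }, "weight": 3 },
        { "filter": { "terms": { "country_name.ngram": ["new", "yo"] } }, "weight": 2 }
      ],
      "max_boost": 30,
      "min_score": 5,
      "score_mode": "max",
      "boost_mode": "multiply"
    }
  }
}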

Regarding elasticsearch - Prioritize certain fields in ES search results, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58424832/
