
Elasticsearch partitioned indices: skipped shards vs. MatchNoDocsQuery

Reposted. Author: 行者123. Updated: 2023-12-02 22:13:02

We have indices partitioned by year, for example:

items-2019
items-2020

Consider the following data:
POST items-2019/_doc
{
  "@timestamp": "2019-01-01"
}

POST items-2020/_doc
{
  "@timestamp": "2020-01-01"
}

POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "items-*",
        "alias": "items"
      }
    }
  ]
}

Now, when I query the data and explicitly sort the results, the items-2020 shard is skipped:
GET items/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "lt": "2020-01-01"
      }
    }
  },
  "sort": {
    "@timestamp": "desc"
  }
}

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 1, <--- skipped
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "items-2019",
        "_type" : "_doc",
        "_id" : "BTdSb3UBRFH0Yqe1vm_W",
        "_score" : null,
        "_source" : {
          "@timestamp" : "2019-01-01"
        },
        "sort" : [
          1546300800000
        ]
      }
    ]
  }
}

However, when I do not explicitly sort the results, no shard is skipped, but ES issues a MatchNoDocsQuery on the items-2020 shard:
GET items/_search
{
  "profile": "true",
  "query": {
    "range": {
      "@timestamp": {
        "lt": "2020-01-01"
      }
    }
  }
}

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0, <--- nothing skipped
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "items-2019",
        "_type" : "_doc",
        "_id" : "BTdSb3UBRFH0Yqe1vm_W",
        "_score" : 1.0,
        "_source" : {
          "@timestamp" : "2019-01-01"
        }
      }
    ]
  },
  "profile" : {
    "shards" : [
      {
        "id" : "[Axyv60mYQEGAREa2TwbgMQ][items-2019][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "ConstantScoreQuery",
                "description" : "ConstantScore(DocValuesFieldExistsQuery [field=@timestamp])",
                "time_in_nanos" : 69525,
                "breakdown" : {
                  "set_min_competitive_score_count" : 0,
                  "match_count" : 0,
                  "shallow_advance_count" : 0,
                  "set_min_competitive_score" : 0,
                  "next_doc" : 3766,
                  "match" : 0,
                  "next_doc_count" : 1,
                  "score_count" : 1,
                  "compute_max_score_count" : 0,
                  "compute_max_score" : 0,
                  "advance" : 4123,
                  "advance_count" : 1,
                  "score" : 1123,
                  "build_scorer_count" : 2,
                  "create_weight" : 29745,
                  "shallow_advance" : 0,
                  "create_weight_count" : 1,
                  "build_scorer" : 30768
                },
                "children" : [
                  {
                    "type" : "DocValuesFieldExistsQuery",
                    "description" : "DocValuesFieldExistsQuery [field=@timestamp]",
                    "time_in_nanos" : 18317,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 0,
                      "shallow_advance_count" : 0,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 1474,
                      "match" : 0,
                      "next_doc_count" : 1,
                      "score_count" : 0,
                      "compute_max_score_count" : 0,
                      "compute_max_score" : 0,
                      "advance" : 1541,
                      "advance_count" : 1,
                      "score" : 0,
                      "build_scorer_count" : 2,
                      "create_weight" : 1184,
                      "shallow_advance" : 0,
                      "create_weight_count" : 1,
                      "build_scorer" : 14118
                    }
                  }
                ]
              }
            ],
            "rewrite_time" : 4660,
            "collector" : [
              {
                "name" : "SimpleTopScoreDocCollector",
                "reason" : "search_top_hits",
                "time_in_nanos" : 22374
              }
            ]
          }
        ],
        "aggregations" : [ ]
      },
      {
        "id" : "[Axyv60mYQEGAREa2TwbgMQ][items-2020][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "MatchNoDocsQuery",
                "description" : """MatchNoDocsQuery("User requested "match_none" query.")""", <-- here
                "time_in_nanos" : 4166,
                "breakdown" : {
                  "set_min_competitive_score_count" : 0,
                  "match_count" : 0,
                  "shallow_advance_count" : 0,
                  "set_min_competitive_score" : 0,
                  "next_doc" : 0,
                  "match" : 0,
                  "next_doc_count" : 0,
                  "score_count" : 0,
                  "compute_max_score_count" : 0,
                  "compute_max_score" : 0,
                  "advance" : 0,
                  "advance_count" : 0,
                  "score" : 0,
                  "build_scorer_count" : 1,
                  "create_weight" : 1791,
                  "shallow_advance" : 0,
                  "create_weight_count" : 1,
                  "build_scorer" : 2375
                }
              }
            ],
            "rewrite_time" : 4353,
            "collector" : [
              {
                "name" : "SimpleTopScoreDocCollector",
                "reason" : "search_top_hits",
                "time_in_nanos" : 12887
              }
            ]
          }
        ],
        "aggregations" : [ ]
      }
    ]
  }
}

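So there are a few questions here:
  • Does skipping truly skip shards?
  • How does a skipped shard differ from a MatchNoDocsQuery?
  • What's the cost of MatchNoDocsQuery?
  • How does sorting allow shards to be skipped?
  • If we sort results, do we really completely skip shards and not even touch them during search?

Best answer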

    That's a lot of questions bundled together, but here is my attempt:

    Does skipping truly skip shards?

    How does sorting allow shards to be skipped?

    If we sort results, do we really completely skip shards and not even touch them during search?


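    Yes, ES tries to be smart enough to figure out which shards to hit before actually sending the query to them. The _search_shards API helps here, but it is not the only mechanism, as the explanations in this issue show.
    If you search the issues for the keywords can_match, skip and shard, you will find that many other optimizations have been implemented here and there to make the ES execution plan smarter and faster.
    If you want to see how this is coded, you can start with the SearchService.canMatch() method. This is where the service decides whether the query can be rewritten to a MatchNoDocsQuery. If you add suggest or global aggregations (which have to visit all documents no matter what), you will see that shards are no longer skipped, even with the sort present.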
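
    The decision described above can be pictured with a small Python model. This is only a sketch under stated assumptions, not the code behind SearchService.canMatch(): the ShardStats class, the can_match and plan functions, and the visits_all_docs flag are hypothetical names, and the per-shard minimum and maximum of @timestamp are taken from the two documents in the question.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class ShardStats:
    """Hypothetical per-shard summary: min and max of the @timestamp field."""
    index: str
    min_ts: date
    max_ts: date


def can_match(shard: ShardStats,
              lt: Optional[date] = None,
              gte: Optional[date] = None) -> bool:
    """Return False when the range [gte, lt) cannot overlap the shard's values."""
    if lt is not None and shard.min_ts >= lt:
        return False  # every value in the shard is too large
    if gte is not None and shard.max_ts < gte:
        return False  # every value in the shard is too small
    return True


def plan(shards: List[ShardStats],
         lt: Optional[date] = None,
         gte: Optional[date] = None,
         visits_all_docs: bool = False) -> Tuple[List[str], List[str]]:
    """Split shards into (queried, skipped).

    visits_all_docs models requests such as a global aggregation or a
    suggester, for which no shard may be skipped.
    """
    queried, skipped = [], []
    for shard in shards:
        if visits_all_docs or can_match(shard, lt=lt, gte=gte):
            queried.append(shard.index)
        else:
            skipped.append(shard.index)
    return queried, skipped


shards = [
    ShardStats("items-2019", date(2019, 1, 1), date(2019, 1, 1)),
    ShardStats("items-2020", date(2020, 1, 1), date(2020, 1, 1)),
]

# Range query "lt": "2020-01-01" from the question.
print(plan(shards, lt=date(2020, 1, 1)))
# Same query, plus something that must visit every document.
print(plan(shards, lt=date(2020, 1, 1), visits_all_docs=True))
```

    In this model the first call skips items-2020 and the second call queries both indices, which mirrors the behaviour the answer describes for suggest and global aggregations.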

    What's the cost of MatchNoDocsQuery?


    I wouldn't worry about it: not only is it negligible, it is also out of your hands.

    How does sorting allow shards to be skipped?


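    As stated in issue #51852, which I linked above: "This change will rewrite the shard queries to match none if the bottom sort value computed in prior shards is better than all values in the shard." In other words, ES is smart enough to know, based on the sort value, which shards can contain valid hits. In your case, since the sort on the timestamp rules out all 2020 values, ES knows that the shards of the 2020 index can be excluded because none of their documents will match.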
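
    To make the quoted sentence more concrete, here is a toy Python sketch of pruning by the bottom sort value for a descending sort. It is an illustration under assumptions rather than the ES implementation: the function name top_k_desc is hypothetical, shards are visited one after another, and max(values) stands in for a per-shard maximum that a real engine would read from index metadata instead of scanning documents.

```python
import heapq
from typing import Dict, List, Tuple


def top_k_desc(shards: Dict[str, List[int]], k: int) -> Tuple[List[int], List[str]]:
    """Collect the k largest sort values, skipping shards that cannot compete.

    shards maps a shard name to its numeric sort values (epoch millis here).
    """
    top: List[int] = []      # min-heap holding the k largest values seen so far
    skipped: List[str] = []
    for name, values in shards.items():
        # The bottom sort value is only known once k hits have been collected.
        bottom = top[0] if len(top) == k else None
        if bottom is not None and values and max(values) < bottom:
            # The bottom value beats every value in this shard, so the
            # shard query could be rewritten to match none.
            skipped.append(name)
            continue
        for value in values:
            if len(top) < k:
                heapq.heappush(top, value)
            elif value > top[0]:
                heapq.heapreplace(top, value)
    return sorted(top, reverse=True), skipped


# Hypothetical data: two documents in 2020, one in 2019.
shards = {
    "items-2020": [1577836800000, 1580515200000],
    "items-2019": [1546300800000],
}

print(top_k_desc(shards, k=2))
```

    With k=2 the two 2020 values fill the result first, so items-2019 cannot contribute and is reported as skipped. With k=3 nothing can be skipped, because the result is not yet full when items-2019 is reached.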
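    Another possibility is to leverage index sorting, so that documents are sorted at indexing time. The sort is applied inside each segment of the index, but every time segments are merged the newly merged set has to be sorted again, so this can have a performance impact. Test before using!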
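
    As a rough sketch of what enabling index sorting could look like, the Python snippet below builds a create-index body that uses the index.sort.field and index.sort.order settings. The helper name sorted_index_body and the index name items-2021 are hypothetical, and the mapping assumes @timestamp is a date field. Index sorting has to be defined when the index is created, so it would apply to new yearly indices rather than existing ones.

```python
import json


def sorted_index_body(field: str, order: str = "desc") -> dict:
    """Build a create-index request body with index sorting on one field."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    return {
        "settings": {
            "index": {
                "sort.field": field,   # field the segments are sorted by
                "sort.order": order,   # sort direction inside each segment
            }
        },
        "mappings": {
            "properties": {
                field: {"type": "date"}  # assumption: the sort field is a date
            }
        },
    }


body = sorted_index_body("@timestamp")

# Print the request in the console style used in the question.
print("PUT items-2021")
print(json.dumps(body, indent=2))
```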

    This question and answer come from Stack Overflow: https://stackoverflow.com/questions/64573690/
