gpt4 book ai didi

postgresql - 如何解释以下 PostgreSQL 查询计划

转载 作者:行者123 更新时间:2023-11-29 12:38:00 26 4
gpt4 key购买 nike

请注意:

(忘记加单了,计划更新了)

查询:

EXPLAIN ANALYZE
SELECT DISTINCT(id), special, customer, business_no, bill_to_name, bill_to_address1, bill_to_address2, bill_to_postal_code, ship_to_name, ship_to_address1, ship_to_address2, ship_to_postal_code,
purchase_order_no, ship_date::text, calc_discount_text(o) AS discount, discount_absolute, delivery, hst_percents, sub_total, total_before_hst, hst, total, total_discount, terms, rep, ship_via,
item_count, version, to_char(modified, 'YYYY-MM-DD HH24:MI:SS') AS "modified", to_char(created, 'YYYY-MM-DD HH24:MI:SS') AS "created"
FROM invoices o
LEFT JOIN reps ON reps.rep_id = o.rep_id
LEFT JOIN terms ON terms.terms_id = o.terms_id
LEFT JOIN shipVia ON shipVia.ship_via_id = o.ship_via_id
JOIN invoiceItems items ON items.invoice_id = o.id
WHERE items.qty < 5
ORDER BY modified
LIMIT 100

结果:

Limit  (cost=2931740.10..2931747.85 rows=100 width=635) (actual time=414307.004..414387.899 rows=100 loops=1)
-> Unique (cost=2931740.10..3076319.37 rows=1865539 width=635) (actual time=414307.001..414387.690 rows=100 loops=1)
-> Sort (cost=2931740.10..2936403.95 rows=1865539 width=635) (actual time=414307.000..414325.058 rows=2956 loops=1)
Sort Key: (to_char(o.modified, 'YYYY-MM-DD HH24:MI:SS'::text)), o.id, o.special, o.customer, o.business_no, o.bill_to_name, o.bill_to_address1, o.bill_to_address2, o.bill_to_postal_code, o.ship_to_name, o.ship_to_address1, o.ship_to_address2, (...)
Sort Method: external merge Disk: 537240kB
-> Hash Join (cost=11579.63..620479.38 rows=1865539 width=635) (actual time=1535.805..131378.864 rows=1872673 loops=1)
Hash Cond: (items.invoice_id = o.id)
-> Seq Scan on invoiceitems items (cost=0.00..78363.45 rows=1865539 width=4) (actual time=0.110..4591.117 rows=1872673 loops=1)
Filter: (qty < 5)
Rows Removed by Filter: 1405763
-> Hash (cost=5498.18..5498.18 rows=64996 width=635) (actual time=1530.786..1530.786 rows=64996 loops=1)
Buckets: 1024 Batches: 64 Memory Usage: 598kB
-> Hash Left Join (cost=113.02..5498.18 rows=64996 width=635) (actual time=0.214..1043.207 rows=64996 loops=1)
Hash Cond: (o.ship_via_id = shipvia.ship_via_id)
-> Hash Left Join (cost=75.35..4566.81 rows=64996 width=607) (actual time=0.154..754.957 rows=64996 loops=1)
Hash Cond: (o.terms_id = terms.terms_id)
-> Hash Left Join (cost=37.67..3800.33 rows=64996 width=579) (actual time=0.071..506.145 rows=64996 loops=1)
Hash Cond: (o.rep_id = reps.rep_id)
-> Seq Scan on invoices o (cost=0.00..2868.96 rows=64996 width=551) (actual time=0.010..235.977 rows=64996 loops=1)
-> Hash (cost=22.30..22.30 rows=1230 width=36) (actual time=0.044..0.044 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on reps (cost=0.00..22.30 rows=1230 width=36) (actual time=0.027..0.032 rows=4 loops=1)
-> Hash (cost=22.30..22.30 rows=1230 width=36) (actual time=0.067..0.067 rows=3 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on terms (cost=0.00..22.30 rows=1230 width=36) (actual time=0.001..0.007 rows=3 loops=1)
-> Hash (cost=22.30..22.30 rows=1230 width=36) (actual time=0.043..0.043 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on shipvia (cost=0.00..22.30 rows=1230 width=36) (actual time=0.027..0.032 rows=4 loops=1)
Total runtime: 414488.582 ms

这显然很糟糕。我对解释查询计划还很陌生,想知道如何从这样的计划中提取有用的性能改进提示。

编辑 1

  • 此查询涉及两种实体 - 具有一对多关系的发票和发票项目。
  • 发票项目指定其在父发票中的数量。
  • 给定的查询返回 100 张发票,其中至少有一项数量少于 5。

这应该可以解释为什么我需要 DISTINCT - 一张发票可能有几项满足过滤器,但我不希望多次返回同一张发票。因此使用了 DISTINCT。但是,我完全意识到可能有比使用 DISTINCT 更好的方法来实现相同的语义 - 我非常愿意了解它们。

编辑 2

请在查询时在 invoiceItems 表中找到以下索引:

CREATE INDEX invoiceitems_invoice_id_idx ON invoiceitems (invoice_id);
CREATE INDEX invoiceitems_invoice_id_name_index ON invoiceitems (invoice_id, name varchar_pattern_ops);
CREATE INDEX invoiceitems_name_index ON invoiceitems (name varchar_pattern_ops);
CREATE INDEX invoiceitems_qty_index ON invoiceitems (qty);

编辑 3

https://stackoverflow.com/users/808806/yieldsfalsehood 给出的关于如何消除 DISTINCT(以及为什么)的建议被证明是一个非常好的建议。这是新查询:

EXPLAIN ANALYZE
SELECT id, special, customer, business_no, bill_to_name, bill_to_address1, bill_to_address2, bill_to_postal_code, ship_to_name, ship_to_address1, ship_to_address2, ship_to_postal_code,
purchase_order_no, ship_date::text, calc_discount_text(o) AS discount, discount_absolute, delivery, hst_percents, sub_total, total_before_hst, hst, total, total_discount, terms, rep, ship_via,
item_count, version, to_char(modified, 'YYYY-MM-DD HH24:MI:SS') AS "modified", to_char(created, 'YYYY-MM-DD HH24:MI:SS') AS "created"
FROM invoices o
LEFT JOIN reps ON reps.rep_id = o.rep_id
LEFT JOIN terms ON terms.terms_id = o.terms_id
LEFT JOIN shipVia ON shipVia.ship_via_id = o.ship_via_id
WHERE EXISTS (SELECT 1 FROM invoiceItems items WHERE items.invoice_id = id AND items.qty < 5)
ORDER BY modified DESC
LIMIT 100

这是新计划:

Limit  (cost=64717.14..64717.39 rows=100 width=635) (actual time=7830.347..7830.869 rows=100 loops=1)
-> Sort (cost=64717.14..64827.01 rows=43949 width=635) (actual time=7830.334..7830.568 rows=100 loops=1)
Sort Key: (to_char(o.modified, 'YYYY-MM-DD HH24:MI:SS'::text))
Sort Method: top-N heapsort Memory: 76kB
-> Hash Left Join (cost=113.46..63037.44 rows=43949 width=635) (actual time=2.322..6972.679 rows=64467 loops=1)
Hash Cond: (o.ship_via_id = shipvia.ship_via_id)
-> Hash Left Join (cost=75.78..50968.72 rows=43949 width=607) (actual time=0.650..3809.276 rows=64467 loops=1)
Hash Cond: (o.terms_id = terms.terms_id)
-> Hash Left Join (cost=38.11..50438.25 rows=43949 width=579) (actual time=0.550..3527.558 rows=64467 loops=1)
Hash Cond: (o.rep_id = reps.rep_id)
-> Nested Loop Semi Join (cost=0.43..49796.28 rows=43949 width=551) (actual time=0.015..3200.735 rows=64467 loops=1)
-> Seq Scan on invoices o (cost=0.00..2868.96 rows=64996 width=551) (actual time=0.002..317.954 rows=64996 loops=1)
-> Index Scan using invoiceitems_invoice_id_idx on invoiceitems items (cost=0.43..7.61 rows=42 width=4) (actual time=0.030..0.030 rows=1 loops=64996)
Index Cond: (invoice_id = o.id)
Filter: (qty < 5)
Rows Removed by Filter: 1
-> Hash (cost=22.30..22.30 rows=1230 width=36) (actual time=0.213..0.213 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on reps (cost=0.00..22.30 rows=1230 width=36) (actual time=0.183..0.192 rows=4 loops=1)
-> Hash (cost=22.30..22.30 rows=1230 width=36) (actual time=0.063..0.063 rows=3 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on terms (cost=0.00..22.30 rows=1230 width=36) (actual time=0.044..0.050 rows=3 loops=1)
-> Hash (cost=22.30..22.30 rows=1230 width=36) (actual time=0.096..0.096 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on shipvia (cost=0.00..22.30 rows=1230 width=36) (actual time=0.071..0.079 rows=4 loops=1)
Total runtime: 7832.750 ms

这是我能指望的最好的吗?我已重新启动服务器(以清理数据库缓存)并在没有 EXPLAIN ANALYZE 的情况下重新运行查询。大约需要 5 秒钟。是否可以进一步改进?我有 65,000 张发票和 3,278,436 个发票项目。

编辑 4

找到了。我按计算结果排序,modified = to_char(modified, 'YYYY-MM-DD HH24:MI:SS')。在修改后的发票字段上添加索引并按字段本身排序可将结果缩短到 100 毫秒以下!

最终的方案是:

Limit  (cost=1.18..1741.92 rows=100 width=635) (actual time=3.002..27.065 rows=100 loops=1)
-> Nested Loop Left Join (cost=1.18..765042.09 rows=43949 width=635) (actual time=2.989..25.989 rows=100 loops=1)
-> Nested Loop Left Join (cost=1.02..569900.41 rows=43949 width=607) (actual time=0.413..16.863 rows=100 loops=1)
-> Nested Loop Left Join (cost=0.87..386185.48 rows=43949 width=579) (actual time=0.333..15.694 rows=100 loops=1)
-> Nested Loop Semi Join (cost=0.72..202470.54 rows=43949 width=551) (actual time=0.017..13.965 rows=100 loops=1)
-> Index Scan Backward using invoices_modified_index on invoices o (cost=0.29..155543.23 rows=64996 width=551) (actual time=0.003..4.543 rows=100 loops=1)
-> Index Scan using invoiceitems_invoice_id_idx on invoiceitems items (cost=0.43..7.61 rows=42 width=4) (actual time=0.079..0.079 rows=1 loops=100)
Index Cond: (invoice_id = o.id)
Filter: (qty < 5)
Rows Removed by Filter: 1
-> Index Scan using reps_pkey on reps (cost=0.15..4.17 rows=1 width=36) (actual time=0.007..0.008 rows=1 loops=100)
Index Cond: (rep_id = o.rep_id)
-> Index Scan using terms_pkey on terms (cost=0.15..4.17 rows=1 width=36) (actual time=0.003..0.004 rows=1 loops=100)
Index Cond: (terms_id = o.terms_id)
-> Index Scan using shipvia_pkey on shipvia (cost=0.15..4.17 rows=1 width=36) (actual time=0.006..0.008 rows=1 loops=100)
Index Cond: (ship_via_id = o.ship_via_id)
Total runtime: 27.572 ms

太棒了!谢谢大家的帮助。

最佳答案

对于初学者来说,将解释计划发布到 http://explain.depesz.com 是非常标准的做法- 这将为它添加一些漂亮的格式,为您提供分发计划的好方法,并让您匿名化可能包含敏感数据的计划。即使您没有分发计划,它也会让您更容易理解正在发生的事情,有时还可以准确说明瓶颈所在。

有无数的资源涵盖解释 postgres 解释计划的细节(参见 https://wiki.postgresql.org/wiki/Using_EXPLAIN)。数据库选择计划时需要考虑很多小细节,但有一些通用概念可以使其更容易。首先,了解数据和索引的基于页面的布局(您不需要了解页面格式的详细信息,只需了解数据和索引如何拆分到页面中即可)。从那里开始,了解两种基本的数据访问方法——全表扫描和索引扫描——稍微想一想,应该开始清楚在不同的情况下,一种比另一种更受欢迎(还要记住,索引扫描甚至并不总是可能的)。届时,您可以开始研究一些影响计划选择的不同配置项,了解它们如何使规模倾斜以支持表扫描或索引扫描。

一旦你了解了它,继续向上移动计划并阅读你找到的不同节点的详细信息 - 在这个计划中你有很多散列连接,所以首先阅读它.然后,为了同类比较,完全禁用散列连接(“set enable_hashjoin = false;”)并再次运行解释分析。现在您看到什么连接方法?仔细阅读。将该方法的估计成本与散列连接的估计成本进行比较。为什么它们可能不同?第二个计划的估计成本将高于第一个计划(否则它会首先被首选)但是运行第二个计划所需的实际时间如何?是低了还是高了?

最后,专门针对这个计划。关于那种需要很长时间的类型:distinct 不是一个函数。 “DISTINCT(id)”并没有说“给我所有仅在列 id 上不同的行”,而是对行进行排序并根据输出中的所有列获取唯一值(即等同于编写“不同的 ID ...”)。您可能应该重新考虑您是否真的需要那里的独特之处。规范化往往会设计出对 distinct 的需求,虽然偶尔会需要它们,但它们是否真的 super 真正需要并不总是如此。

关于postgresql - 如何解释以下 PostgreSQL 查询计划,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/20424053/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com