gpt4 book ai didi

postgresql - 在 PostgreSQL 中使用多个表进行大更新

转载 作者:行者123 更新时间:2023-11-29 13:29:27 33 4
gpt4 key购买 nike

几周前我开始使用 Postgres,最近几天我一直在尝试解决这个问题,结果好坏参半:我有一个表(1000 万行)需要每月更新一次信息存储在其他几个表中。在此更新期间,没有人可以访问数据库,因此没有人正在读取/写入数据。在那段时间我将是唯一的数据库用户。

表 A 有 1000 万行。一栏(重建成本)需要更新。表 B、表 C 和表 D 为表 A 中的每一行重新计算了重建成本(因此表 B、C 和 D 各有 1000 万行)。哪个表用于更新表 A 取决于值,请参见下文。我一直在尝试使用此查询进行此更新:

UPDATE table_A a
SET rebuilding_costs =
CASE
WHEN b.rebuilding_costs BETWEEN 100000 AND 200000 THEN b.rebuilding_costs
WHEN c.rebuilding_costs BETWEEN 200001 AND 400000 THEN c.rebuilding_costs
WHEN d.rebuilding_costs BETWEEN 400001 AND 600000 THEN d.rebuilding_costs
ELSE NULL
END
FROM table_B b
LEFT OUTER JOIN table_C c
ON (b.address = c.address)
LEFT OUTER JOIN table_D d
ON (b.address = d.address)
WHERE a.address = b.address
;

这个查询给出了正确的结果,但是有点慢(25 分钟)。有趣的是:当表 A 有 100 万行(而不是 1000 万行)时,只需要 30 秒。因此,当在具有 1000 万行的表上运行此查询时,我预计该查询将运行 5 分钟,但实际运行了 25 分钟。那是当我尝试在 block 中更新时,所以我将此行添加到 WHERE 子句中的查询:

AND (a.id > 0 AND a.id < 1000000)

a.id 是表 A 的主键。在此示例中,仅更新表 A 的前百万行。但是,需要 3 分钟才能完成。你必须这样做 10 次才能更新所有 1000 万行,所以这将是 30 分钟..

我还试图通过添加以下内容来防止查询使用 UPDATE 不会更改的行来执行查询:

AND a.herbouwwaarde_indicatie IS DISTINCT FROM b.inhoud
AND a.herbouwwaarde_indicatie IS DISTINCT FROM c.inhoud
AND a.herbouwwaarde_indicatie IS DISTINCT FROM d.inhoud

将 fillfactor 设置为 70 和 50 确实有助于提高一点速度,但我无法在 20 分钟内完成。

我还尝试重新创建表 A,例如:

CREATE TABLE table_A_new

AS

SELECT a.address,

CASE
WHEN b.rebuilding_costs BETWEEN 100000 AND 200000 THEN b.rebuilding_costs
WHEN c.rebuilding_costs BETWEEN 200001 AND 400000 THEN c.rebuilding_costs
WHEN d.rebuilding_costs BETWEEN 400001 AND 600000 THEN d.rebuilding_costs
ELSE NULL
END rebuildingcosts

FROM table_A a
LEFT OUTER JOIN table_B b
ON (a.address = b.address)
LEFT OUTER JOIN table_C c
ON (a.address = c.address)
LEFT OUTER JOIN table_D d
ON (a.address = d.address)
;

这非常快(2 分钟)并给出了正确的结果。但是,您必须重新创建整个表。不知何故,这似乎不是很有效(并且占用大量存储空间),这就是我开始使用 UPDATE 查询的原因。

我的问题是:最好的方法是什么?有没有一种方法可以提高 UPDATE 查询的性能,或者最好寻找一种替代方法,例如上面示例中的“创建表”。

虽然我必须使用 Postrgres,但我不能切换到不同的 DBMS。

这是 UPDATE 查询的执行计划(现在表没有 1000 万行,但大约有 6 - 800 万行):

Update on tabel_A a  (cost=902288.27..2150690.80 rows=6714762 width=65)
-> Hash Join (cost=902288.27..2150690.80 rows=6714762 width=65)
Hash Cond: ((b.adres)::text = a.adres)"
Join Filter: ((b.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs)
AND (c.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs))
AND (d.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs))
-> Hash Left Join (cost=522527.27..1318059.42 rows=6716471 width=39)
Hash Cond: ((b.adres)::text = (d.adres)::text)
-> Hash Right Join (cost=295916.60..817658.93 rows=6716471 width=29)
Hash Cond: ((c.adres)::text = (b.adres)::text)
-> Seq Scan on Tabel_C c (cost=0.00..240642.35 rows=7600735 width=19)
-> Hash (cost=172605.71..172605.71 rows=6716471 width=19)
-> Seq Scan on tabel_B b (cost=0.00..172605.71 rows=6716471 width=19)
-> Hash (cost=103436.52..103436.52 rows=6709052 width=19)"
-> Seq Scan on tabel D d (cost=0.00..103436.52 rows=6709052 width=19)"
-> Hash (cost=217261.00..217261.00 rows=8000000 width=39)"
-> Seq Scan on Tabel_A a (cost=0.00..217261.00 rows=8000000 width=39)"

这是解释分析输出:

Update on Tabel_A a  (cost=902288.27..2150690.80 rows=6714762 width=65) (actual time=2815452.997..2815452.997 rows=0 loops=1)
-> Hash Join (cost=902288.27..2150690.80 rows=6714762 width=65) (actual time=108861.999..214888.780 rows=5252864 loops=1)
Hash Cond: ((b.adres)::text = a.adres)
Join Filter: ((b.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs) AND
(c.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs))
(d.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs))
Rows Removed by Join Filter: 670998
-> Hash Left Join (cost=522527.27..1318059.42 rows=6716471 width=39) (actual time=43138.635..116933.803 rows=6711432 loops=1)"
Hash Cond: ((b.adres)::text = (d.adres)::text)"
-> Hash Right Join (cost=295916.60..817658.93 rows=6716471 width=29) (actual time=34571.750..99040.256 rows=6710550 loops=1)"
Hash Cond: ((c.adres)::text = (b.adres)::text)"
-> Seq Scan on Tabel_C c (cost=0.00..240642.35 rows=7600735 width=19) (actual time=127.080..59703.935 rows=7595083 loops=1)"
-> Hash (cost=172605.71..172605.71 rows=6716471 width=19) (actual time=29925.787..29925.787 rows=6709229 loops=1)"
Buckets: 2048 Batches: 512 Memory Usage: 678kB"
-> Seq Scan on Tabel_B b (cost=0.00..172605.71 rows=6716471 width=19) (actual time=0.017..27245.069 rows=6709229 loops=1)"
-> Hash (cost=103436.52..103436.52 rows=6709052 width=19) (actual time=8566.848..8566.848 rows=6709229 loops=1)"
Buckets: 2048 Batches: 512 Memory Usage: 678kB"
-> Seq Scan on Tabel_D d (cost=0.00..103436.52 rows=6709052 width=19) (actual time=0.009..5970.010 rows=6709229 loops=1)"
-> Hash (cost=217261.00..217261.00 rows=8000000 width=39) (actual time=65721.815..65721.815 rows=8000000 loops=1)"
Buckets: 2048 Batches: 1024 Memory Usage: 612kB"
-> Seq Scan on Tabel_A a (cost=0.00..217261.00 rows=8000000 width=39) (actual time=0.056..55968.171 rows=8000000 loops=1)"
Total runtime: 2815453.549 ms"

表 A、B、C 和 D 的所有索引都在地址列上,该列用于连接。表没有其他索引。表 A 确实有一个主键 (id)。表 B、C 和 D 有 5 到 7 个额外的列未在此过程中使用。

最佳答案

此查询应产生相同的结果,但避免了 CASE

-- EXPLAIN ANALYZE
UPDATE table_a a
SET rebuilding_costs = drie.rebuilding_costs
FROM (
SELECT COALESCE(b.address, c.address, d.address) AS address
, COALESCE(b.rebuilding_costs, c.rebuilding_costs, d.rebuilding_costs)
AS rebuilding_costs
FROM table_b b
FULL OUTER JOIN table_c c ON c.address = b.address
AND c.rebuilding_costs BETWEEN 200001 AND 400000
FULL OUTER JOIN table_D d ON d.address = b.address
AND d.rebuilding_costs BETWEEN 400001 AND 600000
WHERE b.rebuilding_costs BETWEEN 100001 AND 200000
) drie
WHERE a.address = drie.address
AND a.rebuilding_costs <> drie.rebuilding_costs -- Avoid useless updates
;

更新:基于一组链式 CTE 的类似方法:

-- --------------------------------
EXPLAIN ANALYZE
WITH cte_b AS (
SELECT b.address, b.rebuilding_costs
FROM table_b b
WHERE b.rebuilding_costs BETWEEN 100001 AND 200000
)
, cte_c AS (
SELECT c.address , c.rebuilding_costs
FROM table_c c
WHERE c.rebuilding_costs BETWEEN 200001 AND 400000
AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = c.address)
)
, cte_d AS (
SELECT d.address , d.rebuilding_costs
FROM table_d d
WHERE d.rebuilding_costs BETWEEN 400001 AND 600000
AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = d.address)
AND NOT EXISTS (SELECT * FROM cte_c WHERE cte_c.address = d.address)
)
, cte_bcd AS (
SELECT cte_b.address, cte_b.rebuilding_costs FROM cte_b
UNION ALL SELECT cte_c.address, cte_c.rebuilding_costs FROM cte_c
UNION ALL SELECT cte_d.address, cte_d.rebuilding_costs FROM cte_d
)
UPDATE table_a a
SET rebuilding_costs = cte_bcd.rebuilding_costs
FROM cte_bcd
WHERE a.address = cte_bcd.address
-- avoid useless updates this way:
AND a.rebuilding_costs <> cte_bcd.rebuilding_costs
-- ,or this way:
-- AND cte_bcd.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs
;

更新 2:CTE 可能很慢,因为它们充当优化障碍。一种快速而肮脏的方法是将 tem 重写为(临时)VIEW,并改为引用它们。这允许优化器将查询的各个部分混入和混出子查询,甚至组合和重用它们。

CREATE TEMP VIEW cte_b AS (
SELECT b.address, b.rebuilding_costs
FROM table_b b
WHERE b.rebuilding_costs BETWEEN 100001 AND 200000
);
CREATE TEMP VIEW cte_c AS (
SELECT c.address , c.rebuilding_costs
FROM table_c c
WHERE c.rebuilding_costs BETWEEN 200001 AND 400000
AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = c.address)
);
CREATE TEMP VIEW cte_d AS (
SELECT d.address , d.rebuilding_costs
FROM table_d d
WHERE d.rebuilding_costs BETWEEN 400001 AND 600000
AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = d.address)
AND NOT EXISTS (SELECT * FROM cte_c WHERE cte_c.address = d.address)
);
CREATE TEMP VIEW cte_bcd AS (
SELECT cte_b.address, cte_b.rebuilding_costs FROM cte_b
UNION ALL SELECT cte_c.address, cte_c.rebuilding_costs FROM cte_c
UNION ALL SELECT cte_d.address, cte_d.rebuilding_costs FROM cte_d
);
EXPLAIN -- ANALYZE
UPDATE table_a a
SET rebuilding_costs = cte_bcd.rebuilding_costs
FROM cte_bcd
WHERE a.address = cte_bcd.address
AND a.rebuilding_costs <> cte_bcd.rebuilding_costs -- avoid useless updates
-- AND a.address < 100000
;

关于postgresql - 在 PostgreSQL 中使用多个表进行大更新,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/27144046/

33 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com