gpt4 book ai didi

sql - CTE 扫描比应有的速度慢得多

转载 作者:行者123 更新时间:2023-11-29 11:35:32 25 4
gpt4 key购买 nike

以下查询正在大约 400 万行上运行。前两个 CTE 语句在大约一个小时内执行。然而,最后一个有望持续超过 15 年。

WITH parsed AS (
SELECT name, array(...) description FROM import
), counts AS (
SELECT unnest(description) token, count(*) FROM parsed GROUP BY 1
)
INSERT INTO table (name, description)
SELECT name, ARRAY(
SELECT ROW(token, count)::a
FROM (
SELECT token, (
SELECT count
FROM counts
WHERE a.token=counts.token
)
FROM UNNEST(description) a(token)
) _
)::a[] description
FROM parsed;

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Insert on table (cost=55100824.40..162597717038.41 rows=3611956 width=96)
CTE parsed
-> Seq Scan on import (cost=0.00..51425557.67 rows=3611956 width=787)
Filter: ((name IS NOT NULL) AND (description IS NOT NULL))
SubPlan 1
-> HashAggregate (cost=11.59..12.60 rows=101 width=55)
-> Append (cost=0.00..11.34 rows=101 width=55)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Index Scan using import_aliases_mid_idx on import_aliases (cost=0.00..10.32 rows=100 width=56)
Index Cond: (mid = "substring"(import.mid, 5))
SubPlan 2
-> HashAggregate (cost=0.78..1.30 rows=100 width=0)
-> Result (cost=0.00..0.53 rows=100 width=0)
CTE counts
-> HashAggregate (cost=3675165.23..3675266.73 rows=20000 width=32)
-> CTE Scan on parsed (cost=0.00..1869187.23 rows=361195600 width=32)
-> CTE Scan on parsed (cost=0.00..162542616214.01 rows=3611956 width=96)
SubPlan 6
-> Function Scan on unnest a (cost=0.00..45001.25 rows=100 width=32)
SubPlan 5
-> CTE Scan on counts (cost=0.00..450.00 rows=100 width=8)
Filter: (a.token = token)

parsedcounts 中大约有 400 万行。查询当前正在运行,最终语句大约每 2 分钟插入一行。它几乎不接触磁盘,但疯狂地吃 CPU,我很困惑。

查询有什么问题?

最后的语句应该在 counts 中查找 description 的每个元素,将像这样的 [a,b,c] 转换为像这样的 [(a,9),(b,4),(c,0)] 并插入它。


编辑

将 parsed 和 counts 作为表,并将 token in counts 索引,这是计划:

explain INSERT INTO table (name, mid, description) SELECT name, mid, ARRAY(SELECT ROW(token, count)::a FROM (SELECT token, (SELECT count FROM counts WHERE a.token=counts.token) FROM UNNEST(description) a(token)) _)::a[] description FROM parsed;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Insert on table (cost=0.00..5761751808.75 rows=4002061 width=721)
-> Seq Scan on parsed (cost=0.00..5761751808.75 rows=4002061 width=721)
SubPlan 2
-> Function Scan on unnest a (cost=0.00..1439.59 rows=100 width=32)
SubPlan 1
-> Index Scan using counts_token_idx on counts (cost=0.00..14.39 rows=1 width=4)
Index Cond: (a.token = token)

哪个更合理。这些数组平均有 57 个元素,所以我猜这只是对可能相当低效的 CTE 表的大量查找导致性能下降。它现在以每秒 300 行的速度运行,对此我很满意。

最佳答案

正如我在对问题的编辑中所述,将解析和计数作为表,并将计数中的 token 编入索引,速度要快得多。我假设 CTE 连接比它们更聪明。

关于sql - CTE 扫描比应有的速度慢得多,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/12848711/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com