sql - Speed up a query with Hash Join in PostgreSQL


I have the following query:

SELECT
  Sum(fact_individual_re.quality_hours) AS C0,
  dim_gender.name AS C1,
  dim_date.year AS C2
FROM
  fact_individual_re
  INNER JOIN dim_date ON fact_individual_re.dim_date_id = dim_date.id
  INNER JOIN dim_gender ON fact_individual_re.dim_gender_id = dim_gender.id
GROUP BY dim_date.year, dim_gender.name
ORDER BY dim_date.year ASC, dim_gender.name ASC, Sum(fact_individual_re.quality_hours) ASC

When I EXPLAIN it, the hash joins take most of the time. Is there any way to reduce the time spent in the hash joins:

Sort  (cost=190370.50..190370.55 rows=20 width=18) (actual time=4005.152..4005.154 rows=20 loops=1)
  Sort Key: dim_date.year, dim_gender.name, (sum(fact_individual_re.quality_hours))
  Sort Method: quicksort  Memory: 26kB
  ->  Finalize GroupAggregate  (cost=190369.07..190370.07 rows=20 width=18) (actual time=4005.106..4005.135 rows=20 loops=1)
        Group Key: dim_date.year, dim_gender.name
        ->  Sort  (cost=190369.07..190369.27 rows=80 width=18) (actual time=4005.100..4005.103 rows=100 loops=1)
              Sort Key: dim_date.year, dim_gender.name
              Sort Method: quicksort  Memory: 32kB
              ->  Gather  (cost=190358.34..190366.54 rows=80 width=18) (actual time=4004.966..4005.020 rows=100 loops=1)
                    Workers Planned: 4
                    Workers Launched: 4
                    ->  Partial HashAggregate  (cost=189358.34..189358.54 rows=20 width=18) (actual time=3885.254..3885.259 rows=20 loops=5)
                          Group Key: dim_date.year, dim_gender.name
                          ->  Hash Join  (cost=125.17..170608.34 rows=2500000 width=14) (actual time=2.279..2865.808 rows=2000000 loops=5)
                                Hash Cond: (fact_individual_re.dim_gender_id = dim_gender.id)
                                ->  Hash Join  (cost=124.13..150138.54 rows=2500000 width=12) (actual time=2.060..2115.234 rows=2000000 loops=5)
                                      Hash Cond: (fact_individual_re.dim_date_id = dim_date.id)
                                      ->  Parallel Seq Scan on fact_individual_re  (cost=0.00..118458.00 rows=2500000 width=12) (actual time=0.204..982.810 rows=2000000 loops=5)
                                      ->  Hash  (cost=78.50..78.50 rows=3650 width=8) (actual time=1.824..1.824 rows=3650 loops=5)
                                            Buckets: 4096  Batches: 1  Memory Usage: 175kB
                                            ->  Seq Scan on dim_date  (cost=0.00..78.50 rows=3650 width=8) (actual time=0.143..1.030 rows=3650 loops=5)
                                ->  Hash  (cost=1.02..1.02 rows=2 width=10) (actual time=0.193..0.193 rows=2 loops=5)
                                      Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                      ->  Seq Scan on dim_gender  (cost=0.00..1.02 rows=2 width=10) (actual time=0.181..0.182 rows=2 loops=5)
Planning time: 0.609 ms
Execution time: 4020.423 ms
(26 rows)

I am using PostgreSQL v10.

Best Answer

I suggest partially aggregating the rows before joining:

select
  sum(quality_hours_sum) AS C0,
  dim_gender.name AS C1,
  dim_date.year AS C2
from
  (
    select
      sum(quality_hours) as quality_hours_sum,
      dim_date_id,
      dim_gender_id
    from fact_individual_re
    group by dim_date_id, dim_gender_id
  ) as fact_individual_re_sum
  join dim_date on dim_date_id = dim_date.id
  join dim_gender on dim_gender_id = dim_gender.id
group by dim_date.year, dim_gender.name
order by dim_date.year, dim_gender.name, sum(quality_hours_sum);

This way you will be joining only 1460 rows (count(distinct dim_date_id) * count(distinct dim_gender_id)) instead of all 2M. It still has to read and group all 2M rows, though; to avoid that, you would need something like a summary table maintained by triggers.
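For that last point, here is a minimal sketch of a trigger-maintained summary table. The column types are assumptions (the plan suggests integer keys), the names are made up, and only INSERTs on the fact table are handled; UPDATE and DELETE would need analogous triggers:

CREATE TABLE fact_individual_re_sum (
    dim_date_id       integer NOT NULL,
    dim_gender_id     integer NOT NULL,
    quality_hours_sum bigint  NOT NULL,
    PRIMARY KEY (dim_date_id, dim_gender_id)
);

-- Fold each new fact row into the running per-(date, gender) total.
CREATE FUNCTION fact_individual_re_sum_trg() RETURNS trigger AS $$
BEGIN
    INSERT INTO fact_individual_re_sum (dim_date_id, dim_gender_id, quality_hours_sum)
    VALUES (NEW.dim_date_id, NEW.dim_gender_id, NEW.quality_hours)
    ON CONFLICT (dim_date_id, dim_gender_id) DO UPDATE
        SET quality_hours_sum = fact_individual_re_sum.quality_hours_sum
                                + EXCLUDED.quality_hours_sum;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- EXECUTE PROCEDURE is the spelling PostgreSQL 10 accepts (EXECUTE FUNCTION is v11+).
CREATE TRIGGER trg_fact_individual_re_sum
AFTER INSERT ON fact_individual_re
FOR EACH ROW EXECUTE PROCEDURE fact_individual_re_sum_trg();

With this in place, the reporting query can join fact_individual_re_sum to the dimension tables directly, replacing the subquery above and skipping the scan of the 2M-row fact table entirely.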

Regarding sql - Speed up a query with Hash Join in PostgreSQL, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48385883/
