gpt4 book ai didi

performance - PostgreSQL查询速度是可变的

转载 作者:行者123 更新时间:2023-11-29 11:43:34 26 4
gpt4 key购买 nike

上下文

我有一张表,用于保存网络流数据(路由器截获的所有数据包)。该表目前包含大约 590 万行。

问题

我正在尝试一个简单的查询来计算每天收到的数据包数量,这不会花很长时间。

第一次运行时,查询耗时 88 秒,第二次运行后耗时 33 秒,然后是 5 秒所有后续运行。

主要问题不是查询速度,而是同一个查询执行3次后,速度快了近20倍。
我理解查询缓存 的概念,但是原始查询运行的性能对我来说毫无意义。

测试

我用来加入 (datetime) 的列是 timestamptz 类型,并且已编入索引:

CREATE INDEX date ON netflows USING btree (datetime);

查看 EXPLAIN 语句。执行的区别在于 Nested Loop

我已经对表进行了VACUUM ANALYZE,结果完全相同。

当前环境

  • 在 VMware ESX 4.1 上运行的 Linux Ubuntu 12.04 虚拟机
  • PostgreSQL 9.1
  • VM 有 2 GB RAM,2 个内核。
  • 数据库服务器完全致力于此操作,不做任何其他事情
  • 每分钟插入表格(每分钟 100 行)
  • 非常低的磁盘、内存或 CPU 事件

查询

with date_list as (
select
series as start_date,
series + '23:59:59' as end_date
from
generate_series(
(select min(datetime) from netflows)::date,
(select max(datetime) from netflows)::date,
'1 day') as series
)
select
start_date,
end_date,
count(*)
from
netflows
inner join date_list on (datetime between start_date and end_date)
group by
start_date,
end_date;

首次运行说明(88 秒)

Sort  (cost=27007355.59..27007356.09 rows=200 width=8) (actual time=89647.054..89647.055 rows=18 loops=1) 
Sort Key: date_list.start_date
Sort Method: quicksort Memory: 25kB
CTE date_list
-> Function Scan on generate_series series (cost=0.13..12.63 rows=1000 width=8) (actual time=92.567..92.667 rows=19 loops=1)
InitPlan 2 (returns $1)
-> Result (cost=0.05..0.06 rows=1 width=0) (actual time=71.270..71.270 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..0.05 rows=1 width=8) (actual time=71.259..71.261 rows=1 loops=1)
-> Index Scan using date on netflows (cost=0.00..303662.15 rows=5945591 width=8) (actual time=71.252..71.252 rows=1 loops=1)
Index Cond: (datetime IS NOT NULL)
InitPlan 4 (returns $3)
-> Result (cost=0.05..0.06 rows=1 width=0) (actual time=11.786..11.787 rows=1 loops=1)
InitPlan 3 (returns $2)
-> Limit (cost=0.00..0.05 rows=1 width=8) (actual time=11.778..11.779 rows=1 loops=1)
-> Index Scan Backward using date on netflows (cost=0.00..303662.15 rows=5945591 width=8) (actual time=11.776..11.776 rows=1 loops=1)
Index Cond: (datetime IS NOT NULL)
-> HashAggregate (cost=27007333.31..27007335.31 rows=200 width=8) (actual time=89639.167..89639.179 rows=18 loops=1)
-> Nested Loop (cost=0.00..23704227.20 rows=660621222 width=8) (actual time=92.667..88059.576 rows=5945457 loops=1)
-> CTE Scan on date_list (cost=0.00..20.00 rows=1000 width=16) (actual time=92.578..92.785 rows=19 loops=1)
-> Index Scan using date on netflows (cost=0.00..13794.89 rows=660621 width=8) (actual time=2.438..4571.884 rows=312919 loops=19)
Index Cond: ((datetime >= date_list.start_date) AND (datetime <= date_list.end_date))
Total runtime: 89668.047 ms

第三次运行的解释(5 秒)

Sort  (cost=27011357.45..27011357.95 rows=200 width=8) (actual time=5645.031..5645.032 rows=18 loops=1) 
Sort Key: date_list.start_date
Sort Method: quicksort Memory: 25kB
CTE date_list
-> Function Scan on generate_series series (cost=0.13..12.63 rows=1000 width=8) (actual time=0.108..0.204 rows=19 loops=1)
InitPlan 2 (returns $1)
-> Result (cost=0.05..0.06 rows=1 width=0) (actual time=0.050..0.050 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..0.05 rows=1 width=8) (actual time=0.046..0.046 rows=1 loops=1)
-> Index Scan using date on netflows (cost=0.00..303705.14 rows=5946469 width=8) (actual time=0.046..0.046 rows=1 loops=1)
Index Cond: (datetime IS NOT NULL)
InitPlan 4 (returns $3)
-> Result (cost=0.05..0.06 rows=1 width=0) (actual time=0.026..0.026 rows=1 loops=1)
InitPlan 3 (returns $2)
-> Limit (cost=0.00..0.05 rows=1 width=8) (actual time=0.026..0.026 rows=1 loops=1)
-> Index Scan Backward using date on netflows (cost=0.00..303705.14 rows=5946469 width=8) (actual time=0.026..0.026 rows=1 loops=1)
Index Cond: (datetime IS NOT NULL)
-> HashAggregate (cost=27011335.17..27011337.17 rows=200 width=8) (actual time=5645.005..5645.009 rows=18 loops=1)
-> Nested Loop (cost=0.00..23707741.28 rows=660718778 width=8) (actual time=0.134..4176.406 rows=5946329 loops=1)
-> CTE Scan on date_list (cost=0.00..20.00 rows=1000 width=16) (actual time=0.110..0.343 rows=19 loops=1)
-> Index Scan using date on netflows (cost=0.00..13796.94 rows=660719 width=8) (actual time=0.026..164.117 rows=312965 loops=19)
Index Cond: ((datetime >= date_list.start_date) AND (datetime <= date_list.end_date))
Total runtime: 5645.189 ms

最佳答案

如果您正在执行 INNER JOIN,我认为您根本不需要 CTE。你可以定义

select
datetime::date,
count(*)
from netflows
group by datetime::date /* or GROUP BY 1 as Postgres extension */

我不明白为什么需要日期表,除非您希望 LEFT JOIN 在适当的地方获取零。这将意味着一次传递数据。

顺便说一句,我不鼓励您对实体和列使用诸如日期和日期时间之类的关键字;即使它是合法的,也不值得。

关于performance - PostgreSQL查询速度是可变的,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/13735112/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com