
sql - How can I optimize this group-wise maximum SQL query?

Reposted. Author: 行者123. Updated: 2023-11-29 13:28:11

This is what I want. It basically takes all records from the daily_statistics table and groups them by user_id. At the same time, it also does the following:

  1. Each user's values come from their most recent record
  2. attachment_ids is represented as an array, so I can determine how many attachments a user has

Result:

 user_id | country_id |       time_at       | assumed_gender |    attachment_ids
---------+------------+---------------------+----------------+----------------------
   21581 |        172 | 2015-04-18 17:55:00 |                | [5942]
   21610 |        140 | 2015-04-18 19:55:00 | male           | [5940]
   22044 |        174 | 2015-04-18 21:55:00 | female         | [12312313, 12312313]
   21353 |        174 | 2015-04-18 20:59:00 | male           | [5938]
   21573 |        246 | 2015-04-18 21:57:00 | male           | [5936]
(5 rows)

The following query is slow, taking around 17 seconds:

  SELECT
      ds.user_id,
      max(CASE WHEN id = maxid THEN country_id END) AS country_id,
      max(CASE WHEN id = maxid THEN time_at END) AS time_at,
      max(CASE WHEN id = maxid THEN properties->'assumed_gender' END) AS assumed_gender,
      json_agg(to_json(attachment_id)) AS attachment_ids
  FROM daily_statistics ds
  JOIN (
      SELECT u.id AS user_id, (
          SELECT ds2.id FROM daily_statistics ds2
          WHERE ds2.user_id = u.id
            AND ds2.metric = 'participation'
            AND ds2.status = 'active'
            AND ds2.campaign_id = 39
          ORDER BY ds2.id DESC LIMIT 1
      ) AS maxid
      FROM users u
      WHERE u.properties -> 'provider' IN ('twitter')
  ) mu ON (ds.user_id = mu.user_id)
  WHERE ds.campaign_id = 39
    AND ds.metric = 'participation'
    AND ds.status = 'active'
  GROUP BY ds.user_id;

The problem lies in the group-wise maximum part. Is there a way to optimize this query while getting the same output? I was thinking of using some kind of lateral join, but then I couldn't get the number of attachment_ids per user.

Edit: with 9k+ matching records in a table of 2M rows, this query takes about 25 seconds to execute.

EXPLAIN ANALYZE SELECT
    ds.user_id,
    max(CASE WHEN id = maxid THEN country_id END) AS country_id,
    max(CASE WHEN id = maxid THEN time_at END) AS time_at,
    max(CASE WHEN id = maxid THEN properties->'assumed_gender' END) AS assumed_gender,
    json_agg(to_json(attachment_id)) AS attachment_ids
FROM daily_statistics ds
JOIN (
    SELECT u.id AS user_id, (
        SELECT ds2.id FROM daily_statistics ds2
        WHERE ds2.user_id = u.id
          AND ds2.metric = 'participation'
          AND ds2.status = 'active'
          AND ds2.campaign_id = 4742
        ORDER BY ds2.id DESC LIMIT 1
    ) AS maxid
    FROM users u
    WHERE u.properties -> 'provider' IN ('twitter')
) mu ON (ds.user_id = mu.user_id)
WHERE ds.campaign_id = 4742
  AND ds.metric = 'participation'
  AND ds.status = 'active'
GROUP BY ds.user_id;
                                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2063.07..2063.08 rows=1 width=103) (actual time=25155.963..25156.859 rows=775 loops=1)
   ->  Nested Loop  (cost=0.98..1883.99 rows=2 width=103) (actual time=0.744..382.699 rows=2787 loops=1)
         ->  Index Scan using index_daily_statistics_on_campaign_id_and_type on daily_statistics ds  (cost=0.56..1621.73 rows=31 width=99) (actual time=0.107..33.513 rows=9751 loops=1)
               Index Cond: (campaign_id = 4742)
               Filter: (((metric)::text = 'participation'::text) AND ((status)::text = 'active'::text))
         ->  Index Scan using index_users_on_id_and_type on users u  (cost=0.42..8.45 rows=1 width=4) (actual time=0.024..0.024 rows=0 loops=9751)
               Index Cond: (id = ds.user_id)
               Filter: ((properties -> 'provider'::text) = 'twitter'::text)
               Rows Removed by Filter: 1
               SubPlan 1
                 ->  Limit  (cost=29.83..29.84 rows=1 width=4) (actual time=2.953..2.954 rows=1 loops=2787)
                       ->  Sort  (cost=29.83..29.84 rows=1 width=4) (actual time=2.951..2.951 rows=1 loops=2787)
                             Sort Key: ds2.id
                             Sort Method: top-N heapsort  Memory: 25kB
                             ->  Bitmap Heap Scan on daily_statistics ds2  (cost=25.80..29.82 rows=1 width=4) (actual time=2.381..2.702 rows=105 loops=2787)
                                   Recheck Cond: ((user_id = u.id) AND (campaign_id = 4742))
                                   Filter: (((metric)::text = 'participation'::text) AND ((status)::text = 'active'::text))
                                   ->  BitmapAnd  (cost=25.80..25.80 rows=1 width=0) (actual time=2.365..2.365 rows=0 loops=2787)
                                         ->  Bitmap Index Scan on index_daily_statistics_on_user_id  (cost=0.00..5.60 rows=156 width=0) (actual time=0.072..0.072 rows=292 loops=2787)
                                               Index Cond: (user_id = u.id)
                                         ->  Bitmap Index Scan on index_daily_statistics_on_campaign_id_and_type  (cost=0.00..19.95 rows=453 width=0) (actual time=2.241..2.241 rows=9751 loops=2787)
                                               Index Cond: (campaign_id = 4742)
               SubPlan 2
                 ->  Limit  (cost=29.83..29.84 rows=1 width=4) (actual time=2.879..2.880 rows=1 loops=2787)
                       ->  Sort  (cost=29.83..29.84 rows=1 width=4) (actual time=2.876..2.876 rows=1 loops=2787)
                             Sort Key: ds2_1.id
                             Sort Method: top-N heapsort  Memory: 25kB
                             ->  Bitmap Heap Scan on daily_statistics ds2_1  (cost=25.80..29.82 rows=1 width=4) (actual time=2.241..2.585 rows=105 loops=2787)
                                   Recheck Cond: ((user_id = u.id) AND (campaign_id = 4742))
                                   Filter: (((metric)::text = 'participation'::text) AND ((status)::text = 'active'::text))
                                   ->  BitmapAnd  (cost=25.80..25.80 rows=1 width=0) (actual time=2.222..2.222 rows=0 loops=2787)
                                         ->  Bitmap Index Scan on index_daily_statistics_on_user_id  (cost=0.00..5.60 rows=156 width=0) (actual time=0.062..0.062 rows=292 loops=2787)
                                               Index Cond: (user_id = u.id)
                                         ->  Bitmap Index Scan on index_daily_statistics_on_campaign_id_and_type  (cost=0.00..19.95 rows=453 width=0) (actual time=2.124..2.124 rows=9751 loops=2787)
                                               Index Cond: (campaign_id = 4742)
               SubPlan 3
                 ->  Limit  (cost=29.83..29.84 rows=1 width=4) (actual time=3.030..3.030 rows=1 loops=2787)
                       ->  Sort  (cost=29.83..29.84 rows=1 width=4) (actual time=3.018..3.018 rows=1 loops=2787)
                             Sort Key: ds2_2.id
                             Sort Method: top-N heapsort  Memory: 25kB
                             ->  Bitmap Heap Scan on daily_statistics ds2_2  (cost=25.80..29.82 rows=1 width=4) (actual time=2.407..2.755 rows=105 loops=2787)
                                   Recheck Cond: ((user_id = u.id) AND (campaign_id = 4742))
                                   Filter: (((metric)::text = 'participation'::text) AND ((status)::text = 'active'::text))
                                   ->  BitmapAnd  (cost=25.80..25.80 rows=1 width=0) (actual time=2.390..2.390 rows=0 loops=2787)
                                         ->  Bitmap Index Scan on index_daily_statistics_on_user_id  (cost=0.00..5.60 rows=156 width=0) (actual time=0.121..0.121 rows=292 loops=2787)
                                               Index Cond: (user_id = u.id)
                                         ->  Bitmap Index Scan on index_daily_statistics_on_campaign_id_and_type  (cost=0.00..19.95 rows=453 width=0) (actual time=2.233..2.233 rows=9751 loops=2787)
                                               Index Cond: (campaign_id = 4742)
 Total runtime: 25158.063 ms
(49 rows)


foobar_production=> \d daily_statistics;
                                  Table "public.daily_statistics"
    Column     |            Type             |                           Modifiers
---------------+-----------------------------+---------------------------------------------------------------
 id            | integer                     | not null default nextval('daily_statistics_id_seq'::regclass)
 type          | character varying(255)     |
 metric        | character varying(255)     |
 campaign_id   | integer                     |
 user_id       | integer                     |
 country_id    | integer                     |
 attachment_id | integer                     |
 time_at       | timestamp without time zone |
 properties    | hstore                      |
 status        | character varying(255)     | default 'active'::character varying
Indexes:
    "daily_statistics_pkey" PRIMARY KEY, btree (id)
    "index_daily_statistics_on_attachment_id" btree (attachment_id)
    "index_daily_statistics_on_campaign_id_and_type" btree (campaign_id, type)
    "index_daily_statistics_on_country_id" btree (country_id)
    "index_daily_statistics_on_id" btree (id)
    "index_daily_statistics_on_metric" btree (metric)
    "index_daily_statistics_on_properties" gin (properties)
    "index_daily_statistics_on_status" btree (status)
    "index_daily_statistics_on_time_at" btree (time_at)
    "index_daily_statistics_on_user_id" btree (user_id)

Any ideas would be appreciated.

Best Answer

There seem to be two parts here:

  1. The first is fetching the latest statistics entry for each user
  2. The other is accumulating all attachment_ids for each user

Both are restricted to a particular kind of statistic. Since you are interested in users, I would start from them first.

Search for the latest entries with this query:

SELECT u.id,
       ds.country_id,
       ds.time_at,
       ds.properties->'assumed_gender' AS assumed_gender
FROM users u
JOIN LATERAL (
    SELECT * FROM daily_statistics
    WHERE user_id = u.id
      AND campaign_id = 39
      AND metric = 'participation'
      AND status = 'active'
    ORDER BY id DESC LIMIT 1
) ds ON true
WHERE u.properties -> 'provider' IN ('twitter');

I used the LATERAL syntax here, which is very well suited to this kind of query.
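For readers unfamiliar with LATERAL: a LATERAL subquery may reference columns of tables listed earlier in the FROM clause, and it is re-evaluated for each of their rows. A minimal sketch with hypothetical toy tables (authors/posts are illustrative names, not the asker's schema):

```sql
-- Toy tables, purely for illustration
CREATE TABLE authors (id int PRIMARY KEY, name text);
CREATE TABLE posts   (id int PRIMARY KEY, author_id int, title text);

-- For each author, fetch only their single latest post:
SELECT a.name, p.title
FROM authors a
JOIN LATERAL (
    SELECT * FROM posts
    WHERE author_id = a.id      -- references the outer row: this is what LATERAL permits
    ORDER BY id DESC LIMIT 1    -- keep only the newest row per author
) p ON true;
```

`ON true` is the conventional join condition here, since the subquery's WHERE clause already does the correlation.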

Aggregation will not benefit from it, though, so a separate subquery is needed for that.

I ended up with the following final query:

SELECT u.id,
       ds.country_id,
       ds.time_at,
       ds.properties->'assumed_gender' AS assumed_gender,
       g.attachment_ids
FROM users u
JOIN LATERAL (
    SELECT * FROM daily_statistics
    WHERE user_id = u.id
      AND campaign_id = 39
      AND metric = 'participation'
      AND status = 'active'
    ORDER BY id DESC LIMIT 1
) ds ON true
JOIN (
    SELECT user_id, json_agg(to_json(attachment_id)) AS attachment_ids
    FROM daily_statistics
    WHERE campaign_id = 39
      AND metric = 'participation'
      AND status = 'active'
    GROUP BY user_id
) g ON g.user_id = u.id
WHERE u.properties -> 'provider' IN ('twitter');
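Since the asker's stated goal was to determine how many attachments each user has, the aggregating subquery could also return a count alongside the array. This is a hedged variant, not part of the original answer:

```sql
-- Variant of the aggregating subquery that also emits a per-user count.
-- Note: count(attachment_id) ignores NULL attachment_ids while json_agg
-- includes them; use count(*) if NULL rows should be counted as well.
SELECT user_id,
       json_agg(to_json(attachment_id)) AS attachment_ids,
       count(attachment_id)             AS attachment_count
FROM daily_statistics
WHERE campaign_id = 39
  AND metric = 'participation'
  AND status = 'active'
GROUP BY user_id;
```

Alternatively, `json_array_length(attachment_ids)` can be applied to the aggregated array in the outer query.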

I assume that this index:

CREATE INDEX i_ds_campaign4status
    ON daily_statistics (campaign_id, user_id, id)
    WHERE status = 'active';

would help. This depends on your data, though; if all your statistics are 'active', drop the WHERE clause.
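One way to check that the planner can actually use such a partial index is to EXPLAIN the lateral subquery on its own; the query must repeat the `status = 'active'` predicate for the partial index to qualify. A sketch, using a sample user_id taken from the result set above:

```sql
-- The equality conditions on campaign_id and user_id plus ORDER BY id
-- line up with the (campaign_id, user_id, id) index columns, so this
-- should be answerable by a single (backward) index scan.
EXPLAIN
SELECT * FROM daily_statistics
WHERE campaign_id = 39
  AND user_id = 21581
  AND metric = 'participation'
  AND status = 'active'     -- must match the index's WHERE clause
ORDER BY id DESC LIMIT 1;
```

An "Index Scan ... i_ds_campaign4status" line in the output indicates the partial index is being picked up.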


Edit: judging by the plan provided, the second query benefits from joining against the aggregate, since it reduces the number of iterations of the LATERAL part. I would stick with this approach.

Regarding "sql - How can I optimize this group-wise maximum SQL query?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29748534/
