
postgresql - Generating a histogram of values grouped by a column

Reposted. Author: 行者123. Updated: 2023-11-29 11:29:12

I have the following data in a review table for a number of items, scored on a scale from 0 to 100:

+-----------+---------+-------+
| review_id | item_id | score |
+-----------+---------+-------+
| 1 | 1 | 90 |
+-----------+---------+-------+
| 2 | 1 | 40 |
+-----------+---------+-------+
| 3 | 1 | 10 |
+-----------+---------+-------+
| 4 | 2 | 90 |
+-----------+---------+-------+
| 5 | 2 | 90 |
+-----------+---------+-------+
| 6 | 2 | 70 |
+-----------+---------+-------+
| 7 | 3 | 80 |
+-----------+---------+-------+
| 8 | 3 | 80 |
+-----------+---------+-------+
| 9 | 3 | 80 |
+-----------+---------+-------+
| 10 | 3 | 80 |
+-----------+---------+-------+
| 11 | 4 | 10 |
+-----------+---------+-------+
| 12 | 4 | 30 |
+-----------+---------+-------+
| 13 | 4 | 50 |
+-----------+---------+-------+
| 14 | 4 | 80 |
+-----------+---------+-------+

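I am trying to create a histogram of the scores with 5 bins (each 20 points wide). My goal is to produce one histogram per item. To build a histogram over the whole table, one can use width_bucket. This can also be adapted to run on a per-item basis: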

SELECT item_id, g.n as bucket, COUNT(m.score) as count 
FROM generate_series(1, 5) g(n) LEFT JOIN
review as m
ON width_bucket(score, 0, 100, 4) = g.n
GROUP BY item_id, g.n
ORDER BY item_id, g.n;

However, the result looks like this:

+---------+--------+-------+
| item_id | bucket | count |
+---------+--------+-------+
| 1 | 5 | 1 |
+---------+--------+-------+
| 1 | 3 | 1 |
+---------+--------+-------+
| 1 | 1 | 1 |
+---------+--------+-------+
| 2 | 5 | 2 |
+---------+--------+-------+
| 2 | 4 | 2 |
+---------+--------+-------+
| 3 | 4 | 4 |
+---------+--------+-------+
| 4 | 1 | 1 |
+---------+--------+-------+
| 4 | 2 | 1 |
+---------+--------+-------+
| 4 | 3 | 1 |
+---------+--------+-------+
| 4 | 4 | 1 |
+---------+--------+-------+

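That is, bins with no entries are left out. While I don't think this is a bad solution, I would rather have all buckets, with 0 for those that have no entries. Even better would be this structure: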

+---------+----------+----------+----------+----------+----------+
| item_id | bucket_1 | bucket_2 | bucket_3 | bucket_4 | bucket_5 |
+---------+----------+----------+----------+----------+----------+
| 1 | 1 | 0 | 1 | 0 | 1 |
+---------+----------+----------+----------+----------+----------+
| 2 | 0 | 0 | 0 | 2 | 2 |
+---------+----------+----------+----------+----------+----------+
| 3 | 0 | 0 | 0 | 4 | 0 |
+---------+----------+----------+----------+----------+----------+
| 4 | 1 | 1 | 1 | 1 | 0 |
+---------+----------+----------+----------+----------+----------+

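I prefer this layout because it uses one row per item (instead of 5n rows), which makes it simpler to query and minimizes memory consumption and data transfer costs. My current approach is the following: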

select item_id,
       sum(case when score >= 0  and score <= 19  then 1 else 0 end) as bucket_1,
       sum(case when score >= 20 and score <= 39  then 1 else 0 end) as bucket_2,
       sum(case when score >= 40 and score <= 59  then 1 else 0 end) as bucket_3,
       sum(case when score >= 60 and score <= 79  then 1 else 0 end) as bucket_4,
       sum(case when score >= 80 and score <= 100 then 1 else 0 end) as bucket_5
from review
-- item_id is selected next to aggregates, so it has to be grouped
group by item_id;

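Although this query does what I need, I am curious whether there is a more elegant way. That many case expressions are hard to read, and a change to the bin boundaries would mean updating every sum. I am also curious about potential performance problems with this query.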

Best answer

The second query can be rewritten using ranges, which makes the query somewhat easier to write and edit:

with buckets (b1, b2, b3, b4, b5) as (
  values (
    -- each range includes its lower bound and excludes its upper bound,
    -- so the last one ends at 101 to cover a score of exactly 100
    int4range(0, 20), int4range(20, 40), int4range(40, 60), int4range(60, 80), int4range(80, 101)
  )
)
select item_id,
       count(*) filter (where b1 @> score) as bucket_1,
       count(*) filter (where b2 @> score) as bucket_2,
       count(*) filter (where b3 @> score) as bucket_3,
       count(*) filter (where b4 @> score) as bucket_4,
       count(*) filter (where b5 @> score) as bucket_5
from review
cross join buckets
group by item_id
order by item_id;

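A range constructed with int4range(0,20) includes the lower bound and excludes the upper bound. To count a score of exactly 100 in the last bucket, that bucket's range therefore has to be int4range(80, 101) or int4range(80, 100, '[]').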
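The bound behaviour can be checked directly. This is a minimal sketch, assuming PostgreSQL 9.2 or later (where range types are available); the column aliases are made up for illustration and are not part of the original answer.

```sql
-- @> tests whether the range on the left contains the element on the right.
select int4range(0, 20) @> 0            as lower_bound_included,  -- true
       int4range(0, 20) @> 20           as upper_bound_included,  -- false
       int4range(80, 100) @> 100        as default_covers_100,    -- false
       int4range(80, 100, '[]') @> 100  as inclusive_covers_100;  -- true
```

The third argument '[]' asks for inclusive bounds on both ends; for integer ranges PostgreSQL stores that as the equivalent half-open range, so int4range(80, 100, '[]') and int4range(80, 101) behave the same way.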

The CTE named buckets produces only one row, so the cross join does not change the number of rows coming from the review table.
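If the bins are always equally wide, width_bucket from the question may be another way to avoid spelling out every boundary. The following is only a sketch under stated assumptions, not part of the original answer: it assumes PostgreSQL 9.4 or later (for filter), recreates the question's sample data in a temporary table named review, and introduces the hypothetical aliases w, b, i and review_count. width_bucket(score, 0, 100, 5) returns 6 for a score of exactly 100, so least() is used to fold that value into bucket 5.

```sql
-- Sample data copied from the question; the temporary table lives only for this session.
create temp table review (review_id int primary key, item_id int, score int);
insert into review (review_id, item_id, score) values
  (1, 1, 90), (2, 1, 40), (3, 1, 10),
  (4, 2, 90), (5, 2, 90), (6, 2, 70),
  (7, 3, 80), (8, 3, 80), (9, 3, 80), (10, 3, 80),
  (11, 4, 10), (12, 4, 30), (13, 4, 50), (14, 4, 80);

-- Wide format: one row per item, one column per bucket.
-- The bucket number b is computed once per row in the lateral subquery.
select item_id,
       count(*) filter (where b = 1) as bucket_1,
       count(*) filter (where b = 2) as bucket_2,
       count(*) filter (where b = 3) as bucket_3,
       count(*) filter (where b = 4) as bucket_4,
       count(*) filter (where b = 5) as bucket_5
from review
cross join lateral (
  select least(width_bucket(score, 0, 100, 5), 5) as b
) as w
group by item_id
order by item_id;

-- Long format: one row per (item, bucket), with 0 for empty buckets.
-- Every item is first paired with every bucket number, and only then
-- left-joined to the reviews, so empty combinations keep their item_id.
select i.item_id,
       g.n            as bucket,
       count(r.score) as review_count
from (select distinct item_id from review) as i
cross join generate_series(1, 5) as g(n)
left join review as r
  on r.item_id = i.item_id
 and least(width_bucket(r.score, 0, 100, 5), 5) = g.n
group by i.item_id, g.n
order by i.item_id, g.n;
```

In the wide query a change of bin width only touches the width_bucket call, although the number of output columns is still fixed in the select list. The long query differs from the attempt in the question mainly in where the join starts: beginning from the item and bucket combinations, rather than from generate_series alone, is what lets empty buckets show up with a count of 0.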

Regarding postgresql - Generating a histogram of values grouped by a column, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51384531/
