gpt4 book ai didi

sql - 自连接、交叉连接和分组

转载 作者:行者123 更新时间:2023-11-29 15:07:23 26 4
gpt4 key购买 nike

我从多个来源获得了一段时间内的温度样本表,我想找到所有来源在设定时间间隔内的最低、最高和平均温度。乍一看,这很容易完成,如下所示:

SELECT MIN(temp), MAX(temp), AVG(temp) FROM samples GROUP BY time;

但是,如果源进出,事情会变得更加复杂(到了让我感到困惑的地步!),而不是在有问题的时间间隔内忽略丢失的源,我想使用源的最后知道的温度对于缺失的样本。使用日期时间并在随时间分布不均匀的样本中构建间隔(例如每分钟)会使事情变得更加复杂。

我认为应该可以通过在样本表上进行自连接来创建我想要的结果,其中第一个表的时间大于或等于第二个表的时间,然后计算聚合值按源分组的行。然而,我对如何实际做到这一点感到困惑。

这是我的测试表:

+------+------+------+
| time | source | temp |
+------+------+------+
| 1 | a | 20 |
| 1 | b | 18 |
| 1 | c | 23 |
| 2 | b | 21 |
| 2 | c | 20 |
| 2 | a | 18 |
| 3 | a | 16 |
| 3 | c | 13 |
| 4 | c | 15 |
| 4 | a | 4 |
| 4 | b | 31 |
| 5 | b | 10 |
| 5 | c | 16 |
| 5 | a | 22 |
| 6 | a | 18 |
| 6 | b | 17 |
| 7 | a | 20 |
| 7 | b | 19 |
+------+------+------+
INSERT INTO samples (time, source, temp) VALUES (1, 'a', 20), (1, 'b', 18), (1, 'c', 23), (2, 'b', 21), (2, 'c', 20), (2, 'a', 18), (3, 'a', 16), (3, 'c', 13), (4, 'c', 15), (4, 'a', 4), (4, 'b', 31), (5, 'b', 10), (5, 'c', 16), (5, 'a', 22), (6, 'a', 18), (6, 'b', 17), (7, 'a', 20), (7, 'b', 19);

为了进行最小值、最大值和平均值计算,我需要一个如下所示的中间表:

+------+------+------+
| time | source | temp |
+------+------+------+
| 1 | a | 20 |
| 1 | b | 18 |
| 1 | c | 23 |
| 2 | b | 21 |
| 2 | c | 20 |
| 2 | a | 18 |
| 3 | a | 16 |
| 3 | b | 21 |
| 3 | c | 13 |
| 4 | c | 15 |
| 4 | a | 4 |
| 4 | b | 31 |
| 5 | b | 10 |
| 5 | c | 16 |
| 5 | a | 22 |
| 6 | a | 18 |
| 6 | b | 17 |
| 6 | c | 16 |
| 7 | a | 20 |
| 7 | b | 19 |
| 7 | c | 16 |
+------+------+------+

以下查询使我接近我想要的结果,但它采用源第一个结果的温度值,而不是给定时间间隔的最新结果:

SELECT s.dt as sdt, s.mac, ss.temp, MAX(ss.dt) as maxdt FROM (SELECT DISTINCT dt FROM samples) AS s CROSS JOIN samples AS ss WHERE s.dt >= ss.dt GROUP BY sdt, mac HAVING maxdt <= s.dt ORDER BY sdt ASC, maxdt ASC;

+------+------+------+-------+
| sdt | mac | temp | maxdt |
+------+------+------+-------+
| 1 | a | 20 | 1 |
| 1 | c | 23 | 1 |
| 1 | b | 18 | 1 |
| 2 | a | 20 | 2 |
| 2 | c | 23 | 2 |
| 2 | b | 18 | 2 |
| 3 | b | 18 | 2 |
| 3 | a | 20 | 3 |
| 3 | c | 23 | 3 |
| 4 | a | 20 | 4 |
| 4 | c | 23 | 4 |
| 4 | b | 18 | 4 |
| 5 | a | 20 | 5 |
| 5 | c | 23 | 5 |
| 5 | b | 18 | 5 |
| 6 | c | 23 | 5 |
| 6 | a | 20 | 6 |
| 6 | b | 18 | 6 |
| 7 | c | 23 | 5 |
| 7 | b | 18 | 7 |
| 7 | a | 20 | 7 |
+------+------+------+-------+

更新: chadhoc(顺便说一句,很棒的名字!)提供了一个很好的解决方案,不幸的是它在 MySQL 中不起作用,因为它不支持 FULL JOIN他用。幸运的是,我相信一个简单的UNION是一个有效的替代品:

-- Unify the original samples with the missing values that we've calculated
(
SELECT time, source, temp
FROM samples
)
UNION
( -- Pull all the time/source combinations that we are missing from the sample set, along with the temp
-- from the last sampled interval for the same time/source combination if we do not have one
SELECT a.time, a.source, (SELECT t2.temp FROM samples AS t2 WHERE t2.time < a.time AND t2.source = a.source ORDER BY t2.time DESC LIMIT 1) AS temp
FROM
( -- All values we want to get should be a cross of time/temp
SELECT t1.time, s1.source
FROM
(SELECT DISTINCT time FROM samples) AS t1
CROSS JOIN
(SELECT DISTINCT source FROM samples) AS s1
) AS a
LEFT JOIN samples s
ON a.time = s.time
AND a.source = s.source
WHERE s.source IS NULL
)
ORDER BY time, source;

更新 2: MySQL 给出以下 EXPLAIN chadhoc 代码的输出:

+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------------------+
| 1 | PRIMARY | temp | ALL | NULL | NULL | NULL | NULL | 18 | |
| 2 | UNION | <derived4> | ALL | NULL | NULL | NULL | NULL | 21 | |
| 2 | UNION | s | ALL | NULL | NULL | NULL | NULL | 18 | Using where |
| 4 | DERIVED | <derived6> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 4 | DERIVED | <derived5> | ALL | NULL | NULL | NULL | NULL | 7 | |
| 6 | DERIVED | temp | ALL | NULL | NULL | NULL | NULL | 18 | Using temporary |
| 5 | DERIVED | temp | ALL | NULL | NULL | NULL | NULL | 18 | Using temporary |
| 3 | DEPENDENT SUBQUERY | t2 | ALL | NULL | NULL | NULL | NULL | 18 | Using where; Using filesort |
| NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL | Using filesort |
+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------------------+

我能够让 Charles 的代码像这样工作:

SELECT T.time, S.source,
COALESCE(
D.temp,
(
SELECT temp FROM samples
WHERE source = S.source AND time = (
SELECT MAX(time)
FROM samples
WHERE
source = S.source
AND time < T.time
)
)
) AS temp
FROM (SELECT DISTINCT time FROM samples) AS T
CROSS JOIN (SELECT DISTINCT source FROM samples) AS S
LEFT JOIN samples AS D
ON D.source = S.source AND D.time = T.time

其解释是:

+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------+
| 1 | PRIMARY | <derived5> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 7 | |
| 1 | PRIMARY | D | ALL | NULL | NULL | NULL | NULL | 18 | |
| 5 | DERIVED | temp | ALL | NULL | NULL | NULL | NULL | 18 | Using temporary |
| 4 | DERIVED | temp | ALL | NULL | NULL | NULL | NULL | 18 | Using temporary |
| 2 | DEPENDENT SUBQUERY | temp | ALL | NULL | NULL | NULL | NULL | 18 | Using where |
| 3 | DEPENDENT SUBQUERY | temp | ALL | NULL | NULL | NULL | NULL | 18 | Using where |
+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------+

最佳答案

我认为使用 mySql 中的排名/窗口函数会获得更好的性能,但不幸的是我不知道这些以及 TSQL 实现。这是一个符合 ANSI 的解决方案,可以正常工作:

-- Full join across the sample set and anything missing from the sample set, pulling the missing temp first if we do not have one
select coalesce(c1.[time], c2.[time]) as dt, coalesce(c1.source, c2.source) as source, coalesce(c2.temp, c1.temp) as temp
from samples c1
full join ( -- Pull all the time/source combinations that we are missing from the sample set, along with the temp
-- from the last sampled interval for the same time/source combination if we do not have one
select a.time, a.source,
(select top 1 t2.temp from samples t2 where t2.time < a.time and t2.source = a.source order by t2.time desc) as temp
from
( -- All values we want to get should be a cross of time/samples
select t1.[time], s1.source
from
(select distinct [time] from samples) as t1
cross join
(select distinct source from samples) as s1
) a
left join samples s
on a.[time] = s.time
and a.source = s.source
where s.source is null
) c2
on c1.time = c2.time
and c1.source = c2.source
order by dt, source

关于sql - 自连接、交叉连接和分组,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/1717766/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com