gpt4 book ai didi

SQL 连接一系列值(整数范围、日期范围等)

转载 作者:行者123 更新时间:2023-12-03 16:10:36 25 4
gpt4 key购买 nike

我有两个表,第一个是一个大表(数百万行),最有趣的列是一个整数,我将称之为“键”。我相信这个解决方案对于日期或日期时间范围也是相同的。

第二个表要小得多(数千行),有一堆我感兴趣的属性,这些属性是在一系列键上定义的。它具有以下结构:

key_lower_bound : 整数
key_upper_bound : 整数
Interest_value1 : 浮点数
有趣的值2:整数
有趣的值3:varchar(50)
...

我想查找第一个表中的所有值,并根据第一个表中的键是否落在区间 [key_lower_bound, key_upper_bound) 内将它们与第二个表“连接”。

这在数学上有点像稀疏内积或稀疏点积,但有点奇怪,因为第二个表中涉及这些范围。尽管如此,如果我要在代码中编写它,它将是一个 O(|first table| + |second table|) 算法。我会保留一个指向两个(已排序)列表的指针并遍历它们,以确定第一个表中的每个键是否属于第二个表的范围。诀窍是每次检查第一个表中的键时我不会遍历第二个列表,因为两个列表都已排序。

当我构建最明显的 SQL 查询时(包括检查键是 > key_lower_bound 和 < key_upper_bound),它花费的时间太长了。

这个幼稚的查询会发生某种二次行为,因为我认为查询引擎正在对第二个表中的每一行进行每次比较,而实际上,如果第二个表按 key_lower_bounds 排序,这应该是不必要的。所以我得到了 O(|first table| x |second table|) 类型的行为,而不是所需的 O(|first table| + |second table|) 行为。

是否可以通过线性 SQL 查询来执行此操作?

最佳答案

好吧,我已经解决了这个问题,并有一些建议。
但首先让我们填充助 watch

CREATE TABLE dbo.Numbers(n INT NOT NULL PRIMARY KEY)
GO
DECLARE @i INT;
SET @i = 1;
INSERT INTO dbo.Numbers(n) SELECT 1;
WHILE @i<1024000 BEGIN
INSERT INTO dbo.Numbers(n)
SELECT n + @i FROM dbo.Numbers;
SET @i = @i * 2;
END;
GO

和测试数据,一年每分钟一分钟广告,同年每分钟一个客户电话:
CREATE TABLE dbo.Commercials(
StartedAt DATETIME NOT NULL
CONSTRAINT PK_Commercials PRIMARY KEY,
EndedAt DATETIME NOT NULL,
CommercialName VARCHAR(30) NOT NULL);
GO
INSERT INTO dbo.Commercials(StartedAt, EndedAt, CommercialName)
SELECT DATEADD(minute, n - 1, '20080101')
,DATEADD(minute, n, '20080101')
,'Show #'+CAST(n AS VARCHAR(6))
FROM dbo.Numbers
WHERE n<=24*365*60;
GO
CREATE TABLE dbo.Calls(CallID INT
CONSTRAINT PK_Calls NOT NULL PRIMARY KEY,
AirTime DATETIME NOT NULL,
SomeInfo CHAR(300));
GO
INSERT INTO dbo.Calls(CallID,
AirTime,
SomeInfo)
SELECT n
,DATEADD(minute, n - 1, '20080101')
,'Call during Commercial #'+CAST(n AS VARCHAR(6))
FROM dbo.Numbers
WHERE n<=24*365*60;
GO
CREATE UNIQUE INDEX Calls_AirTime
ON dbo.Calls(AirTime) INCLUDE(SomeInfo);
GO

最初尝试选择年中三个小时的广告期间的所有电话是非常缓慢的:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO

SELECT COUNT(*) FROM(
SELECT s.StartedAt, s.EndedAt, c.AirTime
FROM dbo.Commercials s JOIN dbo.Calls c
ON c.AirTime >= s.StartedAt AND c.AirTime < s.EndedAt
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
) AS t;

SQL Server parse and compile time:
CPU time = 15 ms, elapsed time = 30 ms.

(1 row(s) affected)
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 3338264, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Commercials'. Scan count 2, logical reads 7166, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 71704 ms, elapsed time = 36316 ms.

原因很简单:我们知道广告是不重叠的,所以一个电话
最多适合一个商业广告,但优化器不知道。
我们知道广告很短,但优化器也不知道。
这两个假设都可以作为约束强制执行,但优化器不会仍然如此。

假设广告不超过 15 分钟,我们可以看出
优化器,并且查询非常快:
SELECT COUNT(*) FROM(
SELECT s.StartedAt, s.EndedAt, c.AirTime
FROM dbo.Commercials s JOIN dbo.Calls c
ON c.AirTime >= s.StartedAt AND c.AirTime < s.EndedAt
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
AND s.StartedAt BETWEEN '20080630 23:45' AND '20080701 03:00'
) AS t;

SQL Server parse and compile time:
CPU time = 15 ms, elapsed time = 15 ms.

(1 row(s) affected)
Table 'Worktable'. Scan count 1, logical reads 753, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Commercials'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 31 ms, elapsed time = 24 ms.

假设广告不重叠所以一个电话
最多适合一个广告,我们可以看出
到优化器,查询再次非常快:
SELECT COUNT(*) FROM(
SELECT s.StartedAt, s.EndedAt, c.AirTime
FROM dbo.Calls c CROSS APPLY(
SELECT TOP 1 s.StartedAt, s.EndedAt FROM dbo.Commercials s
WHERE c.AirTime >= s.StartedAt AND c.AirTime < s.EndedAt
ORDER BY s.StartedAt DESC) AS s
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
) AS t;

SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 7 ms.

(1 row(s) affected)
Table 'Commercials'. Scan count 181, logical reads 1327, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 31 ms, elapsed time = 31 ms.

关于SQL 连接一系列值(整数范围、日期范围等),我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/1060524/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com