gpt4 book ai didi

sql - 每组最大 n 个 SQL 查询的高性能方法

转载 作者:行者123 更新时间:2023-12-02 01:45:28 24 4
gpt4 key购买 nike

我正在尝试构建一个基础设施,以便根据需要快速运行回归,从包含我们网络服务器上所有历史事件的数据库中提取 apache 请求。为了通过确保我们仍然回归较小客户的请求来提高覆盖范围,我想通过为每个客户检索最多 n 个(为了这个问题,假设 10 个)请求来确保请求的分布。

我发现这里回答了许多类似的问题,其中最接近的似乎是 SQL query to return top N rows per ID across a range of IDs ,但答案大部分是与性能无关的解决方案我已经尝试过了。例如,row_number() 分析函数为我们提供了我们正在寻找的数据:

SELECT
*
FROM
(
SELECT
dailylogdata.*,
row_number() over (partition by dailylogdata.contextid order by occurrencedate) rn
FROM
dailylogdata
WHERE
shorturl in (?)
)
WHERE
rn <= 10;

但是,鉴于该表在给定日期包含数百万个条目,并且这种方法需要从索引中读取与我们的选择标准匹配的所有行才能应用 row_number 分析函数,因此性能非常糟糕。我们最终选择了近一百万行,却因为 row_number 超过 10 而丢弃了其中的绝大多数。执行上述查询的统计信息:

|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem | Used-Tmp||
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|| 0 | SELECT STATEMENT | | 1 | | 12222 |00:09:08.94 | 895K| 584K| 301 | | | | ||
||* 1 | VIEW | | 1 | 4427K| 12222 |00:09:08.94 | 895K| 584K| 301 | | | | ||
||* 2 | WINDOW SORT PUSHED RANK | | 1 | 4427K| 13536 |00:09:08.94 | 895K| 584K| 301 | 2709K| 743K| 97M (1)| 4096 ||
|| 3 | PARTITION RANGE SINGLE | | 1 | 4427K| 932K|00:22:27.90 | 895K| 584K| 0 | | | | ||
|| 4 | TABLE ACCESS BY LOCAL INDEX ROWID| DAILYLOGDATA | 1 | 4427K| 932K|00:22:27.61 | 895K| 584K| 0 | | | | ||
||* 5 | INDEX RANGE SCAN | DAILYLOGDATA_URLCONTEXT | 1 | 17345 | 932K|00:00:00.75 | 1448 | 0 | 0 | | | | ||
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| |
|Predicate Information (identified by operation id): |
|--------------------------------------------------- |
| |
| 1 - filter("RN"<=:SYS_B_2) |
| 2 - filter(ROW_NUMBER() OVER ( PARTITION BY "DAILYLOGDATA"."CONTEXTID" ORDER BY "OCCURRENCEDATE")<=:SYS_B_2) |
| 5 - access("SHORTURL"=:P1) |
| |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

但是,如果我们只查询特定 contextid 的前 10 个结果,我们可以更快地执行此操作:

SELECT
*
FROM
(
SELECT
dailylogdata.*
FROM
dailylogdata
WHERE
shorturl in (?)
and contextid = ?
)
WHERE
rownum <= 10;

运行此查询的统计信息:

|-------------------------------------------------------------------------------------------------------------------------|
|| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers ||
|-------------------------------------------------------------------------------------------------------------------------|
|| 0 | SELECT STATEMENT | | 1 | | 10 |00:00:00.01 | 14 ||
||* 1 | COUNT STOPKEY | | 1 | | 10 |00:00:00.01 | 14 ||
|| 2 | PARTITION RANGE SINGLE | | 1 | 10 | 10 |00:00:00.01 | 14 ||
|| 3 | TABLE ACCESS BY LOCAL INDEX ROWID| DAILYLOGDATA | 1 | 10 | 10 |00:00:00.01 | 14 ||
||* 4 | INDEX RANGE SCAN | DAILYLOGDATA_URLCONTEXT | 1 | 1 | 10 |00:00:00.01 | 5 ||
|-------------------------------------------------------------------------------------------------------------------------|
| |
|Predicate Information (identified by operation id): |
|--------------------------------------------------- |
| |
| 1 - filter(ROWNUM<=10) |
| 4 - access("SHORTURL"=:P1 AND "CONTEXTID"=TO_NUMBER(:P2)) |
| |
+-------------------------------------------------------------------------------------------------------------------------+

在这种情况下,Oracle 足够聪明,可以在获得 10 个结果后停止检索数据。我可以收集一整套contextid,并以编程方式生成一个查询,该查询由每个contextid的一个查询实例组成,并将整个困惑的情况联合在一起,但考虑到数量庞大如果没有 contextid,我们可能会遇到 Oracle 内部限制,即使没有,这种方法也有混杂的味道。

有谁知道一种方法可以保持第一个查询的简单性,同时保留与第二个查询相称的性能?另请注意,我实际上并不关心检索一组稳定的行;只要它们满足我的标准,就可以用于回归。

编辑: Adam Musch 的建议成功了。我将性能结果与他的更改一起附加到此处,因为我无法将它们放入对他的答案的评论响应中。这次我还使用了更大的数据集进行测试,以下是我原始 row_number 方法的(缓存的)统计数据以进行比较:

|-------------------------------------------------------------------------------------------------------------------------------------------------|
|| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | OMem | 1Mem | Used-Mem ||
|-------------------------------------------------------------------------------------------------------------------------------------------------|
|| 0 | SELECT STATEMENT | | 1 | | 12624 |00:00:22.34 | 1186K| 931K| | | ||
||* 1 | VIEW | | 1 | 1163K| 12624 |00:00:22.34 | 1186K| 931K| | | ||
||* 2 | WINDOW NOSORT | | 1 | 1163K| 1213K|00:00:21.82 | 1186K| 931K| 3036M| 17M| ||
|| 3 | TABLE ACCESS BY INDEX ROWID| TWTEST | 1 | 1163K| 1213K|00:00:20.41 | 1186K| 931K| | | ||
||* 4 | INDEX RANGE SCAN | TWTEST_URLCONTEXT | 1 | 1163K| 1213K|00:00:00.81 | 8568 | 0 | | | ||
|-------------------------------------------------------------------------------------------------------------------------------------------------|
| |
|Predicate Information (identified by operation id): |
|--------------------------------------------------- |
| |
| 1 - filter("RN"<=10) |
| 2 - filter(ROW_NUMBER() OVER ( PARTITION BY "CONTEXTID" ORDER BY NULL )<=10) |
| 4 - access("SHORTURL"=:P1) |
+-------------------------------------------------------------------------------------------------------------------------------------------------+

我冒昧地稍微简化了 Adam 的建议;这是修改后的查询...

select
*
from
twtest
where
rowid in (
select
rowid
from (
select
rowid,
shorturl,
row_number() over (partition by shorturl, contextid
order by null) rn
from
twtest
)
where rn <= 10
and shorturl in (?)
);

...以及来自其(缓存)评估的统计数据:

|--------------------------------------------------------------------------------------------------------------------------------------|
|| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem ||
|--------------------------------------------------------------------------------------------------------------------------------------|
|| 0 | SELECT STATEMENT | | 1 | | 12624 |00:00:01.33 | 19391 | | | ||
|| 1 | NESTED LOOPS | | 1 | 1 | 12624 |00:00:01.33 | 19391 | | | ||
|| 2 | VIEW | VW_NSO_1 | 1 | 1163K| 12624 |00:00:01.27 | 6770 | | | ||
|| 3 | HASH UNIQUE | | 1 | 1 | 12624 |00:00:01.27 | 6770 | 1377K| 1377K| 5065K (0)||
||* 4 | VIEW | | 1 | 1163K| 12624 |00:00:01.25 | 6770 | | | ||
||* 5 | WINDOW NOSORT | | 1 | 1163K| 1213K|00:00:01.09 | 6770 | 283M| 5598K| ||
||* 6 | INDEX RANGE SCAN | TWTEST_URLCONTEXT | 1 | 1163K| 1213K|00:00:00.40 | 6770 | | | ||
|| 7 | TABLE ACCESS BY USER ROWID| TWTEST | 12624 | 1 | 12624 |00:00:00.04 | 12621 | | | ||
|--------------------------------------------------------------------------------------------------------------------------------------|
| |
|Predicate Information (identified by operation id): |
|--------------------------------------------------- |
| |
| 4 - filter("RN"<=10) |
| 5 - filter(ROW_NUMBER() OVER ( PARTITION BY "SHORTURL","CONTEXTID" ORDER BY NULL NULL )<=10) |
| 6 - access("SHORTURL"=:P1) |
| |
|Note |
|----- |
| - dynamic sampling used for this statement (level=2) |
| |
+--------------------------------------------------------------------------------------------------------------------------------------+

正如宣传的那样,我们仅访问 dailylogdata 表中完全过滤的行。我担心它似乎仍然根据它声称要选择的行数(1213K)对 urlcontext 索引进行全面扫描,但考虑到它只使用 6770 个缓冲区(即使我增加特定于上下文的结果的数量,这个数字仍然保持不变)这可能会产生误导。

最佳答案

这是一种笨拙的解决方案,但它似乎可以实现您想要的功能:尽快缩短索引扫描,并且在通过过滤条件和顶部都合格之前不读取数据-n 查询条件。

请注意,它是使用 shorturl = 条件而不是 shorturl IN 条件进行测试的。

with rowid_list as
(select rowid
from (select *
from (select rowid,
row_number() over (partition by shorturl, contextid
order by null) rn
from dailylogdata
)
where rn <= 10
)
where shorturl = ?
)
select *
from dailylogdata
where rowid in (select rowid from rowid_list)

with 子句获取前 10 个 rowid,为符合您条件的 shorturlcontextid 的每个唯一组合进行 WINDOW NOSORT 过滤。然后它循环该组 rowid,通过 rowid 获取每个 rowid。

----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 286 | 1536 (1)| 00:00:19 |
| 1 | NESTED LOOPS | | 1 | 286 | 1536 (1)| 00:00:19 |
| 2 | VIEW | VW_NSO_1 | 136K| 1596K| 910 (1)| 00:00:11 |
| 3 | HASH UNIQUE | | 1 | 3326K| | |
|* 4 | VIEW | | 136K| 3326K| 910 (1)| 00:00:11 |
|* 5 | WINDOW NOSORT | | 136K| 2794K| 910 (1)| 00:00:11 |
|* 6 | INDEX RANGE SCAN | TABLE_REDACTED_INDEX | 136K| 2794K| 910 (1)| 00:00:11 |
| 7 | TABLE ACCESS BY USER ROWID| TABLE_REDACTED | 1 | 274 | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

4 - filter("RN"<=10)
5 - filter(ROW_NUMBER() OVER ( PARTITION BY "CLIENT_ID","SCE_ID" ORDER BY NULL NULL
)<=10)
6 - access("TABLE_REDACTED"."SHORTURL"=:b1)

关于sql - 每组最大 n 个 SQL 查询的高性能方法,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/9485124/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com