gpt4 book ai didi

sql - 在Oracle中独立高效地从多列中找到top-N值

转载 作者:行者123 更新时间:2023-12-04 23:07:45 27 4
gpt4 key购买 nike

假设我有 300 亿行和多列,我想独立地有效地找到前 N 个最频繁的值,并使用最优雅的 SQL。例如,如果我有

FirstName LastName FavoriteAnimal FavoriteBook
--------- -------- -------------- ------------
Ferris Freemont Possum Ubik
Nancy Freemont Lemur Housekeeping
Nancy Drew Penguin Ubik
Bill Ribbits Lemur Dhalgren

我想要前 1,那么结果将是:
FirstName LastName FavoriteAnimal FavoriteBook
--------- -------- -------------- ------------
Nancy Freemont Lemur Ubik

我可能会想办法做到这一点,但不确定它们是否是最佳的,这在有 300 亿行时很重要;并且 SQL 可能又大又丑,而且可能会使用太多的临时空间。

使用甲骨文。

最佳答案

这应该只在 table 上做一次。您可以使用 count() 的解析版本独立获取每个值的频率:

select firstname, count(*) over (partition by firstname) as c_fn,
lastname, count(*) over (partition by lastname) as c_ln,
favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
favoritebook, count(*) over (partition by favoritebook) as c_fb
from my_table;

FIRSTN C_FN LASTNAME C_LN FAVORIT C_FA FAVORITEBOOK C_FB
------ ---- -------- ---- ------- ---- ------------ ----
Bill 1 Ribbits 1 Lemur 2 Dhalgren 1
Ferris 1 Freemont 2 Possum 1 Ubik 2
Nancy 2 Freemont 2 Lemur 2 Housekeeping 1
Nancy 2 Drew 1 Penguin 1 Ubik 2

然后,您可以将其用作 CTE(或子查询分解,我认为在 oracle 术语中)并仅从每列中提取最高频率值:
with tmp_tab as (
select /*+ MATERIALIZE */
firstname, count(*) over (partition by firstname) as c_fn,
lastname, count(*) over (partition by lastname) as c_ln,
favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
favoritebook, count(*) over (partition by favoritebook) as c_fb
from my_table)
select (select firstname from (
select firstname,
row_number() over (partition by null order by c_fn desc) as r_fn
from tmp_tab
) where r_fn = 1) as firstname,
(select lastname from (
select lastname,
row_number() over (partition by null order by c_ln desc) as r_ln
from tmp_tab
) where r_ln = 1) as lastname,
(select favoriteanimal from (
select favoriteanimal,
row_number() over (partition by null order by c_fa desc) as r_fa
from tmp_tab
) where r_fa = 1) as favoriteanimal,
(select favoritebook from (
select favoritebook,
row_number() over (partition by null order by c_fb desc) as r_fb
from tmp_tab
) where r_fb = 1) as favoritebook
from dual;

FIRSTN LASTNAME FAVORIT FAVORITEBOOK
------ -------- ------- ------------
Nancy Freemont Lemur Ubik

您正在对每一列的 CTE 进行一次传递,但这仍然应该只命中真正的表一次(感谢 materialize 提示)。您可能想添加到 order by条款来调整如果有关系怎么办。

这在概念上与 Thilo、ysth 和其他人的建议相似,只是您让 Oracle 跟踪所有计数。

编辑:嗯,解释计划显示它做了四次全表扫描;可能需要多考虑一下……
编辑 2:添加(未记录) MATERIALIZE提示 CTE 似乎解决了这个问题;它正在创建一个临时临时表来保存结果,并且只执行一次全表扫描。不过,解释计划的成本更高——至少在这次样本数据集上是这样。对这样做的任何缺点的任何评论感兴趣。

关于sql - 在Oracle中独立高效地从多列中找到top-N值,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/7278905/

27 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com