
sql - How to ensure deterministic results for queries that use mode() in Snowflake

Reposted. Author: 行者123. Updated: 2023-12-04 02:40:19

I use Snowflake, and I want to use multiple mode() expressions in a single select statement. It looks like this:

SELECT
x,
y,
mode(col1),
mode(col2),
...
mode(col15)
FROM table
GROUP BY x, y

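My problem is that it produces non-deterministic output when there are ties.
The documentation does not explain exactly how ties are resolved. It only says: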

If there is a tie for most frequent value (two or more values occur as frequently as each other, and more frequently than any other value), MODE returns one of those values.



https://docs.snowflake.net/manuals/sql-reference/functions/mode.html
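I need a workaround that gives me an equivalent of mode() that always produces deterministic output.
Something like: use mode(), but when several values in a column are tied, pick the first value.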

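I am not providing an example that reproduces the non-deterministic results, because it seems to happen only with larger datasets or complex queries.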

Best answer

So mode() appears to prefer the first value it sees when breaking a tie.

with data as (
select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
)
select x
,mode(col1)
,mode(col2)
,mode(col3)
from data
group by 1
order by 1;

Swapping which value comes first in the 2/20 or 3/30 pairs shows this.
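A sketch of that experiment, assuming the same inline data as above with only the order of the first two x = 4 rows changed, so that 2 and 3 are listed before 20 and 30. Whatever this returns is observed behavior only: the documentation quoted in the question does not promise any tie-breaking order, so no expected output is given.

```sql
-- Same data as the query above, except that the first two x = 4 rows are
-- swapped, so col2 = 2 and col3 = 3 now come before col2 = 20 and col3 = 30.
with data as (
    select x, col1, col2, col3
    from values
        (1, 1, 1, 3), (1, 1, 2, 3), (1, 2, 2, 3),
        (4, 1, 2, 3), (4, 1, 20, 30), (4, 2, 2, 30), (4, 2, 20, 3)
        v(x, col1, col2, col3)
)
select x
    ,mode(col1)
    ,mode(col2)
    ,mode(col3)
from data
group by 1
order by 1;
```

Comparing the x = 4 row of this result with the original query is what suggests the "first value seen" behavior, but it should not be relied on for larger tables or more complex plans.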

So, building a pattern to try to solve this in one expression, start by counting how often each value occurs per x:
with data as (
select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
)
select x
,col1
,col2
,col3
,count(col1)over(partition by x,col1) c_col1
,count(col2)over(partition by x,col2) c_col2
,count(col3)over(partition by x,col3) c_col3
from data ;

That lends itself to ranking the rows within each x, by count and then by value:
with data as (
select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
)
select x
,col1
,col2
,col3
,row_number() over (partition by x order by c_col1 desc, col1) as r1
,row_number() over (partition by x order by c_col2 desc, col2) as r2
,row_number() over (partition by x order by c_col3 desc, col3) as r3
from (
select x
,col1
,col2
,col3
,count(col1)over(partition by x,col1) c_col1
,count(col2)over(partition by x,col2) c_col2
,count(col3)over(partition by x,col3) c_col3
from data
)
order by 1;

Though it gives these results:
X  COL1  COL2  COL3  R1  R2  R3
1  1     2     3     2   1   1
1  2     2     3     3   2   2
1  1     1     3     1   3   3
4  1     2     3     2   1   1
4  2     20    3     4   4   2
4  2     2     30    3   2   3
4  1     20    30    1   3   4

However, you cannot use logic like this:
QUALIFY row_number() over (partition by x order by c_col1 desc, col1) = 1
AND row_number() over (partition by x order by c_col2 desc, col2) = 1
AND row_number() over (partition by x order by c_col3 desc, col3) = 1

That cannot pick the best values, because the best row for each column is not the same row.
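One possible alternative, which is not part of the original answer: each of r1, r2 and r3 equals 1 on exactly one row per x, so conditional aggregation can pull each column's winner from its own row without a filter that requires them to coincide. A minimal sketch, assuming the same sample data; the CTE names counted and ranked and the col*_mode aliases are made up for illustration.

```sql
with data as (
    select x, col1, col2, col3
    from values
        (1, 1, 1, 3), (1, 1, 2, 3), (1, 2, 2, 3),
        (4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4, 2, 20, 3)
        v(x, col1, col2, col3)
), counted as (
    -- how often each value occurs within its x group
    select x, col1, col2, col3
        ,count(col1) over (partition by x, col1) as c_col1
        ,count(col2) over (partition by x, col2) as c_col2
        ,count(col3) over (partition by x, col3) as c_col3
    from data
), ranked as (
    -- rank 1 = most frequent value, ties broken by the smallest value
    select x, col1, col2, col3
        ,row_number() over (partition by x order by c_col1 desc, col1) as r1
        ,row_number() over (partition by x order by c_col2 desc, col2) as r2
        ,row_number() over (partition by x order by c_col3 desc, col3) as r3
    from counted
)
select x
    -- each CASE is non-NULL on exactly one row per x, so MAX just picks it up
    ,max(case when r1 = 1 then col1 end) as col1_mode
    ,max(case when r2 = 1 then col2 end) as col2_mode
    ,max(case when r3 = 1 then col3 end) as col3_mode
from ranked
group by x
order by x;
```

For the sample data this should return 1, 2, 3 for both x = 1 and x = 4. The trade-off is two window passes over every row instead of one small aggregate per column.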

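Because the best rows do not line up, you end up with one CTE (or subquery) per column, which is very similar to the pattern Gordon showed.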
with data as (
select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
),col1_m as (
select x, col1, count(*) as c
from data
group by 1,2
QUALIFY row_number() over (partition by x order by c desc, col1) = 1
),col2_m as (
select x, col2, count(*) as c
from data
group by 1,2
QUALIFY row_number() over (partition by x order by c desc, col2) = 1
),col3_m as (
select x, col3, count(*) as c
from data
group by 1,2
QUALIFY row_number() over (partition by x order by c desc, col3) = 1
), base as (
select distinct x from data
)
select b.x
,c1.col1
,c2.col2
,c3.col3
from base as b
left join col1_m as c1 on b.x = c1.x
left join col2_m as c2 on b.x = c2.x
left join col3_m as c3 on b.x = c3.x
order by 1;

This gives the results you would expect:
X  COL1  COL2  COL3
1  1     2     3
4  1     2     3

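But you will need to extend x to the full set of columns you group by (x, y, ...) in every partition by, group by and join condition.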
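A minimal sketch of that extension, assuming a second grouping column y as in the question and only two value columns to keep it short. The sample rows are hypothetical and are not taken from the original post.

```sql
with data as (
    -- hypothetical sample rows, grouped by (x, y)
    select x, y, col1, col2
    from values
        (1, 'a', 1, 10), (1, 'a', 2, 10), (1, 'a', 2, 20), (1, 'a', 1, 20),
        (1, 'b', 5, 7), (1, 'b', 5, 8), (1, 'b', 6, 8)
        v(x, y, col1, col2)
), col1_m as (
    select x, y, col1, count(*) as c
    from data
    group by 1, 2, 3
    -- keep the most frequent col1 per (x, y); ties go to the smallest value
    qualify row_number() over (partition by x, y order by c desc, col1) = 1
), col2_m as (
    select x, y, col2, count(*) as c
    from data
    group by 1, 2, 3
    qualify row_number() over (partition by x, y order by c desc, col2) = 1
), base as (
    select distinct x, y from data
)
select b.x
    ,b.y
    ,c1.col1
    ,c2.col2
from base as b
left join col1_m as c1 on b.x = c1.x and b.y = c1.y
left join col2_m as c2 on b.x = c2.x and b.y = c2.y
order by 1, 2;
```

With these rows, both columns are tied in group (1, 'a') and resolve to the smaller values 1 and 10, while group (1, 'b') has the clear winners 5 and 8.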

Regarding sql - How to ensure deterministic results for queries that use mode() in Snowflake, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59719757/
