sql - How to ensure deterministic results for queries that use mode() in Snowflake

Reposted by: 行者123 Updated: 2023-12-04 02:40:19

I'm using Snowflake, and I want to use several mode() expressions in a single SELECT statement. It looks something like this:

SELECT
x,
y,
mode(col1),
mode(col2),
...
mode(col15)
FROM table
GROUP BY x, y

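My problem is that this produces non-deterministic output when there are ties.
The documentation does not explain exactly how ties are resolved. It only says: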

If there is a tie for most frequent value (two or more values occur as frequently as each other, and more frequently than any other value), MODE returns one of those values.

https://docs.snowflake.net/manuals/sql-reference/functions/mode.html
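I need a workaround that gives me an equivalent of mode() that always produces deterministic output.
Something like: use mode(), but when values in a column tie, pick the first value.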

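I'm not providing an example that reproduces the non-deterministic results, because it only seems to happen with larger data sets or more complex queries.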

Best answer

So MODE seems to prefer the first value it sees when breaking a tie.

with data as (
select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
    ,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
)
select x
    ,mode(col1)
    ,mode(col2)
    ,mode(col3)
from data 
group by 1
order by 1;

Swapping the first value of the 2/20 or 3/30 pairs shows this.
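A sketch of that swap, assuming the same sample data: the x = 4 rows are reordered here (an illustrative rearrangement, not taken from the answer) so that 2 and 3 are seen before 20 and 30. Comparing its output with the query above is one way to observe the first-seen preference, although that behaviour is undocumented and nothing guarantees it on larger tables or more complex plans.

```sql
-- Same rows as above; only the order of the x = 4 rows has changed,
-- so col2 = 2 and col3 = 3 now appear before 20 and 30.
with data as (
select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
    ,(4, 1, 2, 3), (4, 1, 20, 30), (4, 2, 20, 3), (4, 2, 2, 30) v(x,col1,col2,col3)
)
select x
    ,mode(col1)
    ,mode(col2)
    ,mode(col3)
from data
group by 1
order by 1;
```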

So, building up a pattern to try to solve this in a single expression, start by counting how often each value occurs:

with data as (
select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
    ,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
)
select x
    ,col1
    ,col2
    ,col3
    ,count(col1)over(partition by x,col1) c_col1
    ,count(col2)over(partition by x,col2) c_col2
    ,count(col3)over(partition by x,col3) c_col3
from data ;

This lends itself to ranking each column's values by those counts:

with data as (
select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
    ,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
)
select x
    ,col1
    ,col2
    ,col3 
    ,row_number() over (partition by x order by c_col1 desc, col1) as r1
    ,row_number() over (partition by x order by c_col2 desc, col2) as r2
    ,row_number() over (partition by x order by c_col3 desc, col3) as r3
from (
  select x
      ,col1
      ,col2
      ,col3
      ,count(col1)over(partition by x,col1) c_col1
      ,count(col2)over(partition by x,col2) c_col2
      ,count(col3)over(partition by x,col3) c_col3
  from data 
)
order by 1;

But with these results:

X   COL1  COL2  COL3  R1  R2  R3
1   1     2     3     2   1   1
1   2     2     3     3   2   2
1   1     1     3     1   3   3
4   1     2     3     2   1   1
4   2     20    3     4   4   2
4   2     2     30    3   2   3
4   1     20    30    1   3   4

you can't use logic like this:

QUALIFY row_number() over (partition by x order by c_col1 desc, col1) = 1
  AND row_number() over (partition by x order by c_col2 desc, col2) = 1
  AND row_number() over (partition by x order by c_col3 desc, col3 desc) = 1

to pick the best values, because the best row for each column is not the same row.
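Because each column's rank-1 row is a different row, one possible way around the misalignment (a sketch, not part of the original answer) is to stop filtering whole rows and instead pull each column's value out of its own rank-1 row with conditional aggregation. The CTE names `counted` and `ranked` and the `*_mode` aliases are illustrative choices.

```sql
with data as (
select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
    ,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
), counted as (
    -- how often each value occurs within its x group
    select x, col1, col2, col3
        ,count(col1) over (partition by x, col1) as c_col1
        ,count(col2) over (partition by x, col2) as c_col2
        ,count(col3) over (partition by x, col3) as c_col3
    from data
), ranked as (
    -- rank 1 = most frequent value; ties go to the smallest value
    select x, col1, col2, col3
        ,row_number() over (partition by x order by c_col1 desc, col1) as r1
        ,row_number() over (partition by x order by c_col2 desc, col2) as r2
        ,row_number() over (partition by x order by c_col3 desc, col3) as r3
    from counted
)
-- exactly one row per x has r1 = 1 (likewise r2, r3), so max() just
-- collapses that single non-null value into the grouped row
select x
    ,max(case when r1 = 1 then col1 end) as col1_mode
    ,max(case when r2 = 1 then col2 end) as col2_mode
    ,max(case when r3 = 1 then col3 end) as col3_mode
from ranked
group by x
order by x;
```

Which of several rows with the same value receives rank 1 is still arbitrary, but they all carry the same value, so the result does not depend on it. For the sample data the ties resolve to the smallest value, giving 1, 2, 3 for both x = 1 and x = 4.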

This leads to a CTE (or subquery) per column, which is very similar to the pattern Gordon showed.

with data as (
select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
    ,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
),col1_m as (
    select x, col1, count(*) as c 
    from data 
    group by 1,2
    QUALIFY row_number() over (partition by x order by c desc, col1) = 1
),col2_m as (
    select x, col2, count(*) as c 
    from data 
    group by 1,2
    QUALIFY row_number() over (partition by x order by c desc, col2) = 1
),col3_m as (
    select x, col3, count(*) as c 
    from data 
    group by 1,2
    QUALIFY row_number() over (partition by x order by c desc, col3) = 1
), base as (
select distinct x from data
)
select b.x
    ,c1.col1
    ,c2.col2
    ,c3.col3
from base as b
left join col1_m as c1 on b.x = c1.x
left join col2_m as c2 on b.x = c2.x
left join col3_m as c3 on b.x = c3.x
order by 1;

This gives the results you would expect:

X   COL1  COL2  COL3
1   1     2     3
4   1     2     3

But you will need to extend X to the set of things you care about (x, y, ...), and so on.
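As a minimal sketch of that extension, assuming a second grouping key `y` as in the question: every `group by`, `partition by` and join condition picks up the extra key. The sample rows and the two-column scope are made up for illustration.

```sql
with data as (
    -- hypothetical sample rows with two grouping keys (x, y)
    select x, y, col1, col2 from values (1, 'a', 1, 1), (1, 'a', 1, 2), (1, 'a', 2, 2)
        ,(1, 'b', 5, 7), (1, 'b', 6, 7) v(x,y,col1,col2)
),col1_m as (
    select x, y, col1, count(*) as c
    from data
    group by 1,2,3
    QUALIFY row_number() over (partition by x, y order by c desc, col1) = 1
),col2_m as (
    select x, y, col2, count(*) as c
    from data
    group by 1,2,3
    QUALIFY row_number() over (partition by x, y order by c desc, col2) = 1
), base as (
select distinct x, y from data
)
select b.x
    ,b.y
    ,c1.col1
    ,c2.col2
from base as b
left join col1_m as c1 on b.x = c1.x and b.y = c1.y
left join col2_m as c2 on b.x = c2.x and b.y = c2.y
order by 1, 2;
```

In the (1, 'b') group, col1 values 5 and 6 each occur once, and the tie resolves to 5 because of the `order by c desc, col1`. Two caveats worth checking against real data: `count(*)` grouped by the column treats NULL as a candidate value, so a `where col1 is not null` in each CTE may be needed to keep NULL from winning, and the `=` joins will not match rows whose grouping keys are NULL.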

Regarding sql - How to ensure deterministic results for queries that use mode() in Snowflake, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59719757/
