gpt4 book ai didi

google-bigquery - 如何在 BigQuery 中模拟线性回归 (REGR_) 函数?

转载 作者:行者123 更新时间:2023-12-01 12:24:03 24 4
gpt4 key购买 nike

某些 SQL 数据库使用称为 REGR_SLOPE、REGR_R2、REGR_SXX 等的一系列函数提供成对值的线性回归。BigQuery 用户如何模拟这些函数?

最佳答案

编辑编辑:请参阅linear regression is now natively supported in BigQuery . --Fh


REGR_ 函数可以转换为涉及 BigQuery 实现的聚合的表达式,但在处理 NULL 时应格外小心。 REGR_ 函数接受由一个自变量组成的对,我们称它为“x”和一个因变量“y”,并忽略其中任一变量为 NULL 的对。

以下查询一次计算所有 REGR_ 表达式。它使用 BigQuery 标准 SQL 编写,并利用用户定义的 SQL 函数使查询更加简洁。通过测试表达式中的 x 和 y 是否为 NULL,此版本保留了键。换句话说,如果某个键的所有对都包含 NULL,则结果将为 NULL。

CREATE TEMPORARY FUNCTION NonNullX(y FLOAT64, x FLOAT64)
RETURNS FLOAT64
AS (IF(x IS NOT NULL AND y IS NOT NULL, x, NULL));

CREATE TEMPORARY FUNCTION NonNullY(y FLOAT64, x FLOAT64)
RETURNS FLOAT64
AS (IF(x IS NOT NULL AND y IS NOT NULL, y, NULL));

CREATE TEMPORARY FUNCTION NonNullCount(y FLOAT64, x FLOAT64)
RETURNS INT64
AS (IF(x IS NOT NULL AND y IS NOT NULL, 1, 0));

CREATE TEMPORARY FUNCTION REGR_R2_COEF(y_var FLOAT64, x_var FLOAT64, xy_corr FLOAT64)
RETURNS FLOAT64
AS (IF(x_var = 0, NULL, IF(y_var = 0 AND x_var <> 0, 1, POWER(xy_corr, 2))));

SELECT k,
AVG(NonNullX(y, x)) AS REGR_AVGX,
AVG(NonNullY(y, x)) AS REGR_AVGY,
SUM(NonNullCount(y, x)) AS REGR_COUNT,
AVG(NonNullY(y, x))-(COVAR_POP(NonNullY(y, x), NonNullX(y, x))/VAR_POP(NonNullX(y, x)))*AVG(NonNullX(y, x)) AS REGR_INTERCEPT,
REGR_R2_COEF(VAR_POP(NonNullY(y, x)), VAR_POP(NonNullX(y, x)), CORR(NonNullY(y, x), NonNullX(y, x))) AS REGR_R2,
COVAR_POP(NonNullY(y, x), NonNullX(y, x))/VAR_POP(NonNullX(y, x)) AS REGR_SLOPE,
SUM(NonNullCount(y, x)) * VAR_POP(NonNullX(y, x)) AS REGR_SXX,
SUM(NonNullCount(y, x)) * COVAR_POP(NonNullY(y, x), NonNullX(y, x)) AS REGR_SXY,
SUM(NonNullCount(y, x)) * VAR_POP(NonNullY(y, x)) AS REGR_SYY
FROM Temp.SimpleRegressionData
GROUP BY k;

如果您不关心 NULL 结果,您可以在 WHERE 子句中过滤掉包含 NULL 的对,聚合表达式将简化如下:

CREATE TEMPORARY FUNCTION REGR_R2_COEF(y_var FLOAT64, x_var FLOAT64, xy_corr FLOAT64)
RETURNS FLOAT64
AS (IF(x_var = 0, NULL, IF(y_var = 0 AND x_var <> 0, 1, POWER(xy_corr, 2))));

SELECT k,
AVG(x) AS REGR_AVGX,
AVG(y) AS REGR_AVGY,
SUM(1) AS REGR_COUNT,
AVG(y)-(COVAR_POP(y, x)/VAR_POP(x))*AVG(x) AS REGR_INTERCEPT,
REGR_R2_COEF(VAR_POP(y), VAR_POP(x), CORR(y, x)) AS REGR_R2,
COVAR_POP(y, x)/VAR_POP(x) AS REGR_SLOPE,
SUM(1) * VAR_POP(x) AS REGR_SXX,
SUM(1) * COVAR_POP(y, x) AS REGR_SXY,
SUM(1) * VAR_POP(y) AS REGR_SYY
FROM Temp.SimpleRegressionData
WHERE x IS NOT NULL AND Y IS NOT NULL
GROUP BY k;

请注意,如果某个键的所有非 NULL x 值都相等,则这些查询中的任何一个都可能产生被零除的错误。

关于google-bigquery - 如何在 BigQuery 中模拟线性回归 (REGR_) 函数?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/41625819/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com