
sql - Amazon Redshift SQL: remove top 1% of data as outliers

Reposted. Author: 行者123. Updated: 2023-11-29 14:20:19

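I am trying to remove the top 1% of my data, because we believe these outliers heavily skew the results. I tried SELECT TOP 99 PERCENT, but Amazon Redshift does not support percentages with TOP.

I tried something like this: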

WITH
elapsed_times AS (
    SELECT
        COALESCE(anonymous_id, distinct_id) as id,
        elapsed_time
    FROM studio_production.interaction
    WHERE project_id = '55062b464a9bc578006987ff'
),
max_elapsed_time AS (
    SELECT elapsed_time
    FROM elapsed_times
    ORDER BY elapsed_time ASC
    OFFSET ROUND(0.99 * (SELECT COUNT(*) FROM elapsed_times))
    LIMIT 1
),
user_times AS (
    SELECT
        id,
        LEAST(elapsed_time, max_elapsed_time) as elapsed_time
    FROM elapsed_times
    GROUP BY 1
)

SELECT AVG(elapsed_time) FROM user_times

But I get: argument of OFFSET must not contain subqueries

So my query is now:

WITH
elapsed_times AS (
    SELECT
        COALESCE(anonymous_id, distinct_id) as id,
        elapsed_time,
        RANK() OVER (ORDER BY elapsed_time ASC) as rnk
    FROM studio_production.interaction
    WHERE project_id = '55062b464a9bc578006987ff'
),
user_times AS (
    SELECT
        id,
        LEAST(MAX(elapsed_time), (
            SELECT MIN(elapsed_time)
            FROM elapsed_times
            WHERE rnk > ROUND(0.99 * (SELECT COUNT(*) FROM elapsed_times))
        )) as elapsed_time
    FROM elapsed_times
    GROUP BY 1
)

SELECT AVG(elapsed_time) FROM user_times

This is really slow. What is the right way to approach this?

Best answer

You can use ntile():

select avg(elapsed_time)
from (select et.*,
             ntile(100) over (order by elapsed_time) as thetile
      from elapsed_times et
     ) et
where thetile not in (1, 100);
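
The query above drops both the bottom and the top tile. Since the question only asks to remove the top 1%, here is a minimal sketch of a top-only variant. It assumes the table, columns, and project_id from the question, and inlines the question's elapsed_times CTE so the statement can be run on its own:

```sql
with elapsed_times as (
    -- same source rows as in the question
    select coalesce(anonymous_id, distinct_id) as id,
           elapsed_time
    from studio_production.interaction
    where project_id = '55062b464a9bc578006987ff'
)
select avg(elapsed_time)
from (select et.*,
             -- split the rows into 100 buckets of near-equal size, smallest values first
             ntile(100) over (order by elapsed_time) as thetile
      from elapsed_times et
     ) et
where thetile < 100;  -- keep tiles 1..99, drop only the highest tile
```

When the row count is not a multiple of 100, ntile() gives the extra rows to the lower-numbered tiles, so the top tile holds slightly less than 1% of the rows. With fewer than 100 rows, tile 100 is never assigned and nothing is dropped.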

Edit:

I admit that I usually do this with row_number() and count():

select avg(elapsed_time)
from (select et.*,
             row_number() over (order by elapsed_time) as seqnum,
             count(*) over () as cnt
      from elapsed_times et
     ) et
-- keep only the rows strictly between the bottom 1% and the top 1%
where seqnum > 0.01 * cnt and seqnum < 0.99 * cnt;
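
The queries in the question do not drop rows. They cap each user's time at the 99th-percentile value with LEAST(). The following is a rough sketch of that capping variant, built on the same row_number() and count() idea. It assumes the table and columns from the question; the ranked and cutoff CTE names are made up for illustration:

```sql
with elapsed_times as (
    select coalesce(anonymous_id, distinct_id) as id,
           elapsed_time
    from studio_production.interaction
    where project_id = '55062b464a9bc578006987ff'
),
ranked as (
    -- number every row once, and carry the total row count alongside
    select et.*,
           row_number() over (order by elapsed_time) as seqnum,
           count(*) over () as cnt
    from elapsed_times et
),
cutoff as (
    -- hypothetical helper: the smallest value that falls in the top 1%
    select min(elapsed_time) as max_elapsed_time
    from ranked
    where seqnum > 0.99 * cnt
),
user_times as (
    -- per user, take the largest time but never more than the cutoff
    select et.id,
           least(max(et.elapsed_time), max(c.max_elapsed_time)) as elapsed_time
    from elapsed_times et
    cross join cutoff c
    group by et.id
)
select avg(elapsed_time) from user_times;
```

The cutoff is computed once in its own single-row CTE and then cross-joined, instead of sitting in a scalar subquery inside the per-user aggregate as in the question. Whether this is faster on a given cluster is something to measure rather than assume.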

Regarding sql - Amazon Redshift SQL: remove top 1% of data as outliers, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30468566/
