
hadoop - In HiveQL, what is the most elegant/efficient way to compute an average when some data is implicitly missing?


In HiveQL, what is the most elegant and efficient way to compute an average when the data has "gaps" with implicitly repeated values in between? That is, consider a table with the following data:

+----------+----------+---------+
| Employee | Date     | Balance |
+----------+----------+---------+
| John     | 20181029 | 1800.2  |
| John     | 20181105 | 2937.74 |
| John     | 20181106 | 3000    |
| John     | 20181110 | 1500    |
| John     | 20181119 | -755.5  |
| John     | 20181120 | -800    |
| John     | 20181121 | 1200    |
| John     | 20181122 | -400    |
| John     | 20181123 | -900    |
| John     | 20181202 | -1300   |
+----------+----------+---------+

If I compute a simple average over the November rows, it returns ~722.78, but the average should also account for the days that are not shown, which carry the same balance as the previous record. For example, in the data above John has a balance of 1800.2 from 20181101 through 20181104.

Assuming the table always has at most one row per date/balance, and assuming I cannot change how this data is stored (and arguably shouldn't, since writing a row for each of many days on which the balance does not change would waste storage), I have been trying to get the average from a subquery that lists every day of the month, returning NULL for the absent days, and then use a CASE that walks back in reverse order to pick up the balance from the last available date, all to avoid writing to a temporary table.
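For concreteness, the carry-forward average being asked for can be sketched outside Hive. This is a minimal Python illustration using the sample balances from the question's table; `carry_forward_avg` is a hypothetical helper, not part of any Hive solution:

```python
from datetime import date, timedelta

# (effective_date, balance) rows from the question's table
rows = [
    (date(2018, 10, 29), 1800.2),
    (date(2018, 11, 5), 2937.74),
    (date(2018, 11, 6), 3000.0),
    (date(2018, 11, 10), 1500.0),
    (date(2018, 11, 19), -755.5),
    (date(2018, 11, 20), -800.0),
    (date(2018, 11, 21), 1200.0),
    (date(2018, 11, 22), -400.0),
    (date(2018, 11, 23), -900.0),
    (date(2018, 12, 2), -1300.0),
]

def carry_forward_avg(rows, first, last):
    """Average the balance over every day in [first, last],
    repeating the last known balance across the gaps."""
    lookup = dict(rows)
    balance, total, days = None, 0.0, 0
    d = min(min(lookup), first)  # start early enough to pick up a prior balance
    while d <= last:
        balance = lookup.get(d, balance)  # keep previous balance on missing days
        if d >= first:
            total += balance
            days += 1
        d += timedelta(days=1)
    return total / days

avg = carry_forward_avg(rows, date(2018, 11, 1), date(2018, 11, 30))
print(round(avg, 2))  # → 922.77, vs. ~722.78 for the naive row average
```

The gap between ~922.77 and ~722.78 is exactly the weight of the silent days that repeat the previous balance.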

Best Answer

Step 1: The raw data

The first step is to recreate a table with the raw data. Suppose the original table is called daily_employee_balance.


use default;
drop table if exists daily_employee_balance;
create table if not exists daily_employee_balance (
  employee_id string,
  employee string,
  iso_date date,
  balance double
);

Insert the sample data into the raw table daily_employee_balance:

insert into table daily_employee_balance values 
('103','John','2018-10-25',1800.2),
('103','John','2018-10-29',1125.7),
('103','John','2018-11-05',2937.74),
('103','John','2018-11-06',3000),
('103','John','2018-11-10',1500),
('103','John','2018-11-19',-755.5),
('103','John','2018-11-20',-800),
('103','John','2018-11-21',1200),
('103','John','2018-11-22',-400),
('103','John','2018-11-23',-900),
('103','John','2018-12-02',-1300);

Step 2: The dimension table

You will need a dimension table holding a calendar (a table containing every possible date); call it dimension_date. Calendar tables are a normal industry standard, and you can probably download sample data for one from the Internet.

use default;
drop table if exists dimension_date;
create external table dimension_date(
  date_id int,
  iso_date string,
  year string,
  month string,
  month_desc string,
  end_of_month_flg string
);

Insert some sample data for the whole month of November 2018:

insert into table dimension_date values
(6880,'2018-11-01','2018','2018-11','November','N'),
(6881,'2018-11-02','2018','2018-11','November','N'),
(6882,'2018-11-03','2018','2018-11','November','N'),
(6883,'2018-11-04','2018','2018-11','November','N'),
(6884,'2018-11-05','2018','2018-11','November','N'),
(6885,'2018-11-06','2018','2018-11','November','N'),
(6886,'2018-11-07','2018','2018-11','November','N'),
(6887,'2018-11-08','2018','2018-11','November','N'),
(6888,'2018-11-09','2018','2018-11','November','N'),
(6889,'2018-11-10','2018','2018-11','November','N'),
(6890,'2018-11-11','2018','2018-11','November','N'),
(6891,'2018-11-12','2018','2018-11','November','N'),
(6892,'2018-11-13','2018','2018-11','November','N'),
(6893,'2018-11-14','2018','2018-11','November','N'),
(6894,'2018-11-15','2018','2018-11','November','N'),
(6895,'2018-11-16','2018','2018-11','November','N'),
(6896,'2018-11-17','2018','2018-11','November','N'),
(6897,'2018-11-18','2018','2018-11','November','N'),
(6898,'2018-11-19','2018','2018-11','November','N'),
(6899,'2018-11-20','2018','2018-11','November','N'),
(6900,'2018-11-21','2018','2018-11','November','N'),
(6901,'2018-11-22','2018','2018-11','November','N'),
(6902,'2018-11-23','2018','2018-11','November','N'),
(6903,'2018-11-24','2018','2018-11','November','N'),
(6904,'2018-11-25','2018','2018-11','November','N'),
(6905,'2018-11-26','2018','2018-11','November','N'),
(6906,'2018-11-27','2018','2018-11','November','N'),
(6907,'2018-11-28','2018','2018-11','November','N'),
(6908,'2018-11-29','2018','2018-11','November','N'),
(6909,'2018-11-30','2018','2018-11','November','Y');
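If you would rather generate the calendar rows than download or type them, a short script can emit them. This is an illustrative Python sketch (the date_id seed 6880 is just chosen to match the sample values above):

```python
from datetime import date, timedelta
import calendar

def calendar_rows(year, month, first_date_id):
    """Yield (date_id, iso_date, year, month, month_desc, end_of_month_flg)
    tuples for every day of the given month."""
    last_day = calendar.monthrange(year, month)[1]  # number of days in the month
    for offset in range(last_day):
        d = date(year, month, 1) + timedelta(days=offset)
        yield (
            first_date_id + offset,
            d.isoformat(),
            str(year),
            f"{year}-{month:02d}",
            d.strftime("%B"),
            "Y" if d.day == last_day else "N",
        )

rows = list(calendar_rows(2018, 11, 6880))
print(rows[0])   # (6880, '2018-11-01', '2018', '2018-11', 'November', 'N')
print(rows[-1])  # (6909, '2018-11-30', '2018', '2018-11', 'November', 'Y')
```

The tuples can then be formatted into an INSERT statement, or written to a file and loaded into the external table's location.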

Step 3: The fact table

Create a fact table from the raw table. In normal practice you would ingest the data into HDFS/Hive, then process the raw data and build a table of historical data into which you keep inserting incrementally. You can dig into data warehousing for the proper definitions, but I will call it a fact table: f_employee_balance.

This recreates the original table with the missing dates included, filling each missing balance with the last previously known balance.

-- inner query builds every possible (employee, date) combination;
-- the outer self join fills in the missing dates and balances
drop table if exists f_employee_balance;
create table f_employee_balance
stored as orc tblproperties ("orc.compress"="SNAPPY") as
select
  q1.employee_id,
  q1.iso_date,
  nvl(last_value(r.balance, true)  -- true skips NULLs; nvl fills the initial dates with 0
      over (partition by q1.employee_id
            order by q1.iso_date
            rows between unbounded preceding and current row), 0) as balance,
  q1.month,
  q1.year
from (
  select distinct
    r.employee_id,
    d.iso_date as iso_date,
    d.month,
    d.year
  from daily_employee_balance r, dimension_date d
) q1
left outer join daily_employee_balance r
  on q1.employee_id = r.employee_id and q1.iso_date = r.iso_date;
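The key trick here is last_value(balance, true): the second argument makes the window function skip NULLs, so each densified row picks up the most recent non-NULL balance, and nvl(..., 0) covers dates before the first real record. The same fill logic, sketched in Python purely for illustration:

```python
def forward_fill(values, default=0.0):
    """Replace each None with the last non-None value seen, falling back to
    `default` before the first real value -- mirrors
    nvl(last_value(balance, true) over (... rows unbounded preceding ...), 0)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v  # remember the most recent real balance
        filled.append(last if last is not None else default)
    return filled

# one employee's balances ordered by date, None where no row exists
daily = [None, None, 2937.74, None, 1500.0]
print(forward_fill(daily))  # [0.0, 0.0, 2937.74, 2937.74, 1500.0]
```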

Step 4: Analytics

The query below gives you the true average per month:

select employee_id, monthly_avg, month, year
from (
  select
    employee_id,
    -- order by iso_date makes the row numbering deterministic
    row_number() over (partition by employee_id, year, month order by iso_date) as row_num,
    avg(balance) over (partition by employee_id, year, month) as monthly_avg,
    month,
    year
  from f_employee_balance
) q1
where row_num = 1
order by year, month;
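The row_number() filter simply keeps one row per (employee, year, month) so the windowed average is not repeated once per day; the result is the same as a plain GROUP BY average over the densified fact table. An equivalent check in Python, for illustration only (the sample rows below are hypothetical):

```python
from collections import defaultdict
from statistics import mean

# (employee_id, month, balance) rows as they might sit in f_employee_balance
fact = [
    ("103", "2018-11", 1800.2),
    ("103", "2018-11", 2937.74),
    ("103", "2018-11", 3000.0),
    ("103", "2018-12", -1300.0),
]

# group daily balances by (employee, month), then average each group
groups = defaultdict(list)
for emp, month, balance in fact:
    groups[(emp, month)].append(balance)

monthly_avg = {key: mean(vals) for key, vals in groups.items()}
print(monthly_avg[("103", "2018-11")])  # mean of the three November balances
```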

Step 5: Conclusion

You can combine steps 3 and 4 into one query, which saves you from creating the extra table. In the big-data world you usually do not need to worry about spending a little extra disk space or development time: you can easily add another disk or node and automate the process with a workflow. For more detail, look into data warehousing concepts and Hive analytic (windowing) queries.

Regarding "hadoop - In HiveQL, what is the most elegant/efficient way to compute an average when some data is implicitly missing?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54012272/
