gpt4 book ai didi

mysql - 数据库查询以获取Hive的Max,Min和其他列的对应值以及Hive中单个表的Total Record Count

转载 作者:行者123 更新时间:2023-12-02 22:02:06 25 4
gpt4 key购买 nike

我在Hive表名称中有以下数据集-发布

N.B. PUBLISH中可能有重复的记录

DATE    |HOUR|SOURCE|COL_TIMESTAMP              |ID
20200101|14 |A |2020-01-01 14:18:53.016 GMT|ID_111
20200101|14 |A |2020-01-01 14:18:53.012 GMT|ID_222
20200101|14 |A |2020-01-01 14:18:53.016 GMT|ID_111
20200101|14 |A |2020-01-01 14:18:53.019 GMT|ID_333
20200101|15 |C |2020-01-01 15:18:53.016 GMT|ID_444
20200102|00 |A |2020-01-01 15:18:53.016 GMT|ID_444

我想根据特定的日期,小时和来源生成以下输出
例如。对于( DATE=20200101HOUR=14SOURCE=A),输出应为:
DATE    |HOUR|SOURCE|MIN_TIMESTAMP              |START_ID|MAX_TIMESTAMP              |END_ID|RECORD_CNT
20200101|14 |A |2020-01-01 14:18:53.012 GMT|ID_222 |2020-01-01 14:18:53.019 GMT|ID_333|3

N.B.时间戳的末尾带有“GMT”。
我也试图使用Spark Java代码运行查询。
请提出一个 hive 查询,当数据量很大时,它将显示出良好的性能。

最佳答案

使用分析函数获取START_ID和LAST_ID,然后进行汇总:

with PUBLISH as ( --Use your_table instead of this CTE
select stack(6,
'20200101','14','A','2020-01-01 14:18:53.016 GMT','ID_111',
'20200101','14','A','2020-01-01 14:18:53.012 GMT','ID_222',
'20200101','14','A','2020-01-01 14:18:53.016 GMT','ID_111',
'20200101','14','A','2020-01-01 14:18:53.019 GMT','ID_333',
'20200101','15','C','2020-01-01 15:18:53.016 GMT','ID_444',
'20200102','00','A','2020-01-01 15:18:53.016 GMT','ID_444'
) as (DT, HOUR, SOURCE, COL_TIMESTAMP, ID)
)

select DT, HOUR, SOURCE,
min(COL_TIMESTAMP) as MIN_TIMESTAMP,
START_ID,
max(COL_TIMESTAMP) as MAX_TIMESTAMP,
END_ID,
sum(case when rn=1 then 1 else 0 end) as RECORD_CNT --unique records have rn=1
from
(
select DT, HOUR, SOURCE, COL_TIMESTAMP, ID,
first_value(ID) over(partition by DT, HOUR, SOURCE order by COL_TIMESTAMP) as START_ID,
first_value(ID) over(partition by DT, HOUR, SOURCE order by COL_TIMESTAMP desc) as END_ID,
row_number() over(partition by DT, HOUR, SOURCE, COL_TIMESTAMP, ID) as rn
from PUBLISH p
) s
group by DT, HOUR, SOURCE, START_ID, END_ID;

结果:
dt  hour    source  min_timestamp   start_id    max_timestamp   end_id  record_cnt
20200101 14 A 2020-01-01 14:18:53.012 GMT ID_222 2020-01-01 14:18:53.019 GMT ID_333 3
20200101 15 C 2020-01-01 15:18:53.016 GMT ID_444 2020-01-01 15:18:53.016 GMT ID_444 1
20200102 00 A 2020-01-01 15:18:53.016 GMT ID_444 2020-01-01 15:18:53.016 GMT ID_444 1

关于mysql - 数据库查询以获取Hive的Max,Min和其他列的对应值以及Hive中单个表的Total Record Count,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/60123925/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com