gpt4 book ai didi

hadoop - 如何优化我的配置单元查询以从多个表中查找记录计数总和

转载 作者:可可西里 更新时间:2023-11-01 16:06:30 28 4
gpt4 key购买 nike

我必须生成一份报告,该报告将为我提供表 A、B 和 C 中使用 Hive 存储的事件的计数总和,并且我的 S3 存储桶已按 Organization_id 分区

例如:表 A – 有约翰(和其他员工)每天上类的记录表 B – 记录了约翰(和其他员工)在工作中调用或接听的每个电话表 C – 记录了约翰(和其他员工)在工作中提交的每笔费用

基本上,我想要约翰 (employee_id) 在上个月来自 A、B 和 C 的计数总和。如果在 3 个表 A、B 或 C 中的任何一个中都有记录,则应该每个日期只有一个记录(如果一个或多个表中有一个日期的记录,则对计数求和)表)。所以我的输出是:

Employee id
Employee Name
Date
Count
123
John
02-Jan-2016
55
123
John
12-Jan-2016
88
123
John
19-Jan-2016
103

我提出的查询是:

select  adcts.employee_name, adcts.employee_id,Total_count as event_count, adcts.event_date  
from
(select coalesce(Evts.employee_id,imps.employee_id,AEvts.employee_id) as employee_id
, coalesce(Evts.employee_name,imps.employee_name,AEvts.employee_name) as employee_name
, coalesce(Evts.Event_count,0) + coalesce(Imps.Impression_count,0) + coalesce (AEvts.Event_Count,0)as Total_Count
, coalesce (Evts.event_date,imps.impression_date, AEvts.event_date) as event_date
from
(select employee_id, employee_name, count(*) as Event_count,event_date
from mm_events
where organization_id = 100048
and event_date between '2016-02-01' and '2016-02-04'
group by employee_id, employee_name,event_date) Evts
full outer join
(select employee_id, employee_name, count(*) as Impression_count, impression_date
from mm_impressions
where organization_id = 100048
and impression_date between '2016-02-01' and '2016-02-04'
group by employee_id, employee_name,impression_date) Imps
on Evts.employee_id = Imps.employee_id
full outer join
(select employee_id, employee_name, count(*) as Event_count,event_date
from mm_attributed_events
where organization_id = 100048
and event_date between '2016-02-01' and '2016-02-04'
and event_type = 'click'
group by employee_id, employee_name,event_date) AEvts
on AEvts.employee_id=Evts.employee_id
) adcts
join
(select distinct c.employee_id from default.t1_meta_dmp c
where c.employee_dmp_enabled='inherits'
and c.agency_dmp_enabled = 'inherits'
and c.agency_status='true'
and c.employee_status='true'
and c.organization_id = 100048) cc
on adcts.employee_id=cc.employee_id
order by adcts.employee_id asc

我有两个问题:

<强>1。我有正确的查询吗?2. 因为我使用的是“完全外部联接”,所以同一日期我得到了不止一个条目。有人可以建议一种更好的方法来实现结果吗?可能有不同的查询

最佳答案

对于同一 date,您将获得多个条目,因为您在子查询中按 date 分组,但仅通过 employee_id 加入它们。这就是为什么您的记录在加入后会重复的原因。您还应该将 event_date 添加到连接条件。

看来您根本不需要FULL JOIN。 Join 比 union all 更昂贵。使用 UNION ALL 从每个表中选择,然后 group by employee_name, employee_id, event_date 和 aggregate count() :

select employee_id, employee_name, sum(Event_count) as Total_Count , event_date 
from
(
select employee_id, employee_name, count(*) as Event_count, event_date from mm_events
where organization_id = 100048 and event_date between '2016-02-01' and '2016-02-04'
group by employee_id, employee_name, event_date

union all
select employee_id, employee_name, count(*) as Event_count, impression_date as event_date
from mm_impressions
where organization_id = 100048 and impression_date between '2016-02-01' and '2016-02-04'
group by employee_id, employee_name,impression_date

union all
select employee_id, employee_name, count(*) as Event_count,event_date
from mm_attributed_events
where organization_id = 100048 and event_date between '2016-02-01' and '2016-02-04' and event_type = 'click'
group by employee_id, employee_name, event_date
) adcts
group by employee_id, employee_name, event_date

将您的 join with cc 查询添加到上述查询中。

UNION ALL 中的所有子查询将并行运行

关于hadoop - 如何优化我的配置单元查询以从多个表中查找记录计数总和,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/36313801/

28 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com