gpt4 book ai didi

sql - Hive SQL多个左外部联接查询在其结果中缺少记录

转载 作者:行者123 更新时间:2023-12-02 20:47:46 25 4
gpt4 key购买 nike

我正在尝试联接7个表并将联接的数据插入到一个大联接表中,为此,我正在使用以下查询

INSERT OVERWRITE TABLE databaseName.joinTab PARTITION (tran_date)
SELECT <180 cols across all 7 tables>
FROM databaseName.table1 tab1
LEFT OUTER JOIN databaseName.table2 tab2 ON (tab1.id = tab2.id and
tab1.tran_date='20171030' and tab2.tran_date='20171030')
LEFT OUTER JOIN databaseName.table3 tab3 ON (tab1.id = tab3.id and
tab1.tran_date='20171030' and tab3.tran_date='20171030')
LEFT OUTER JOIN databaseName.table4 tab4 ON (tab1.id = tab4.id and
tab1.tran_date='20171030' and tab4.tran_date='20171030')
LEFT OUTER JOIN databaseName.table5 tab5 ON (tab1.id = tab5.id and
tab1.tran_date='20171030' and tab5.tran_date='20171030')
LEFT OUTER JOIN databaseName.table6 tab6 ON (tab1.id = tab6.id and
tab1.tran_date='20171030' and tab6.tran_date='20171030')
LEFT OUTER JOIN databaseName.table7 tab7 ON (tab1.id = tab7.id and
tab1.tran_date='20171030' and tab7.tran_date='20171030')
WHERE (tab1.tran_date='20171030');

tran_date是所有这些表的分区列,之所以我拥有where子句以及ON语句中的条件,是因为我发现启动的tez作业将对table1进行全表扫描。

所以我的问题是,如果我在tran_date = 20171030上对table1进行了count(*)处理,则结果为11845917

如果我从同一个分区tran_date = 20171030的新联接表(joinTab)中进行计数(*),我只会得到97609的结果,差异非常大,因为我使用的是左外部联接,我曾认为应该将所有数据从table1移到联接表中,并为其他表中未包含的任何内容填充null。我应该提到joinTab中的tran_date是从加载table1数据时派生的

这里有什么看起来不合适的东西吗?

谢谢你的帮助

最佳答案

我无法测试此解决方案是否有效,因为您尚未提供可重复的示例,但是您可以尝试执行以下操作:

WITH tab1_temp AS (SELECT <tab1 cols> WHERE tab1.tran_date='20171030'
)
INSERT OVERWRITE TABLE databaseName.joinTab PARTITION (tran_date)
SELECT <180 cols across all 7 tables>
FROM tab1_temp
LEFT OUTER JOIN databaseName.table2 tab2 ON (tab1.id = tab2.id and
tab1.tran_date='20171030' and tab2.tran_date='20171030')
LEFT OUTER JOIN databaseName.table3 tab3 ON (tab1.id = tab3.id and
tab1.tran_date='20171030' and tab3.tran_date='20171030')
LEFT OUTER JOIN databaseName.table4 tab4 ON (tab1.id = tab4.id and
tab1.tran_date='20171030' and tab4.tran_date='20171030')
LEFT OUTER JOIN databaseName.table5 tab5 ON (tab1.id = tab5.id and
tab1.tran_date='20171030' and tab5.tran_date='20171030')
LEFT OUTER JOIN databaseName.table6 tab6 ON (tab1.id = tab6.id and
tab1.tran_date='20171030' and tab6.tran_date='20171030')
LEFT OUTER JOIN databaseName.table7 tab7 ON (tab1.id = tab7.id and
tab1.tran_date='20171030' and tab7.tran_date='20171030')
;

关于sql - Hive SQL多个左外部联接查询在其结果中缺少记录,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/47218278/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com