
Hive table with multiple partitions

Reposted. Author: 行者123. Updated: 2023-12-03 09:07:26

I have a table (data_table) with multiple partition columns: year/month/monthkey.

The directory layout looks like year=2017/month=08/monthkey=2017-08/files.parquet
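A table with that directory layout would typically be declared with the partition columns in the same order. A minimal sketch (the non-partition columns are assumed, since the question does not list them):

```sql
-- Hypothetical DDL matching the layout year=.../month=.../monthkey=.../files.parquet
create external table data_table (
  id bigint  -- data columns assumed; only the partition spec matters for pruning
)
partitioned by (year string, month string, monthkey string)
stored as parquet;
```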

Which of the following queries will be faster?

select count(*) from data_table where monthkey='2017-08'

select count(*) from data_table where monthkey='2017-08' and year='2017' and month='08'

I assumed that in the first case Hadoop would need more initial time to locate the required directories, but I wanted to confirm.

Best answer

Finding the relevant partitions is a metastore operation, not a file-system operation.
It is done by querying the metastore, not by scanning directories.
The metastore query for the first use case is likely to be faster than for the second, but either way we are talking about fractions of a second.

Demo

create external table t100k(i int)
partitioned by (x int,y int,xy string)
;
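For the demo table to return anything, the metastore needs at least one registered partition. A hedged example, reusing the partition values that the filters below match on:

```sql
-- Register one partition so the metastore has a row for the filters to match
alter table t100k add partition (x=100, y=1000, xy='100-1000');
```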

explain dependency select count(*) from t100k where xy='100-1000';

Query issued against the metastore:

select "PARTITIONS"."PART_ID" 
from "PARTITIONS"
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = 't100k'
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = 'local_db'
inner join "PARTITION_KEY_VALS" "FILTER2" on "FILTER2"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER2"."INTEGER_IDX" = 2
where (("FILTER2"."PART_KEY_VAL" = '100-1000'))


explain dependency select count(*) from t100k where x=100 and y=1000 and xy='100-1000';

Query issued against the metastore:

select "PARTITIONS"."PART_ID" 
from "PARTITIONS"
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = 't100k'
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = 'local_db'
inner join "PARTITION_KEY_VALS" "FILTER0" on "FILTER0"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER0"."INTEGER_IDX" = 0
inner join "PARTITION_KEY_VALS" "FILTER1" on "FILTER1"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER1"."INTEGER_IDX" = 1
inner join "PARTITION_KEY_VALS" "FILTER2" on "FILTER2"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER2"."INTEGER_IDX" = 2
where ( ( (((case when "FILTER0"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' then cast("FILTER0"."PART_KEY_VAL" as decimal(21,0)) else null end) = 100)
and ((case when "FILTER1"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' then cast("FILTER1"."PART_KEY_VAL" as decimal(21,0)) else null end) = 1000))
and ("FILTER2"."PART_KEY_VAL" = '100-1000')) )

Regarding Hive tables with multiple partitions, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45955952/
