
hadoop - Bucketing not working in Hive


I have a bucketed column, but even after setting all the relevant parameters I am not seeing any performance benefit. Below are the query I am running, the buckets I created, and the explain plan output.

select count(*)
from bigtable_main a
inner join big_cnt10000 b
  on a.srrecordid = b.srrecordid;
-- ~112 seconds

ALTER TABLE bigtable_main CLUSTERED BY (srrecordid) SORTED BY (srrecordid) INTO 40 BUCKETS;
ALTER TABLE big_cnt10000 CLUSTERED BY (srrecordid) SORTED BY (srrecordid) INTO 40 BUCKETS;

-- still ~112 seconds after the ALTER
---------------------------------------------------
SET hive.enforce.bucketing=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.auto.convert.sortmerge.join=true;

Even the explain plan is the same. Any ideas?
Vertex dependency in root stage
Map 1 <- Map 3 (BROADCAST_EDGE)
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Reducer 2
         File Output Operator [FS_13]
            compressed:false
            Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Group By Operator [GBY_11]
            |  aggregations:["count(VALUE._col0)"]
            |  outputColumnNames:["_col0"]
            |  Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
            |<-Map 1 [SIMPLE_EDGE]
               Reduce Output Operator [RS_10]
                  sort order:
                  Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  value expressions:_col0 (type: bigint)
                  Group By Operator [GBY_9]
                     aggregations:["count()"]
                     outputColumnNames:["_col0"]
                     Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                     Select Operator [SEL_8]
                        Statistics:Num rows: 31669970 Data size: 3166997036 Basic stats: COMPLETE Column stats: NONE
                        Filter Operator [FIL_16]
                           predicate:(_col0 = _col11) (type: boolean)
                           Statistics:Num rows: 31669970 Data size: 3166997036 Basic stats: COMPLETE Column stats: NONE
                           Map Join Operator [MAPJOIN_19]
                           |  condition map:[{"":"Inner Join 0 to 1"}]
                           |  HybridGraceHashJoin:true
                           |  keys:{"Map 3":"srrecordid (type: string)","Map 1":"srrecordid (type: string)"}
                           |  outputColumnNames:["_col0","_col11"]
                           |  Statistics:Num rows: 63339940 Data size: 6333994073 Basic stats: COMPLETE Column stats: NONE
                           |<-Map 3 [BROADCAST_EDGE]
                           |  Reduce Output Operator [RS_5]
                           |     key expressions:srrecordid (type: string)
                           |     Map-reduce partition columns:srrecordid (type: string)
                           |     sort order:+
                           |     Statistics:Num rows: 42529 Data size: 4252905 Basic stats: COMPLETE Column stats: NONE
                           |     Filter Operator [FIL_18]
                           |        predicate:srrecordid is not null (type: boolean)
                           |        Statistics:Num rows: 42529 Data size: 4252905 Basic stats: COMPLETE Column stats: NONE
                           |        TableScan [TS_1]
                           |           alias:b
                           |           Statistics:Num rows: 85058 Data size: 8505810 Basic stats: COMPLETE Column stats: NONE
                           |<-Filter Operator [FIL_17]
                                 predicate:srrecordid is not null (type: boolean)
                                 Statistics:Num rows: 57581763 Data size: 5758176306 Basic stats: COMPLETE Column stats: NONE
                                 TableScan [TS_0]
                                    alias:a
                                    Statistics:Num rows: 115163525 Data size: 11516352512 Basic stats: COMPLETE Column stats: NONE

Best Answer

The Hive compiler needs metadata, and that metadata drives the execution plan. From the Hive Design documentation:

The compiler needs the metadata, so it sends a getMetaData request and receives the sendMetaData response from the MetaStore.

This metadata is used to typecheck the expressions in the query tree as well as to prune partitions based on query predicates. The plan generated by the compiler is a DAG of stages, with each stage being either a map/reduce job, a metadata operation, or an operation on HDFS. For map/reduce stages, the plan contains map operator trees (operator trees that are executed on the mappers) and a reduce operator tree (for operations that need reducers).

An ALTER TABLE ... CLUSTERED BY statement changes only the table's declared storage properties in the metastore; it does not rewrite the existing data files into buckets.
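One quick way to see this mismatch (a minimal check; the warehouse path below is only an example, so substitute the Location that DESCRIBE FORMATTED reports for your table):

-- Metadata now reports "Num Buckets: 40" and the CLUSTERED BY column
DESCRIBE FORMATTED bigtable_main;

-- But the table directory still contains the original, un-bucketed files
dfs -ls /apps/hive/warehouse/bigtable_main;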

To get real buckets, drop and re-create the tables with the bucketing declared in CREATE TABLE, then reload the data (see the sketch after the note below).

The Hive DDL documentation calls this out explicitly:

NOTE: These commands will only modify Hive's metadata, and will NOT reorganize or reformat existing data. Users should make sure the actual data layout conforms with the metadata definition.
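A minimal sketch of that rebuild, using the table and column names from the question; the new table name and the column-list placeholder are illustrative only:

-- Declare bucketing at creation time so the layout is enforced on write
CREATE TABLE bigtable_main_bucketed (
    srrecordid STRING
    -- ...remaining columns of bigtable_main...
)
CLUSTERED BY (srrecordid) SORTED BY (srrecordid) INTO 40 BUCKETS;

-- Make Hive hash-distribute and sort rows into one file per bucket on insert
-- (always on in Hive 2.0+; needed explicitly on older versions)
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

-- Reload the data; this physically reorganizes it into the 40 buckets
INSERT OVERWRITE TABLE bigtable_main_bucketed
SELECT * FROM bigtable_main;

Do the same for big_cnt10000 and rerun the join. Once both tables are genuinely bucketed and sorted on srrecordid, the settings from the question give the planner a chance to replace the broadcast hash join shown in the explain plan with a bucket map join or sort-merge bucket join.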

Regarding "hadoop - Bucketing not working in Hive", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38862647/
