
hadoop - Hive query execution plan


Here is my Hive query:

Insert into schemaB.employee partition(year) 
select * from schemaA.employee;
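
(Side note for anyone reproducing this: an INSERT that fills the partition column dynamically, as above, usually needs dynamic partitioning enabled for the session. A minimal sketch of the typical settings; your cluster defaults may already cover this.)

-- Sketch: typical session settings for a dynamic-partition INSERT;
-- nonstrict mode because no static value is supplied for partition(year).
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;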

Below is the execution plan that Hive produces for this query.

hive> explain <query>;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: employee
            Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: Col1 (type: binary), col2 (type: binary), col3 (type: array<string>), year (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3
              Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col3 (type: int)
                sort order: +
                Map-reduce partition columns: _col3 (type: int)
                Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: binary), _col1 (type: binary), _col2 (type: array<string>), _col3 (type: int)
      Reduce Operator Tree:
        Extract
          Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: true
            Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                name: schemaB.employee

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            year
          replace: false
          table:
              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: schemaB.employee

  Stage: Stage-2
    Stats-Aggr Operator

I have two questions about this execution plan:

  1. Why does the query plan contain a reduce step? As I understand it, all this query has to do is copy data from one HDFS location to another, which a map-only job could handle. Is the reduce step related to the partitions on the table?
  2. What is the Stats-Aggr Operator step in Stage-2? I could not find any documentation that explains it.

Best Answer

This answers both questions: statistics are gathered automatically by default, and that is what the reduce step is needed for.

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Statistics

hive.stats.autogather

Default Value: true 

Added In: Hive 0.7 with HIVE-1361

A flag to gather statistics automatically during the INSERT OVERWRITE command.
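
A quick way to see how much of the plan is tied to automatic stats collection is to turn the flag off for the session and re-run EXPLAIN. This is only a sketch, assuming you are allowed to change the setting; whether the extra stage disappears depends on your Hive version:

-- Sketch: disable automatic stats gathering for this session only,
-- then compare the resulting plan with the one shown above.
set hive.stats.autogather=false;
explain
insert into schemaB.employee partition(year)
select * from schemaA.employee;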

Regarding "hadoop - Hive query execution plan", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43448218/
