
hadoop - Hive COUNT performance and EXPLAIN output

Reposted. Author: 行者123. Updated: 2023-12-02 21:58:46

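I have 2 tables.

The first is stored as ORC, is partitioned by {year,month,day,type}, and has ~60 million rows.

The second uses TextInputFormat, is partitioned by {date,type}, and has ~300 million rows.

When I run "SELECT COUNT(*)" on each table, the first one takes several minutes to return a result.

Its explain plan is: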

Plan not optimized by CBO.
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2 vectorized
File Output Operator [FS_107648]
compressed:true
Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
table:{"serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe","input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"}
Group By Operator [OP_107647]
| aggregations:["count(VALUE._col0)"]
| outputColumnNames:["_col0"]
| Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
|<-Map 1 [SIMPLE_EDGE] vectorized
Reduce Output Operator [RS_107641]
sort order:
Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
value expressions:_col0 (type: bigint)
Group By Operator [OP_107646]
aggregations:["count()"]
outputColumnNames:["_col0"]
Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
Select Operator [OP_107645]
Statistics:Num rows: 64930697 Data size: 158452219904 Basic stats: PARTIAL Column stats: NONE
TableScan [TS_107638]
alias:mytable
Statistics:Num rows: 64930697 Data size: 158452219904 Basic stats: PARTIAL Column stats: NONE

When I run the same query on the second table, it returns a result in less than 5 seconds...

Its explain plan is:
Plan not optimized by CBO.
Stage-0
Fetch Operator
limit:1

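My guess is that either the partitioning or the file format is involved...
Does anyone understand this behavior and can explain it to me? :)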

Best answer

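You are running with hive.compute.query.using.stats=true.
Your second query simply fetches the row count from the metastore, which is why its plan contains only a Fetch Operator.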
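One way to check this explanation is to turn the setting off for the session and look at the plan again. This is a minimal sketch: `second_table` is a hypothetical name for the TextInputFormat table, which the question does not name. With the setting off, the plan would be expected to show a real TableScan instead of a lone Fetch Operator.

```sql
-- Print the current value of the setting for this session.
SET hive.compute.query.using.stats;

-- Disable stats-based answers for this session only.
SET hive.compute.query.using.stats=false;

-- second_table is a hypothetical name for the second (text) table.
-- The plan should now include a TableScan and Group By operators.
EXPLAIN SELECT COUNT(*) FROM second_table;

-- Restore the behavior described in the answer.
SET hive.compute.query.using.stats=true;
```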

hive.compute.query.using.stats
Default Value: false
Added In: Hive 0.13.0 with HIVE-5483
When set to true Hive will answer a few queries like min, max, and count(1) purely using statistics stored in the metastore. For basic statistics collection, set the configuration property hive.stats.autogather to true. For more advanced statistics collection, run ANALYZE TABLE queries.
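The plan for the first table reports "Basic stats: PARTIAL", which suggests that some of its partitions have no row counts in the metastore, so Hive probably cannot answer COUNT(*) from statistics there and has to scan. A sketch of how the statistics could be gathered, assuming the table is called `mytable` (the alias shown in the plan) and that the partition columns are exactly those listed in the question:

```sql
-- Gather basic statistics (including row counts) for every partition.
-- Listing partition columns without values covers all partitions.
-- Do not add NOSCAN here: NOSCAN collects only file counts and sizes.
ANALYZE TABLE mytable PARTITION(year, month, day, type) COMPUTE STATISTICS;

-- Keep statistics up to date for data written by later INSERTs.
SET hive.stats.autogather=true;

-- Allow simple aggregates to be answered from the metastore.
SET hive.compute.query.using.stats=true;

-- If the statistics are now complete, this plan may reduce to a
-- single Fetch Operator, as it does for the second table.
EXPLAIN SELECT COUNT(*) FROM mytable;
```

The ANALYZE statement itself scans the data once, so on a table of this size it may take about as long as the original COUNT(*) did.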

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Statistics

The original question, "hadoop - Hive COUNT performance and EXPLAIN output", is on Stack Overflow: https://stackoverflow.com/questions/44929293/
