
hadoop - Marking a Hive table as replicated/small


Is it possible to tell Hive that a given table is "small", i.e. that it should be replicated to all nodes and operated on in RAM?

Best Answer

Try the following hint:

/*+ MAPJOIN(small_table) */  
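For example, a minimal sketch of the hint in a full query (the table and column names are assumptions for illustration, not from the original answer):

SELECT /*+ MAPJOIN(small_table) */ b.id, s.label
FROM big_table b
JOIN small_table s ON (b.id = s.id);

The hint tells Hive to load small_table into an in-memory hashtable and stream big_table through it, so the join completes without a reduce phase.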

UPDATE: By the way, there are other options, such as the sort-merge-bucket (SMB) join. However, they require changes to the input tables: both tables must be bucketed (and sorted) on the join column.
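As a hedged illustration of what such a change involves, here is a minimal sketch; the settings are standard Hive configuration properties, but the table names, schemas, and bucket count are assumptions:

-- both tables must be bucketed and sorted on the join column,
-- with the same number of buckets
SET hive.enforce.bucketing = true;

CREATE TABLE big_table_bucketed (id INT, payload STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;

CREATE TABLE small_table_bucketed (id INT, label STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;

-- enable the sort-merge-bucket map join at query time
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;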

Here is some information from the Hortonworks documentation on the limitations/capabilities of map joins:

HortonWorks docs on Mapside join optimizations

For convenience, here is the excerpt about MAPJOINs:
MAPJOINs are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as they are streamed through.

Local work:
- read records via standard table scan (includes filters and projections) from source on local machine
- build hashtable in memory
- write hashtable to local disk
- upload hashtable to DFS
- add hashtable to distributed cache

Map task:
- read hashtable from local disk (distributed cache) into memory
- match records' keys against hashtable
- combine matches and write to output

No reduce task
Limitations of Current Implementation

The current MAPJOIN implementation has the following limitations:

The mapjoin operator can only handle one key at a time; that is, it can perform a multi-table join, but only if all the tables are joined on the same key. (Typical star schema joins do not fall into this category.)
Hints are cumbersome for users to apply correctly and auto conversion doesn't have enough logic to consistently predict if a MAPJOIN will fit into memory or not.
A chain of MAPJOINs is not coalesced into a single map-only job, unless the query is written as a cascading sequence of mapjoin(table, subquery(mapjoin(table, subquery....). Auto conversion will never produce a single map-only job.
The hashtable for the mapjoin operator has to be generated for each run of the query, which involves downloading all the data to the Hive client machine as well as uploading the generated hashtable files.
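Because hand-written hints are cumbersome (the second limitation above), Hive can also convert joins to map joins automatically based on table size. A minimal sketch of the relevant settings, reusing the illustrative table names from above (25000000 bytes is the documented default threshold):

SET hive.auto.convert.join = true;
-- tables smaller than this many bytes are candidates for a map join
SET hive.mapjoin.smalltable.filesize = 25000000;

SELECT b.id, s.label
FROM big_table b
JOIN small_table s ON (b.id = s.id);

With these set, no hint is needed; Hive decides at plan time whether the small table fits in memory.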

Regarding hadoop - marking a Hive table as replicated/small, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/21193312/
