
hadoop - Splitting a SequenceFile in a controlled way - Hadoop


Hadoop writes SequenceFiles as key-value pairs (records). Suppose we have a large, unbounded log file. Hadoop will split the file according to the block size and store the blocks on multiple DataNodes. Is it guaranteed that every key-value pair lies entirely within a single block? Or could we end up in a situation where the key sits in one block on node 1 while the value (or part of it) sits in a second block on node 2? If splits can fall in meaningless places, what is the solution? Sync markers?

A second question: does Hadoop write the sync markers automatically, or do we have to write them ourselves?

Best answer

I asked this question on the Hadoop mailing list, and they replied:

Sync markers are written into sequence files already, they are part of the format. This is nothing to worry about - and is simple enough to test and be confident about. The mechanism is same as reading a text file with newlines - the reader will ensure reading off the boundary data in order to complete a record if it has to.
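The writer side of this is easy to see in code. Below is a minimal sketch (assuming the Hadoop 2.x SequenceFile.createWriter API; the output path is hypothetical) that appends records and relies entirely on the writer to emit sync markers, which it does automatically every couple of kilobytes of output:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/tmp/demo.seq"); // hypothetical output path

            // The writer inserts a sync marker on its own roughly every 2 KB
            // of output; the caller does nothing to make the file splittable.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class))) {
                IntWritable key = new IntWritable();
                Text value = new Text();
                for (int i = 0; i < 100000; i++) {
                    key.set(i);
                    value.set("log line " + i);
                    writer.append(key, value); // one record per append
                }
            }
        }
    }

SequenceFile.Writer also exposes an explicit sync() method for placing a marker by hand, but as the reply says, the automatic markers are already part of the format and are what make the file splittable.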

Then I asked:

So if we have a map job analysing only the second block of the log file, it should not need to transfer any other parts of the file from other nodes, because that part is a stand-alone, meaningful split? Am I right?

They replied:

Yes. Simply put, your records shall never break. We do not read just at the split boundaries, we may extend beyond boundaries until a sync marker is encountered in order to complete a record or series of records. The subsequent mappers will always skip until their first sync marker, and then begin reading - to avoid duplication. This is exactly how text file reading works as well -- only here, it is newlines.
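To make that concrete, here is a minimal sketch of the reader-side logic (again assuming the Hadoop 2.x API; the path and split boundaries are hypothetical). It approximates what Hadoop's SequenceFileRecordReader does for one input split: seek to the first sync marker at or after the split start, then read whole records until the position passes the split end:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileSplitReadDemo {
        // Print only the records belonging to the byte range [start, end),
        // the way a single mapper would for its input split.
        static void readSplit(Configuration conf, Path path,
                              long start, long end) throws Exception {
            try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                    SequenceFile.Reader.file(path))) {
                IntWritable key = new IntWritable();
                Text value = new Text();

                // Skip ahead to the first sync marker at or after `start`,
                // so reading never begins in the middle of a record.
                reader.sync(start);

                // A record that starts before `end` belongs to this split even
                // if its bytes run past `end`; the reader transparently pulls
                // those trailing bytes from the next block.
                while (reader.getPosition() < end && reader.next(key, value)) {
                    System.out.println(key + "\t" + value);
                }
            }
        }
    }

The stock SequenceFileRecordReader uses syncSeen() to decide exactly where to stop, but the effect is the same as above: every record is read by exactly one mapper, and none is read twice.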

Regarding hadoop - Splitting a SequenceFile in a controlled way - Hadoop, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/8405671/
