hadoop - Input splits and blocks in Hadoop

My file is 100 MB and the default block size is 64 MB. If I do not configure an input split size, the default split size is the block size, so the split size is also 64 MB.

When I load this 100 MB file into HDFS, it will be divided into 2 blocks: 64 MB and 36 MB. For example, suppose the file is the 100 MB song lyric below. If I load this data into HDFS and line 1 up to half of line 16 comes to exactly 64 MB, that becomes one split/block (up to "It made the"), and the remaining half of line 16 ("children laugh and play") through to the end of the file becomes the second block (36 MB). There will be two map tasks.

My question is: how does the first mapper handle line 16 of block 1, given that the block contains only half of that line? And how does the second mapper handle the first line of block 2, given that it, too, is only half a line?

Mary had a little lamb
Little lamb, little lamb
Mary had a little lamb
Its fleece was white as snow
And everywhere that Mary went
Mary went, Mary went
Everywhere that Mary went
The lamb was sure to go

He followed her to school one day
School one day, school one day
He followed her to school one day
Which was against the rule
It made the children laugh and play
Laugh and play, laugh and play
It made the children laugh and play
To see a lamb at school

And so the teacher turned him out
Turned him out, turned him out
And so the teacher turned him out
But still he lingered near
And waited patiently
Patiently, patiently
And wai-aited patiently
Til Mary did appear

Or, when splitting at 64 MB, does Hadoop keep the whole of line 16 together rather than splitting a single line?

Best Answer

In Hadoop, data is read according to both the input split size and the block size.
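When no split size is configured, the split size falls back to the block size: Hadoop's FileInputFormat derives it as max(minSize, min(maxSize, blockSize)). The little demo below is just a sketch that replays that calculation with the numbers from the question; the minSize/maxSize values are the assumed defaults, and the class is mine, not Hadoop's.

public class SplitSizeDemo {
    // Mirrors the logic of FileInputFormat.computeSplitSize(blockSize, minSize, maxSize).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L << 20;       // 64 MB HDFS block size
        long minSize   = 1L;              // assumed default split.minsize
        long maxSize   = Long.MAX_VALUE;  // assumed default split.maxsize
        long fileSize  = 100L << 20;      // the 100 MB file from the question

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        long numSplits = (fileSize + splitSize - 1) / splitSize;
        System.out.println(splitSize >> 20); // 64 -> split size equals block size
        System.out.println(numSplits);       // 2  -> two splits, two map tasks
    }
}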

  • The file is divided into FileSplits based on its size. Each input split is initialized with a start parameter corresponding to its offset in the input.

  • When the LineRecordReader is initialized, it tries to instantiate a LineReader, which starts reading lines.

  • If a CompressionCodec is defined, it takes care of the boundaries. Otherwise, if the start of the InputSplit is not 0, the reader backtracks by 1 character and then skips the first line (reading up to the next \n or \r\n). The backtrack guarantees that no valid line is skipped.

The relevant code from LineRecordReader looks like this:

if (codec != null) {
  // Compressed input cannot be split: this reader consumes the whole file.
  in = new LineReader(codec.createInputStream(fileIn), job);
  end = Long.MAX_VALUE;
} else {
  if (start != 0) {
    // The split does not begin at the start of the file: back up one byte
    // and skip the first (possibly partial) line, which belongs to the
    // previous split.
    skipFirstLine = true;
    --start;
    fileIn.seek(start);
  }
  in = new LineReader(fileIn, job);
}
if (skipFirstLine) { // skip first line and re-establish "start".
  start += in.readLine(new Text(), 0,
      (int) Math.min((long) Integer.MAX_VALUE, end - start));
}
this.pos = start;

Since the splits are computed on the client side, the mappers do not need to run in sequence: every mapper already knows whether it needs to discard its first line or not.

So, in your case, the first block, B1, will read the data from offset 0 up to and including the line "It made the children laugh and play".

Block B2 will read the data from the line "To see a lamb at school" through to the offset of the last line.
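To make the mechanics concrete, here is a toy simulation, plain Java rather than Hadoop code, of the two rules described above: a reader whose split does not start at offset 0 skips ahead past the first (partial) line, and every reader finishes the line it has started even when that runs past its split's end. It assumes the split boundary falls mid-line, as in the question, and the boundary offset below is made up purely for illustration.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitSimulation {
    // Emits the lines owned by the split [start, end): skips the partial
    // first line when start != 0, and reads past 'end' to finish a line.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // The previous split owns whatever line we landed in.
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the '\n'
        }
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // a line that merely *starts* before 'end' is read in full
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "It made the children laugh and play\nTo see a lamb at school\n";
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        int boundary = 12; // pretend the block boundary falls after "It made the "
        System.out.println("split 1: " + readSplit(data, 0, boundary));
        System.out.println("split 2: " + readSplit(data, boundary, data.length));
    }
}

Running it prints the full first line for split 1 and "To see a lamb at school" for split 2: no half-lines, no duplicates, exactly the behavior the answer describes.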

For further reading, see:

https://hadoopabcd.wordpress.com/2015/03/10/hdfs-file-block-and-input-split/
How does Hadoop process records split across block boundaries?

Regarding input splits and blocks in Hadoop, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37065242/
