
hadoop - Setting the RecordReader context to a paragraph in Hadoop MapReduce

Reposted. Author: 行者123. Updated: 2023-12-02 21:55:05

I want to write my own RecordReader that returns an entire paragraph as its value, rather than a single line as TextInputFormat does.

I tried the following function, but it is definitely way off:

public boolean nextKeyValue() throws IOException, InterruptedException {
    if (key == null) {
        key = new LongWritable();
    }
    key.set(pos);
    if (value == null) {
        value = new Text();
    }
    value.clear();
    final Text endline = new Text("\n");
    int newSize = 0;

    Text v = new Text();
    while (v != endline) {
        value.append(v.getBytes(), 0, v.getLength());
        value.append(endline.getBytes(), 0, endline.getLength());
        if (newSize == 0) {
            break;
        }
        pos += newSize;
        if (newSize < maxLineLength) {
            break;
        }
    }
    if (newSize == 0) {
        key = null;
        value = null;
        return false;
    } else {
        return true;
    }
}

Best answer

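Actually, you don't need to go to the trouble of writing your own RecordReader. Instead, just extend TextInputFormat and change the delimiter. Here is the TextInputFormat library code with only the delimiter changed: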

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import com.google.common.base.Charsets;

public class ParagraphInputFormat extends TextInputFormat {
    // A blank line in a file with Windows (CRLF) line endings.
    // For files with Unix (LF) line endings, "\n\n" would be needed instead.
    private static final String PARAGRAPH_DELIMITER = "\r\n\r\n";

    // Keep each file in a single split (simple, but gives up parallelism within a file).
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    // Return the stock LineRecordReader, but with the paragraph delimiter
    // in place of the default line terminator.
    @Override
    public RecordReader<LongWritable, Text>
            createRecordReader(InputSplit split, TaskAttemptContext context) {
        String delimiter = PARAGRAPH_DELIMITER;
        byte[] recordDelimiterBytes = null;
        if (null != delimiter) {
            recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
        }
        return new LineRecordReader(recordDelimiterBytes);
    }
}
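
To see what a record delimiter of "\r\n\r\n" means for the records a mapper would receive, here is a minimal plain-Java sketch that needs no Hadoop on the classpath. It is only a simplified model of delimiter-based splitting, not Hadoop's actual LineRecordReader code; the class name ParagraphSplitDemo and the method splitRecords are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hadoop.
class ParagraphSplitDemo {

    /**
     * Cuts the text at every occurrence of the delimiter and returns the
     * pieces in order. The delimiter itself is not included in any piece.
     */
    public static List<String> splitRecords(String text, String delimiter) {
        List<String> records = new ArrayList<>();
        int start = 0;
        int hit;
        while ((hit = text.indexOf(delimiter, start)) >= 0) {
            records.add(text.substring(start, hit));
            start = hit + delimiter.length();
        }
        // Whatever follows the last delimiter is the final record.
        if (start < text.length()) {
            records.add(text.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        String crlf = "first line\r\nsecond line\r\n\r\nnext paragraph\r\n";
        String lf = "first line\nsecond line\n\nnext paragraph\n";

        // CRLF input with the CRLF delimiter: two paragraphs.
        System.out.println(splitRecords(crlf, "\r\n\r\n").size()); // prints 2
        // LF input with the CRLF delimiter: nothing matches, one big record.
        System.out.println(splitRecords(lf, "\r\n\r\n").size());   // prints 1
        // LF input with an LF delimiter: two paragraphs again.
        System.out.println(splitRecords(lf, "\n\n").size());       // prints 2
    }
}
```

The middle case is the one to watch for: if the input files use Unix line endings, the delimiter in ParagraphInputFormat never matches and each file arrives as one record. In a job driver, the custom format would be selected with job.setInputFormatClass(ParagraphInputFormat.class).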

Regarding "hadoop - Setting the RecordReader context to a paragraph in Hadoop MapReduce", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15593601/
