
java - Implementation of CombineFileInputFormat for Hadoop 0.20.205


Can someone point me to an implementation of CombineFileInputFormat (org.apache.hadoop.mapred.lib.CombineFileInputFormat) for Hadoop 0.20.205? The goal is to create large splits from very small log files (text, one record per line) using EMR.

Surprisingly, Hadoop does not ship a default implementation of this class for exactly this purpose, and judging from a Google search I am not the only one confused by that. I need to compile the class and bundle it in a jar for hadoop-streaming, and with my limited knowledge of Java this is something of a challenge.

Edit: I have already tried the yetitrails example, with the necessary imports, but I get a compiler error for the next method.

Best Answer

Here is an implementation I have for you:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

@SuppressWarnings("deprecation")
public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @SuppressWarnings({ "unchecked", "rawtypes" })
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter) throws IOException {
        // CombineFileRecordReader hands each file in the combined split to a
        // fresh instance of the reader class passed in here.
        return new CombineFileRecordReader(conf, (CombineFileSplit) split, reporter, (Class) myCombineFileRecordReader.class);
    }

    public static class myCombineFileRecordReader implements RecordReader<LongWritable, Text> {
        private final LineRecordReader linerecord;

        public myCombineFileRecordReader(CombineFileSplit split, Configuration conf, Reporter reporter, Integer index) throws IOException {
            // 'index' identifies which file of the combined split this reader
            // handles; build a plain FileSplit for it and read it line by line.
            FileSplit filesplit = new FileSplit(split.getPath(index), split.getOffset(index), split.getLength(index), split.getLocations());
            linerecord = new LineRecordReader(conf, filesplit);
        }

        @Override
        public void close() throws IOException {
            linerecord.close();
        }

        // The remaining methods simply delegate to the wrapped LineRecordReader.

        @Override
        public LongWritable createKey() {
            return linerecord.createKey();
        }

        @Override
        public Text createValue() {
            return linerecord.createValue();
        }

        @Override
        public long getPos() throws IOException {
            return linerecord.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return linerecord.getProgress();
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            return linerecord.next(key, value);
        }
    }
}

In your job, first set the parameter mapred.max.split.size according to the size you want the input files to be combined into, then do something like the following in your run():

...
if (argument != null) {
    conf.set("mapred.max.split.size", argument);
} else {
    conf.set("mapred.max.split.size", "134217728"); // 128 MB
}
...

conf.setInputFormat(CombinedInputFormat.class);
...
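For context, here is a minimal sketch of a complete driver that wires the format in, assuming an identity map/reduce job on the old (mapred) API. The class name CombineDemo and the hard-coded split size are hypothetical; only the mapred.max.split.size and setInputFormat calls come from the answer above.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

@SuppressWarnings("deprecation")
public class CombineDemo extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), CombineDemo.class);
        conf.setJobName("combine-small-files");

        // Cap combined splits at 128 MB and use the custom input format.
        conf.set("mapred.max.split.size", "134217728");
        conf.setInputFormat(CombinedInputFormat.class);

        // Identity mapper/reducer: the job only demonstrates the input format.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new CombineDemo(), args));
    }
}

If you are driving the job through hadoop-streaming instead of a Java driver, the compiled class should be usable via the streaming jar's -inputformat option, with the jar that contains it shipped through -libjars.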

Regarding java - Implementation of CombineFileInputFormat for Hadoop 0.20.205, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/14270317/
