- android - RelativeLayout 背景可绘制重叠内容
- android - 如何链接 cpufeatures lib 以获取 native android 库?
- java - OnItemClickListener 不起作用,但 OnLongItemClickListener 在自定义 ListView 中起作用
- java - Android 文件转字符串
我有很大的不。对于小文件,我想使用 CombineFileInputFormat 来合并文件,以便每个文件数据都作为我的 MR 作业中的单个记录。我关注了http://yaseminavcular.blogspot.in/2011/03/many-small-input-files.html并尝试将其转换为新的 api
我面临两个问题:
a) 我只是用 2 个小文件测试它,仍然有 2 个映射器被触发。我预计 1
b) 每行都作为单个记录出现,我希望整个文件作为单个记录。
这可能很痛苦,但请查看下面的代码。我还是hadoop的小白
驱动类
public class MRDriver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
FileSystem fs = new Path(".").getFileSystem(getConf());
fs.printStatistics();
Job job = new Job(getConf());
job.setJobName("Enron MR");
job.setMapperClass(EnronMailReadMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setNumReduceTasks(0);
job.setJarByClass(EnronMailReadMapper.class);
RawCombineFileInputFormat.addInputPath(job, new Path(args[0]));
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 :1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new MRDriver(), args);
System.exit(exitCode);
}
下面的类主要是 LineRecordReader 的复制粘贴,修改了 initialize() 和 nextKeyValue() 函数
public class SingleFileRecordReader extends RecordReader<LongWritable, Text> {
private static final Log LOG = LogFactory.getLog(SingleFileRecordReader.class);
private long start;
private long pos;
private long end;
private LineReader in;
private int maxLineLength;
private LongWritable key = null;
private Text value = null;
public void initialize(InputSplit genericSplit,
TaskAttemptContext context) throws IOException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
// open the file and seek to the start of the split
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());
fileIn.seek(start);
in = new LineReader(fileIn, job);
// If this is not the first split, we always throw away first record
// because we always (except the last split) read one extra line in
// next() method.
if (start != 0) {
start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;
}
private int maxBytesToConsume(long pos) {
return (int) Math.min(Integer.MAX_VALUE, end - pos);
}
private long getFilePosition() throws IOException {
long retVal= pos;
return retVal;
}
public boolean nextKeyValue() throws IOException {
if (key == null) {
key = new LongWritable();
}
key.set(pos);
if (value == null) {
value = new Text();
}
int newSize = 0;
StringBuffer totalValue = new StringBuffer();
// We always read one extra line, which lies outside the upper
// split limit i.e. (end - 1)
while (getFilePosition() <= end) {
newSize = in.readLine(value, maxLineLength,
Math.max(maxBytesToConsume(pos), maxLineLength));
if (newSize == 0) {
break;
}
totalValue.append(value.toString()+"\n");
pos += newSize;
if (newSize < maxLineLength) {
break;
}
// line too long. try again
LOG.info("Skipped line of size " + newSize + " at pos " +
(pos - newSize));
}
if (newSize == 0) {
key = null;
value = null;
return false;
} else {
value = new Text(totalValue.toString());
return true;
}
}
@Override
public LongWritable getCurrentKey() {
return key;
}
@Override
public Text getCurrentValue() {
return value;
}
/**
* Get the progress within the split
*/
public float getProgress() throws IOException {
if (start == end) {
return 0.0f;
} else {
return Math.min(1.0f,
(getFilePosition() - start) / (float)(end - start));
}
}
public synchronized void close() throws IOException {
try {
if (in != null) {
in.close();
}
} finally {
}
}
其他文件
public class RawCombineFileInputFormat extends CombineFileInputFormat <LongWritable,Text>{
@Override
public RecordReader<LongWritable, Text> createRecordReader(
InputSplit split, TaskAttemptContext context) throws IOException {
return new CombineFileRecordReader< LongWritable, Text >((CombineFileSplit) split, context, MultiFileRecordReader.class);
}
和
public class MultiFileRecordReader extends RecordReader < LongWritable, Text > {
private CombineFileSplit split;
private TaskAttemptContext context;
private int index;
private RecordReader< LongWritable, Text > rr;
public MultiFileRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index) {
this.split = split;
this.context = context;
this.index = index;
this.rr = new SingleFileRecordReader();
}
@Override
public void initialize(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException {
this.split = (CombineFileSplit) split;
this.context = context;
if (null == rr) {
rr = new SingleFileRecordReader();
}
FileSplit fileSplit = new FileSplit(this.split.getPath(index),
this.split.getOffset(index),
this.split.getLength(index),
this.split.getLocations());
this.rr.initialize(fileSplit, this.context);
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
// TODO Auto-generated method stub
return this.rr.nextKeyValue();
}
@Override
public LongWritable getCurrentKey() throws IOException, InterruptedException {
// TODO Auto-generated method stub
return this.rr.getCurrentKey();
}
@Override
public Text getCurrentValue() throws IOException, InterruptedException {
// TODO Auto-generated method stub
return this.rr.getCurrentValue();
}
@Override
public float getProgress() throws IOException, InterruptedException {
// TODO Auto-generated method stub
return this.rr.getProgress();
}
@Override
public void close() throws IOException {
if (rr != null) {
rr.close();
rr = null;
}
}
最佳答案
查看此输入格式。这是一种用于在单个映射任务中读取多个文件的输入格式。传递给映射器的每条记录都将读取一个(未拆分的)文件。 WholeFileRecordReader 负责将一个文件内容作为一个值发送。返回的键是 NullWritable,值是每个文件的整体内容。现在您可以使用它并运行您的 mapreduce 作业并查看实际运行了多少映射器并检查您获得的输出是否正确。
记录是从 WholeFileRecordReader 构建的。
public class WholeFileInputFormat extends CombineFileInputFormat<NullWritable, Text>{
@Override
protected boolean isSplitable(JobContext context, Path file) {
return false;
}
/**
* Creates a CombineFileRecordReader to read each file assigned to this InputSplit.
* Note, that unlike ordinary InputSplits, split must be a CombineFileSplit, and therefore
* is expected to specify multiple files.
*
* @param split The InputSplit to read. Throws an IllegalArgumentException if this is
* not a CombineFileSplit.
* @param context The context for this task.
* @return a CombineFileRecordReader to process each file in split.
* It will read each file with a WholeFileRecordReader.
* @throws IOException if there is an error.
*/
@Override
public RecordReader<NullWritable, Text> createRecordReader(
InputSplit split, TaskAttemptContext context) throws IOException {
if (!(split instanceof CombineFileSplit)) {
throw new IllegalArgumentException("split must be a CombineFileSplit");
}
return new CombineFileRecordReader<NullWritable, Text>((CombineFileSplit) split, context, WholeFileRecordReader.class);
}
}
在上面你可以使用 WholeFileRecordReader,如下所示:-
public class WholeFileRecordReader extends RecordReader<NullWritable, Text> {
private static final Logger LOG = Logger.getLogger(WholeFileRecordReader.class);
/** The path to the file to read. */
private final Path mFileToRead;
/** The length of this file. */
private final long mFileLength;
/** The Configuration. */
private final Configuration mConf;
/** Whether this FileSplit has been processed. */
private boolean mProcessed;
/** Single Text to store the file name of the current file. */
// private final Text mFileName;
/** Single Text to store the value of this file (the value) when it is read. */
private final Text mFileText;
/**
* Implementation detail: This constructor is built to be called via
* reflection from within CombineFileRecordReader.
*
* @param fileSplit The CombineFileSplit that this will read from.
* @param context The context for this task.
* @param pathToProcess The path index from the CombineFileSplit to process in this record.
*/
public WholeFileRecordReader(CombineFileSplit fileSplit, TaskAttemptContext context,
Integer pathToProcess) {
mProcessed = false;
mFileToRead = fileSplit.getPath(pathToProcess);
mFileLength = fileSplit.getLength(pathToProcess);
mConf = context.getConfiguration();
assert 0 == fileSplit.getOffset(pathToProcess);
if (LOG.isDebugEnabled()) {
LOG.debug("FileToRead is: " + mFileToRead.toString());
LOG.debug("Processing path " + pathToProcess + " out of " + fileSplit.getNumPaths());
try {
FileSystem fs = FileSystem.get(mConf);
assert fs.getFileStatus(mFileToRead).getLen() == mFileLength;
} catch (IOException ioe) {
// oh well, I was just testing.
}
}
// mFileName = new Text();
mFileText = new Text();
}
/** {@inheritDoc} */
@Override
public void close() throws IOException {
mFileText.clear();
}
/**
* Returns the absolute path to the current file.
*
* @return The absolute path to the current file.
* @throws IOException never.
* @throws InterruptedException never.
*/
@Override
public NullWritable getCurrentKey() throws IOException, InterruptedException {
return NullWritable.get();
}
/**
* <p>Returns the current value. If the file has been read with a call to NextKeyValue(),
* this returns the contents of the file as a BytesWritable. Otherwise, it returns an
* empty BytesWritable.</p>
*
* <p>Throws an IllegalStateException if initialize() is not called first.</p>
*
* @return A BytesWritable containing the contents of the file to read.
* @throws IOException never.
* @throws InterruptedException never.
*/
@Override
public Text getCurrentValue() throws IOException, InterruptedException {
return mFileText;
}
/**
* Returns whether the file has been processed or not. Since only one record
* will be generated for a file, progress will be 0.0 if it has not been processed,
* and 1.0 if it has.
*
* @return 0.0 if the file has not been processed. 1.0 if it has.
* @throws IOException never.
* @throws InterruptedException never.
*/
@Override
public float getProgress() throws IOException, InterruptedException {
return (mProcessed) ? (float) 1.0 : (float) 0.0;
}
/**
* All of the internal state is already set on instantiation. This is a no-op.
*
* @param split The InputSplit to read. Unused.
* @param context The context for this task. Unused.
* @throws IOException never.
* @throws InterruptedException never.
*/
@Override
public void initialize(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException {
// no-op.
}
/**
* <p>If the file has not already been read, this reads it into memory, so that a call
* to getCurrentValue() will return the entire contents of this file as Text,
* and getCurrentKey() will return the qualified path to this file as Text. Then, returns
* true. If it has already been read, then returns false without updating any internal state.</p>
*
* @return Whether the file was read or not.
* @throws IOException if there is an error reading the file.
* @throws InterruptedException if there is an error.
*/
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
if (!mProcessed) {
if (mFileLength > (long) Integer.MAX_VALUE) {
throw new IOException("File is longer than Integer.MAX_VALUE.");
}
byte[] contents = new byte[(int) mFileLength];
FileSystem fs = mFileToRead.getFileSystem(mConf);
FSDataInputStream in = null;
try {
// Set the contents of this file.
in = fs.open(mFileToRead);
IOUtils.readFully(in, contents, 0, contents.length);
mFileText.set(contents, 0, contents.length);
} finally {
IOUtils.closeStream(in);
}
mProcessed = true;
return true;
}
return false;
}
}
以下是您的驱动程序代码:-
public int run(String[] arg) throws Exception {
Configuration conf=getConf();
FileSystem fs = FileSystem.get(conf);
//estimate reducers
Job job = new Job(conf);
job.setJarByClass(WholeFileDriver.class);
job.setJobName("WholeFile");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(WholeFileInputFormat.class);
job.setMapperClass(WholeFileMapper.class);
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job, new Path(arg[0]));
Path output=new Path(arg[1]);
try {
fs.delete(output, true);
} catch (IOException e) {
LOG.warn("Failed to delete temporary path", e);
}
FileOutputFormat.setOutputPath(job, output);
boolean ret=job.waitForCompletion(true);
if(!ret){
throw new Exception("Job Failed");
}
关于java - 在hadoop中将文件作为单个记录读取,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/17875277/
这个问题在这里已经有了答案: How to initialize var? (11 个答案) 关闭 8 年前。 我想给一个变量赋初值 null,并在下一个 if-else block 中赋值,但是编
我正在使用 TypeScript 3.8 编写 JS 和 TS 混合的代码。我写了以下行: export * as Easing from './easing'; 应该是 fair game在 Typ
我需要将 R 代码中的“/”更改为“\”。我有这样的事情: tmp <- paste(getwd(),"tmp.xls",sep="/") 所以我的 tmp是 c:/Study/tmp.xls 我希望
我有个问题。例如我有这个: id truth count 1 1 1 2 1 2 3 0 0 4 1 1 5 1 2 6 1
我正在尝试使用“IN”和“=”来查找一些 bean。我目前正在使用此代码: $ids = array(1,2,3,4); $user = 1; $things = R::find( 'thing'
是否可以在 Xcode 中部署到其他人的手机上?我没有 iPhone,但我想测试我在 friend 手机上制作的应用程序。在我支付 99 美元之前,我想确保这不会造成麻烦。 谢谢。 最佳答案 不会有任
我试图得到一个非常大的数字(超过 unsigned long long int )。所以我把它作为一个字符串,然后一个数字一个数字地转换成整数并使用它。 #include #include int
我在 Rust 中有 C 语言库的绑定(bind),但它们并不完整。 在 C 代码中,我定义了一个简化的宏,如下所示: #define MY_MACROS1(PTR) (((my_struct1
我正在努力解决这个问题。 http://jsfiddle.net/yhcqfy44/ 动画应该自动相对于 滚动到顶部每次出现滚动条时的高度。 我已经写了这个,但没有运气: var hheight =
我正在处理一个将数字作为字符串返回的 JSON API。例如 "12" ,但是,该字段值也可以是非数字的,例如:"-" . 我已将 JSON 数据解析为映射,我想将此字段提取为 elixir 中的整数
我正在尝试编写一个类,将.wav文件转换为.aiff文件作为项目的一部分。 我遇到了几个库Alvas.Audio(http://alvas.net/alvas.audio,overview.aspx)
我想在 Lucene 中将像“New York”这样的“复合词”索引为单个术语,而不是像“new”、“york”那样。这样,如果有人搜索“new place”,则包含“new york”的文档将不会匹
我希望这个解释能让我更好地了解使用宏的优点。 最佳答案 在函数中,所有参数在调用之前都会被评估。 这意味着 or 作为函数不能是惰性的,而宏可以将 or 重写为 if 语句,该语句仅在以下情况下计算分
我有一些看起来像这样的 XML foo ]]> (注意 > 登录 "> foo")和 XSLT 样式表 当我运行xsltproc stylesheet.xs
当我尝试将 Any 转换为 List 时,如下面的示例所示,我得到“Unchecked cast: Any!”到列表'警告。有没有解决此类问题的方法? val x: List = objectOfTy
我正在使用 Python 开发一个简单的爬虫。目的是创建一个 sitemap.xml。(你可以在这里找到真正的 alpha 版本:http://code.google.com/p/sitemappy/
我想知道在 VBScript 中是否可以在多行中中断 If 语句。喜欢: If (UCase(Trim(objSheet.Cells(i, a).Value)) = "YES") Or _ (UCas
for (String item : someList) { System.out.println(item); } 使用“do while”是否等效? 谢谢。 最佳答案 如果列表为空,f
这个问题已经有答案了: 已关闭10 年前。 Possible Duplicate: Split string with delimiters in C 在 C 中将“,”分隔的列表拆分为数组的最佳方法
我有一个如下所示的字符数组: [0, 10, 20, 30, 670] 如何将此字符串转换为整数数组? 这是我的数组 int i=0; size_t dim = 1; char* array = (c
我是一名优秀的程序员,十分优秀!