gpt4 book ai didi

FileInputFormat,其中文件名是 KEY,文本内容是 VALUE

转载 作者:可可西里 更新时间:2023-11-01 14:23:39 26 4
gpt4 key购买 nike

我想将整个文件用作 MAP 处理的单个记录,文件名作为键。
我已阅读以下帖子:How to get Filename/File Contents as key/value input for MAP when running a Hadoop MapReduce Job?
虽然最佳答案的理论是可靠的,但实际上没有提供代码或“操作方法”。

这是我自定义的 FileInputFormat 和相应的 RecordReader,它们编译,但不产生任何记录数据。
谢谢你的帮助。

public class CommentsInput
extends FileInputFormat<Text,Text> {
protected boolean isSplitable(FileSystem fs, Path filename)
{
return false;
}
@Override
public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx)
throws IOException, InterruptedException {
return new CommentFileRecordReader((FileSplit) split, ctx.getConfiguration());
}

////////////////////////

public class CommentFileRecordReader
extends RecordReader<Text,Text> {
private InputStream in;
private long start;
private long length;
private long position;
private Text key;
private Text value;
private boolean processed;
private FileSplit fileSplit;
private Configuration conf;

public CommentFileRecordReader(FileSplit fileSplit, Configuration conf) throws IOException
{
this.fileSplit = fileSplit;
this.conf=conf;
}

/** Boilerplate initialization code for file input streams. */
@Override
public void initialize(InputSplit split,
TaskAttemptContext context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();

fileSplit = (FileSplit) split;
this.start = fileSplit.getStart();
this.length = fileSplit.getLength();
this.position = 0;
this.processed = false;

Path path = fileSplit.getPath();
FileSystem fs = path.getFileSystem(conf);
FSDataInputStream in = fs.open(path);

CompressionCodecFactory codecs = new CompressionCodecFactory(conf);
CompressionCodec codec = codecs.getCodec(path);
if (codec != null)
this.in = codec.createInputStream(in);
else
this.in = in;

// If using Writables:
// key = new Text();
// value = new Text();
}
public boolean next(Text key, Text value) throws IOException
{
if(!processed)
{
key = new Text(fileSplit.getPath().toString());
Path file = fileSplit.getPath();
FileSystem fs = file.getFileSystem(conf);
FSDataInputStream in = null;
byte[] contents = new byte[(int) fileSplit.getLength()];
try
{
in = fs.open(file);
IOUtils.readFully(in, contents, 0, contents.length);
value.set(contents.toString());
}
finally
{
IOUtils.closeStream(in);
}
processed = true;
return true;
}
return false;
}

@Override
public boolean nextKeyValue() throws IOException {
// TODO parse the next key value, update position and return true.
return false;
}

@Override
public Text getCurrentKey() {
return key;
}

@Override
public Text getCurrentValue() {
return value;
}

/** Returns our progress within the split, as a float between 0 and 1. */
@Override
public float getProgress() {
if (length == 0)
return 0.0f;
return Math.min(1.0f, position / (float)length);
}

@Override
public void close() throws IOException {
if (in != null)
in.close();
}
}

最佳答案

您需要找到一种方法来定义您自己的 key 类并确保您的类使用它。您可以查看如何定义您自己的 key 类,您可以通过在其路径上调用 hte getName() 方法获取文件名,然后使用它来制作您的 key 。

关于FileInputFormat,其中文件名是 KEY,文本内容是 VALUE,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/5888256/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com