
java - How to parse PDF files in a map reduce program?


I want to parse PDF files in my hadoop 2.2.0 program and I found this; following what it says, so far I have these three classes:

  1. PDFWordCount: the main class containing the map and reduce functions (just like the native hadoop wordcount example, but I used my PDFInputFormat class instead of TextInputFormat).
  2. PDFRecordReader extends RecordReader<LongWritable, Text>: this is where the main work is done. In particular, I put my initialize function here for more explanation:

    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
        throws IOException, InterruptedException {
      System.out.println("initialize");
      System.out.println(genericSplit.toString());
      FileSplit split = (FileSplit) genericSplit;
      System.out.println("filesplit conversion has been done");
      final Path file = split.getPath();
      Configuration conf = context.getConfiguration();
      conf.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
      FileSystem fs = file.getFileSystem(conf);
      System.out.println("fs has been opened");
      start = split.getStart();
      end = start + split.getLength();
      System.out.println("going to open split");
      FSDataInputStream filein = fs.open(split.getPath());
      System.out.println("going to load pdf");
      PDDocument pd = PDDocument.load(filein);
      System.out.println("pdf has been loaded");
      PDFTextStripper stripper = new PDFTextStripper();
      in = new LineReader(new ByteArrayInputStream(
          stripper.getText(pd).getBytes("UTF-8")));
      start = 0;
      this.pos = start;
      System.out.println("init has finished");
    }

    (You can see my System.out.println calls for debugging.) This method fails in converting genericSplit to FileSplit. The last thing I see in the console is:

    hdfs://localhost:9000/in:0+9396432

    which is just genericSplit.toString().

  3. PDFInputFormat extends FileInputFormat<LongWritable, Text>: this just creates a new PDFRecordReader in its createRecordReader method.

I want to know what my mistake is.

Do I need an extra class or something?

Best Answer

Reading PDFs is not that difficult; you need to extend the class FileInputFormat as well as RecordReader. The FileInputFormat must not split PDF files, since they are binary files.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class PDFInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        return new PDFLineRecordReader();
    }

    // Never split PDF files, even if larger than the HDFS block size
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}
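
To plug this input format into a job, the driver only needs to set it in place of the default TextInputFormat. Below is a minimal sketch of my own, not part of the original answer (PDFJobDriver is a hypothetical name); since no mapper or reducer is set, Hadoop's identity Mapper and Reducer pass the <Text, Text> pairs straight through, so the job simply writes out the extracted PDF lines.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical minimal driver for the custom PDF input format
public class PDFJobDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "pdf line dump");
        job.setJarByClass(PDFJobDriver.class);

        // The crucial line: use the custom PDF input format instead of TextInputFormat
        job.setInputFormatClass(PDFInputFormat.class);

        // No mapper/reducer set: the identity Mapper and Reducer pass <Text, Text> through
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}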

The RecordReader then performs the reading itself (I am using PDFBox to read the PDFs).

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper; // PDFBox 1.x; moved to org.apache.pdfbox.text in 2.x

public class PDFLineRecordReader extends RecordReader<Text, Text> {

    private Text key = new Text();
    private Text value = new Text();
    private int currentLine = 0;
    private List<String> lines = null;

    private PDDocument doc = null;
    private PDFTextStripper textStripper = null;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {

        FileSplit fileSplit = (FileSplit) split;
        final Path file = fileSplit.getPath();

        Configuration conf = context.getConfiguration();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream filein = fs.open(fileSplit.getPath());

        if (filein != null) {

            doc = PDDocument.load(filein);

            // Could the PDF be read?
            if (doc != null) {
                textStripper = new PDFTextStripper();
                String text = textStripper.getText(doc);

                lines = Arrays.asList(text.split(System.lineSeparator()));
                currentLine = 0;
            }
        }
    }

    // Returning false ends the reading process
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {

        if (key == null) {
            key = new Text();
        }

        if (value == null) {
            value = new Text();
        }

        if (lines != null && currentLine < lines.size()) {
            String line = lines.get(currentLine);

            // The line text goes into the key; the value stays empty
            key.set(line);
            value.set("");
            currentLine++;

            return true;
        } else {
            // All lines read? -> end
            key = null;
            value = null;
            return false;
        }
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return (lines == null || lines.isEmpty())
                ? 0.0f : (float) currentLine / lines.size();
    }

    @Override
    public void close() throws IOException {
        // Close the document when done
        if (doc != null) {
            doc.close();
        }
    }
}
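
Since this RecordReader puts each PDF line into the key and leaves the value empty, a mapper built on top of it receives <Text, Text> pairs rather than the usual <LongWritable, Text>. As a sketch under that assumption (PDFLineMapper is my own name, not from the answer), a word-count style mapper could look like this:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper for the <line, ""> pairs produced by PDFLineRecordReader
public class PDFLineMapper extends Mapper<Text, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // The line text arrives in the key; the value is empty
        StringTokenizer tokenizer = new StringTokenizer(key.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}

From there, a summing reducer such as Hadoop's stock IntSumReducer finishes the word count unchanged.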

Hope this helps!

Regarding java - How to parse PDF files in a map reduce program?, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/20758956/
