
java - Text coordinates when stripping from PDFBox

Reposted. Author: 搜寻专家. Updated: 2023-11-01 02:37:07

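I am trying to extract text with coordinates from a PDF file using PDFBox.

I combined several methods and pieces of information found on the internet (including Stack Overflow), but the coordinates I get do not seem to be right. For example, when I use them to draw a rectangle on top of the text, the rectangle is painted somewhere else.

This is my code (please don't judge the style; it was written quickly just to test):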

TextLine.java

    import java.util.List;
    import org.apache.pdfbox.text.TextPosition;

    /**
     *
     * @author samue
     */
    public class TextLine {
        public List<TextPosition> textPositions = null;
        public String text = "";
    }

myStripper.java

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    /*
     * To change this license header, choose License Headers in Project Properties.
     * To change this template file, choose Tools | Templates
     * and open the template in the editor.
     */

    /**
     *
     * @author samue
     */
    public class myStripper extends PDFTextStripper {
        public myStripper() throws IOException
        {
        }

        @Override
        protected void startPage(PDPage page) throws IOException
        {
            startOfLine = true;
            super.startPage(page);
        }

        @Override
        protected void writeLineSeparator() throws IOException
        {
            startOfLine = true;
            super.writeLineSeparator();
        }

        @Override
        public String getText(PDDocument doc) throws IOException
        {
            lines = new ArrayList<TextLine>();
            return super.getText(doc);
        }

        @Override
        protected void writeWordSeparator() throws IOException
        {
            TextLine tmpline = null;

            tmpline = lines.get(lines.size() - 1);
            tmpline.text += getWordSeparator();

            super.writeWordSeparator();
        }


        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            TextLine tmpline = null;

            if (startOfLine) {
                tmpline = new TextLine();
                tmpline.text = text;
                tmpline.textPositions = textPositions;
                lines.add(tmpline);
            } else {
                tmpline = lines.get(lines.size() - 1);
                tmpline.text += text;
                tmpline.textPositions.addAll(textPositions);
            }

            if (startOfLine)
            {
                startOfLine = false;
            }
            super.writeString(text, textPositions);
        }

        boolean startOfLine = true;
        public ArrayList<TextLine> lines = null;

    }

Click event on an AWT button

    private void jButton1MouseClicked(java.awt.event.MouseEvent evt) {
        // TODO add your handling code here:
        try {
            File file = new File("C:\\Users\\samue\\Desktop\\mwb_I_201711.pdf");
            PDDocument doc = PDDocument.load(file);

            myStripper stripper = new myStripper();

            stripper.setStartPage(1); // fix it to first page just to test it
            stripper.setEndPage(1);
            stripper.getText(doc);

            TextLine line = stripper.lines.get(1); // the line i want to paint on

            float minx = -1;
            float maxx = -1;

            for (TextPosition pos: line.textPositions)
            {
                if (pos == null)
                    continue;

                if (minx == -1 || pos.getTextMatrix().getTranslateX() < minx) {
                    minx = pos.getTextMatrix().getTranslateX();
                }
                if (maxx == -1 || pos.getTextMatrix().getTranslateX() > maxx) {
                    maxx = pos.getTextMatrix().getTranslateX();
                }
            }

            TextPosition firstPosition = line.textPositions.get(0);
            TextPosition lastPosition = line.textPositions.get(line.textPositions.size() - 1);

            float x = minx;
            float y = firstPosition.getTextMatrix().getTranslateY();
            float w = (maxx - minx) + lastPosition.getWidth();
            float h = lastPosition.getHeightDir();

            PDPageContentStream contentStream = new PDPageContentStream(doc, doc.getPage(0), PDPageContentStream.AppendMode.APPEND, false);

            contentStream.setNonStrokingColor(Color.RED);
            contentStream.addRect(x, y, w, h);
            contentStream.fill();
            contentStream.close();

            File fileout = new File("C:\\Users\\samue\\Desktop\\pdfbox.pdf");
            doc.save(fileout);
            doc.close();
        } catch (Exception ex) {

        }
    }

Any suggestions? What am I doing wrong?

Best answer

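This is just another case of the excessive PDFTextStripper coordinate normalization. Like you, I had thought that using TextPosition.getTextMatrix() (instead of getX() and getY()) would give the actual coordinates. It does not: even these matrix values have to be corrected (at least in PDFBox 2.0.x; I haven't checked 1.8.x), because the matrix is multiplied by a translation that makes the lower left corner of the crop box the origin.

Thus, in your case (where the lower left corner of the crop box is not the origin), you have to correct the values, e.g. by replacing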

    float x = minx;
    float y = firstPosition.getTextMatrix().getTranslateY();

with

    PDRectangle cropBox = doc.getPage(0).getCropBox();

    float x = minx + cropBox.getLowerLeftX();
    float y = firstPosition.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY();

Instead of

[Image: without correction]

you now get

[Image: with x,y correction]
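The correction can also be looked at as plain arithmetic, independent of PDFBox. The following is a minimal sketch using only `java.awt.geom`; the class `CropBoxCorrection`, its method names, and the numeric values are hypothetical and chosen for illustration. It assumes the normalization is a pure translation by the crop box's lower left corner, as described above, and does not account for page rotation.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical helper, not part of PDFBox: models the crop-box normalization
// as a pure translation and shows how to undo it.
class CropBoxCorrection {

    /** The normalization: moves the crop box's lower left corner to (0, 0). */
    static AffineTransform normalization(double cropLowerLeftX, double cropLowerLeftY) {
        return AffineTransform.getTranslateInstance(-cropLowerLeftX, -cropLowerLeftY);
    }

    /** Undoes the normalization by adding the crop-box offset back. */
    static Point2D toPageSpace(double x, double y,
                               double cropLowerLeftX, double cropLowerLeftY) {
        return new Point2D.Double(x + cropLowerLeftX, y + cropLowerLeftY);
    }

    public static void main(String[] args) {
        // Assumed example values: crop box lower left at (36, 48),
        // glyph origin at (136, 248) in page coordinates.
        AffineTransform norm = normalization(36, 48);
        Point2D normalized = norm.transform(new Point2D.Double(136, 248), null);
        Point2D restored = toPageSpace(normalized.getX(), normalized.getY(), 36, 48);
        System.out.println("normalized: " + normalized);
        System.out.println("restored:   " + restored);
    }
}
```

In the actual program the two offsets would come from `cropBox.getLowerLeftX()` and `cropBox.getLowerLeftY()`; when the crop box already starts at the origin, the correction is a no-op.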

Obviously, though, you will also have to correct the height somewhat. This is due to the way PDFTextStripper determines the text height:

    // 1/2 the bbox is used as the height todo: why?
    float glyphHeight = bbox.getHeight() / 2;

(from showGlyph(...) in LegacyPDFStreamEngine, the parent class of PDFTextStripper)

While the font bounding box is indeed usually too large, half of it is often not enough.
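To get a feeling for how far the half-bounding-box heuristic can be off, here is a small arithmetic sketch. The class `GlyphHeightEstimate`, its methods, and all font metrics in it are hypothetical. It assumes a font whose glyph space uses 1000 units per text-space unit (typical for non-Type 3 fonts) and a text matrix that scales only by the font size. Using ascent minus descent is just one possible alternative estimate, not something PDFBox does here.

```java
// Hypothetical helper, not part of PDFBox: compares two ways of estimating
// a text height from font metrics given in glyph-space units.
class GlyphHeightEstimate {

    // Assumption: 1000 glyph-space units per text-space unit.
    private static final float GLYPH_UNITS = 1000f;

    /** The heuristic quoted above: half the font bounding box height. */
    static float halfBBoxHeight(float bboxHeight, float fontSize) {
        return bboxHeight / 2f / GLYPH_UNITS * fontSize;
    }

    /** One alternative estimate: ascent minus descent (descent is usually negative). */
    static float ascentDescentHeight(float ascent, float descent, float fontSize) {
        return (ascent - descent) / GLYPH_UNITS * fontSize;
    }

    public static void main(String[] args) {
        // Made-up metrics for illustration only.
        float fontSize = 10f;
        System.out.println("half bbox:      " + halfBBoxHeight(1200f, fontSize));
        System.out.println("ascent-descent: " + ascentDescentHeight(900f, -200f, fontSize));
    }
}
```

Which estimate fits best depends on the font and on whether the rectangle should cover descenders, so the numbers are worth checking against the rendered page.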

Regarding java - Text coordinates when stripping from PDFBox, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46080131/
