
java - How to determine the position of the actual PDF content with PDFBox?

Reposted. Author: 塔克拉玛干. Updated: 2023-11-02 19:07:21

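We are printing some PDFs from a Java desktop application using PDFBox, and the PDFs contain far too much whitespace (unfortunately, fixing the PDF generator is not an option).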

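The problem I have is determining where the actual content sits on the page, because the crop/media/trim/art/bleed boxes are useless. Is there a simple and efficient way that is better or faster than rendering the page to an image and checking which pixels stay white?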


Best answer

As you mentioned in a comment that

it can be assumed that there is no background or other elements that would need special handling,

I'll show a basic solution without any such special handling.

A basic bounding box finder

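To find the bounding box without actually rendering to a bitmap and inspecting the bitmap pixels, one has to scan all the instructions of the page's content streams and of any XObjects referenced from there. One determines the bounding box of whatever each instruction draws and finally combines these boxes into a single box.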

The simple box finder presented here combines them by simply returning the bounding box of their union.
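The union step can be illustrated on its own with plain `java.awt.geom` types. The following is only a minimal sketch: the class name `BoxUnion` and its `union` method are hypothetical and not part of PDFBox. It also shows why the first box is copied rather than added to an empty rectangle.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone illustration; the names here are not from PDFBox.
final class BoxUnion {
    /** Returns the bounding box of the union of all boxes, or null if there are none. */
    static Rectangle2D union(List<? extends Rectangle2D> boxes) {
        Rectangle2D result = null;
        for (Rectangle2D box : boxes) {
            if (result == null) {
                // Seed with a copy of the first box. Adding to an empty
                // new Rectangle2D.Double() instead would drag the origin (0, 0)
                // into the union.
                result = new Rectangle2D.Double();
                result.setRect(box);
            } else {
                result.add(box);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up example boxes in PDF user space units.
        Rectangle2D textLine = new Rectangle2D.Double(72, 700, 200, 12);
        Rectangle2D picture = new Rectangle2D.Double(300, 500, 100, 150);
        System.out.println(union(Arrays.asList(textLine, picture)));
    }
}
```

A `null` result signals a page on which nothing was drawn, which is why callers should check for it before using the box.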

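To scan the instructions of content streams, PDFBox offers a number of classes based on PDFStreamEngine. The simple box finder is derived from PDFGraphicsStreamEngine, which extends PDFStreamEngine with a few methods related to vector graphics: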

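import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.Arrays;

import org.apache.fontbox.util.BoundingBox;
import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDCIDFontType2;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDSimpleFont;
import org.apache.pdfbox.pdmodel.font.PDTrueTypeFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;
import org.apache.pdfbox.pdmodel.font.PDType3CharProc;
import org.apache.pdfbox.pdmodel.font.PDType3Font;
import org.apache.pdfbox.pdmodel.font.PDVectorFont;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

public class BoundingBoxFinder extends PDFGraphicsStreamEngine {
    public BoundingBoxFinder(PDPage page) {
        super(page);
    }

    /** Returns the combined bounding box, or null if nothing was drawn. */
    public Rectangle2D getBoundingBox() {
        return rectangle;
    }

    //
    // Text
    //
    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement)
            throws IOException {
        super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
        Shape shape = calculateGlyphBounds(textRenderingMatrix, font, code);
        if (shape != null) {
            Rectangle2D rect = shape.getBounds2D();
            add(rect);
        }
    }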

    /**
     * Copy of <code>org.apache.pdfbox.examples.util.DrawPrintTextLocations.calculateGlyphBounds(Matrix, PDFont, int)</code>.
     */
    private Shape calculateGlyphBounds(Matrix textRenderingMatrix, PDFont font, int code) throws IOException
    {
        GeneralPath path = null;
        AffineTransform at = textRenderingMatrix.createAffineTransform();
        at.concatenate(font.getFontMatrix().createAffineTransform());
        if (font instanceof PDType3Font)
        {
            // It is difficult to calculate the real individual glyph bounds for type 3 fonts
            // because these are not vector fonts, the content stream could contain almost anything
            // that is found in page content streams.
            PDType3Font t3Font = (PDType3Font) font;
            PDType3CharProc charProc = t3Font.getCharProc(code);
            if (charProc != null)
            {
                BoundingBox fontBBox = t3Font.getBoundingBox();
                PDRectangle glyphBBox = charProc.getGlyphBBox();
                if (glyphBBox != null)
                {
                    // PDFBOX-3850: glyph bbox could be larger than the font bbox
                    glyphBBox.setLowerLeftX(Math.max(fontBBox.getLowerLeftX(), glyphBBox.getLowerLeftX()));
                    glyphBBox.setLowerLeftY(Math.max(fontBBox.getLowerLeftY(), glyphBBox.getLowerLeftY()));
                    glyphBBox.setUpperRightX(Math.min(fontBBox.getUpperRightX(), glyphBBox.getUpperRightX()));
                    glyphBBox.setUpperRightY(Math.min(fontBBox.getUpperRightY(), glyphBBox.getUpperRightY()));
                    path = glyphBBox.toGeneralPath();
                }
            }
        }
        else if (font instanceof PDVectorFont)
        {
            PDVectorFont vectorFont = (PDVectorFont) font;
            path = vectorFont.getPath(code);

            if (font instanceof PDTrueTypeFont)
            {
                PDTrueTypeFont ttFont = (PDTrueTypeFont) font;
                int unitsPerEm = ttFont.getTrueTypeFont().getHeader().getUnitsPerEm();
                at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
            }
            if (font instanceof PDType0Font)
            {
                PDType0Font t0font = (PDType0Font) font;
                if (t0font.getDescendantFont() instanceof PDCIDFontType2)
                {
                    int unitsPerEm = ((PDCIDFontType2) t0font.getDescendantFont()).getTrueTypeFont().getHeader().getUnitsPerEm();
                    at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
                }
            }
        }
        else if (font instanceof PDSimpleFont)
        {
            PDSimpleFont simpleFont = (PDSimpleFont) font;

            // these two lines do not always work, e.g. for the TT fonts in file 032431.pdf
            // which is why PDVectorFont is tried first.
            String name = simpleFont.getEncoding().getName(code);
            path = simpleFont.getPath(name);
        }
        else
        {
            // shouldn't happen, please open issue in JIRA
            System.out.println("Unknown font class: " + font.getClass());
        }
        if (path == null)
        {
            return null;
        }
        return at.createTransformedShape(path.getBounds2D());
    }

    //
    // Bitmaps
    //
    @Override
    public void drawImage(PDImage pdImage) throws IOException {
        // An image is painted into the unit square, mapped to the page by the
        // current transformation matrix; add its four transformed corners.
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                add(ctm.transformPoint(x, y));
            }
        }
    }

    //
    // Paths
    //
    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
        addToPath(p0, p1, p2, p3);
    }

    @Override
    public void clip(int windingRule) throws IOException {
        // Clip paths are ignored in this basic finder.
    }

    @Override
    public void moveTo(float x, float y) throws IOException {
        addToPath(x, y);
    }

    @Override
    public void lineTo(float x, float y) throws IOException {
        addToPath(x, y);
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {
        // The control points enclose the curve, so the box may be slightly too large.
        addToPath(x1, y1);
        addToPath(x2, y2);
        addToPath(x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException {
        // The current point is not tracked in this basic finder.
        return null;
    }

    @Override
    public void closePath() throws IOException {
    }

    @Override
    public void endPath() throws IOException {
        // The path was neither stroked nor filled, so discard it.
        rectanglePath = null;
    }

    @Override
    public void strokePath() throws IOException {
        addPath();
    }

    @Override
    public void fillPath(int windingRule) throws IOException {
        addPath();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException {
        addPath();
    }

    @Override
    public void shadingFill(COSName shadingName) throws IOException {
        // Shading fills are ignored in this basic finder.
    }

    void addToPath(Point2D... points) {
        Arrays.asList(points).forEach(p -> addToPath(p.getX(), p.getY()));
    }

    void addToPath(double newx, double newy) {
        if (rectanglePath == null) {
            rectanglePath = new Rectangle2D.Double(newx, newy, 0, 0);
        } else {
            rectanglePath.add(newx, newy);
        }
    }

    void addPath() {
        if (rectanglePath != null) {
            add(rectanglePath);
            rectanglePath = null;
        }
    }

    void add(Rectangle2D rect) {
        if (rectangle == null) {
            rectangle = new Rectangle2D.Double();
            rectangle.setRect(rect);
        } else {
            rectangle.add(rect);
        }
    }

    void add(Point2D... points) {
        for (Point2D point : points) {
            add(point.getX(), point.getY());
        }
    }

    void add(double newx, double newy) {
        if (rectangle == null) {
            rectangle = new Rectangle2D.Double(newx, newy, 0, 0);
        } else {
            rectangle.add(newx, newy);
        }
    }

    Rectangle2D rectanglePath = null;
    Rectangle2D rectangle = null;
}

(BoundingBoxFinder on GitHub)

As you can see, I borrowed the calculateGlyphBounds helper method from a PDFBox example class.
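The drawImage override above relies on the fact that a PDF image is always painted into the unit square, which the current transformation matrix (CTM) then maps onto the page. A minimal sketch of that corner transformation, using the JDK's `AffineTransform` as a stand-in for the PDFBox `Matrix` (the class `ImageBoundsSketch` is hypothetical and the matrix values are made up), might look like this:

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

// Hypothetical stand-alone illustration; AffineTransform replaces the PDFBox Matrix here.
final class ImageBoundsSketch {
    /** Bounding box of the unit square after mapping its four corners with the CTM. */
    static Rectangle2D imageBounds(AffineTransform ctm) {
        Rectangle2D bounds = null;
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                Point2D corner = ctm.transform(new Point2D.Double(x, y), null);
                if (bounds == null) {
                    bounds = new Rectangle2D.Double(corner.getX(), corner.getY(), 0, 0);
                } else {
                    bounds.add(corner);
                }
            }
        }
        return bounds;
    }

    public static void main(String[] args) {
        // Corresponds to the PDF instruction "200 0 0 100 50 400 cm":
        // a 200 x 100 image whose lower left corner is at (50, 400).
        AffineTransform ctm = new AffineTransform(200, 0, 0, 100, 50, 400);
        System.out.println(imageBounds(ctm));
    }
}
```

All four corners are needed because a rotated or skewed CTM can turn any corner of the unit square into an extreme point of the box.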

Usage example

You can use the BoundingBoxFinder like this to draw a border line along the edge of the bounding box for a given PDPage pdPage of a PDDocument pdDocument:

// Needs java.awt.Color, java.awt.geom.Rectangle2D, java.io.IOException,
// org.apache.pdfbox.pdmodel.* and PDPageContentStream.AppendMode.
void drawBoundingBox(PDDocument pdDocument, PDPage pdPage) throws IOException {
    BoundingBoxFinder boxFinder = new BoundingBoxFinder(pdPage);
    boxFinder.processPage(pdPage);
    Rectangle2D box = boxFinder.getBoundingBox();
    // box is null if the page draws nothing at all
    if (box != null) {
        try (PDPageContentStream canvas = new PDPageContentStream(pdDocument, pdPage, AppendMode.APPEND, true, true)) {
            canvas.setStrokingColor(Color.magenta);
            canvas.addRect((float) box.getMinX(), (float) box.getMinY(), (float) box.getWidth(), (float) box.getHeight());
            canvas.stroke();
        }
    }
}

(DetermineBoundingBox helper method)

The result looks like this:

Screenshot
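Since the question is about getting rid of excess whitespace, one possible next step is to turn the found box into a tighter page box. The sketch below is only an assumption about how one might proceed: the class `CropSketch`, its `paddedCrop` method and the margin value are hypothetical. It pads the content box a little and clamps it to the media box using plain `java.awt.geom` types.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox or of the answer's code.
final class CropSketch {
    /** Grows the content box by margin on every side and clamps it to the media box. */
    static Rectangle2D paddedCrop(Rectangle2D content, Rectangle2D mediaBox, double margin) {
        Rectangle2D padded = new Rectangle2D.Double(
                content.getX() - margin,
                content.getY() - margin,
                content.getWidth() + 2 * margin,
                content.getHeight() + 2 * margin);
        // Never extend beyond the physical page.
        return padded.createIntersection(mediaBox);
    }

    public static void main(String[] args) {
        // Made-up values: an A4-sized media box and a content box found on it.
        Rectangle2D mediaBox = new Rectangle2D.Double(0, 0, 595, 842);
        Rectangle2D content = new Rectangle2D.Double(100, 200, 300, 400);
        System.out.println(paddedCrop(content, mediaBox, 10));
    }
}
```

The resulting rectangle could then be converted to a PDRectangle and set as the page's crop box before printing; whether that suits a given print setup would need checking.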

Only a proof of concept

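Beware, the BoundingBoxFinder really is not very sophisticated. In particular, it does not ignore invisible content such as a white background rectangle, text drawn in the "invisible" rendering mode, arbitrary content covered by a white filled path, white parts of bitmap images, ... Furthermore, it does ignore clip paths, unusual blend modes, annotations, ...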

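Extending the class to handle those cases properly is fairly straightforward, but the sum of the code to add would exceed the scope of a Stack Overflow answer.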
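To give a rough idea of what such an extension might involve, here is a deliberately simplified sketch of clip handling using only rectangular clips. Everything in it is an assumption for illustration: the class `ClippedBoxAccumulator` and its methods are hypothetical, and a real extension would have to work with arbitrary clip paths and with saving and restoring the graphics state.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical illustration: only the part of a drawn box inside the clip is counted.
final class ClippedBoxAccumulator {
    private Rectangle2D clip;   // null means no clip has been set yet
    private Rectangle2D total;  // null means nothing visible was drawn yet

    /** Narrows the current clip, as a clip instruction would. */
    void intersectClip(Rectangle2D newClip) {
        clip = (clip == null) ? (Rectangle2D) newClip.clone() : clip.createIntersection(newClip);
    }

    /** Adds the visible part of a drawn box to the total. */
    void add(Rectangle2D drawn) {
        Rectangle2D visible = (clip == null) ? drawn : drawn.createIntersection(clip);
        // A negative size means the box lies completely outside the clip.
        // Zero sizes are kept, since a horizontal line has a box of height 0.
        if (visible.getWidth() < 0 || visible.getHeight() < 0) {
            return;
        }
        if (total == null) {
            total = new Rectangle2D.Double();
            total.setRect(visible);
        } else {
            total.add(visible);
        }
    }

    Rectangle2D getBoundingBox() {
        return total;
    }

    public static void main(String[] args) {
        ClippedBoxAccumulator accumulator = new ClippedBoxAccumulator();
        accumulator.intersectClip(new Rectangle2D.Double(0, 0, 100, 100));
        accumulator.add(new Rectangle2D.Double(50, 50, 100, 100));  // partly clipped
        accumulator.add(new Rectangle2D.Double(200, 200, 10, 10));  // fully clipped away
        System.out.println(accumulator.getBoundingBox());
    }
}
```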


For the code in this answer I used the current PDFBox 3.0.0-SNAPSHOT development branch, but it should also work out of the box with the current 2.x versions.

Regarding "java - How to determine the position of the actual PDF content with PDFBox?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52821421/
