java - IText reading PDF like pdftotext -layout?

Reposted by: 行者123  Updated: 2023-12-02 05:36:52

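I am looking for the simplest way to implement a Java solution whose output is quite similar to that of

pdftotext -layout FILE

on a Linux machine. (And of course it should be cheap as well.)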

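I have just tried some code snippets for IText, PDFBox and PDFTextStream. The most accurate solution so far is PDFTextStream, which uses the VisualOutputTarget to produce a good representation of my file.

So my column layout is recognized correctly and I can work with it. But there should be a solution for IText as well, shouldn't there?

Every simple snippet I found produces plainly ordered strings that are a mess (rows, columns and lines get mixed up). Is there a solution that is easier and does not require writing my own strategy? Or is there an open-source strategy I can use?

// I wrote my own strategy object following mkl's instructions, as shown here: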

package com.test.pdfextractiontest.itext;

import ...


public class MyLocationTextExtractionStrategy implements TextExtractionStrategy {

    /** set to true for debugging */
    static boolean DUMP_STATE = false;

    /** a summary of all found text */
    private final List<TextChunk> locationalResult = new ArrayList<TextChunk>();


    public MyLocationTextExtractionStrategy() {
    }


    @Override
    public void beginTextBlock() {
    }


    @Override
    public void endTextBlock() {
    }

    private boolean startsWithSpace(final String str) {
        if (str.length() == 0) {
            return false;
        }
        return str.charAt(0) == ' ';
    }


    private boolean endsWithSpace(final String str) {
        if (str.length() == 0) {
            return false;
        }
        return str.charAt(str.length() - 1) == ' ';
    }

    private List<TextChunk> filterTextChunks(final List<TextChunk> textChunks, final TextChunkFilter filter) {
        if (filter == null) {
            return textChunks;
        }

        final List<TextChunk> filtered = new ArrayList<TextChunk>();
        for (final TextChunk textChunk : textChunks) {
            if (filter.accept(textChunk)) {
                filtered.add(textChunk);
            }
        }
        return filtered;
    }


    protected boolean isChunkAtWordBoundary(final TextChunk chunk, final TextChunk previousChunk) {
        final float dist = chunk.distanceFromEndOf(previousChunk);

        if (dist < -chunk.getCharSpaceWidth() || dist > chunk.getCharSpaceWidth() / 2.0f) {
            return true;
        }

        return false;
    }

    public String getResultantText(final TextChunkFilter chunkFilter) {
        if (DUMP_STATE) {
            dumpState();
        }

        final List<TextChunk> filteredTextChunks = filterTextChunks(this.locationalResult, chunkFilter);
        Collections.sort(filteredTextChunks);

        final StringBuffer sb = new StringBuffer();
        TextChunk lastChunk = null;
        for (final TextChunk chunk : filteredTextChunks) {

            if (lastChunk == null) {
                sb.append(chunk.text);
            } else {
                if (chunk.sameLine(lastChunk)) {

                    if (isChunkAtWordBoundary(chunk, lastChunk) && !startsWithSpace(chunk.text)
                            && !endsWithSpace(lastChunk.text)) {
                        sb.append(' ');
                    }
                    final Float dist = chunk.distanceFromEndOf(lastChunk)/3;
                    for(int i = 0; i<Math.round(dist); i++) {
                        sb.append(' ');
                    }
                    sb.append(chunk.text);
                } else {
                    sb.append('\n');
                    sb.append(chunk.text);
                }
            }
            lastChunk = chunk;
        }

        return sb.toString();
    }

    /** Returns a String with the resulting text. */
    @Override
    public String getResultantText() {

        return getResultantText(null);

    }

    private void dumpState() {
        for (final TextChunk location : this.locationalResult) {
            location.printDiagnostics();

            System.out.println();
        }

    }


    @Override
    public void renderText(final TextRenderInfo renderInfo) {
        LineSegment segment = renderInfo.getBaseline();
        if (renderInfo.getRise() != 0) { 

            final Matrix riseOffsetTransform = new Matrix(0, -renderInfo.getRise());
            segment = segment.transformBy(riseOffsetTransform);
        }
        final TextChunk location =
                new TextChunk(renderInfo.getText(), segment.getStartPoint(), segment.getEndPoint(),
                        renderInfo.getSingleSpaceWidth(),renderInfo);
        this.locationalResult.add(location);
    }

    public static class TextChunk implements Comparable<TextChunk> {
        /** the text of the chunk */
        private final String text;
        /** the starting location of the chunk */
        private final Vector startLocation;
        /** the ending location of the chunk */
        private final Vector endLocation;
        /** unit vector in the orientation of the chunk */
        private final Vector orientationVector;
        /** the orientation as a scalar for quick sorting */
        private final int orientationMagnitude;

        private final TextRenderInfo info;

        private final int distPerpendicular;

        private final float distParallelStart;

        private final float distParallelEnd;
        /** the width of a single space character in the font of the chunk */
        private final float charSpaceWidth;

        public TextChunk(final String string, final Vector startLocation, final Vector endLocation,
                final float charSpaceWidth,final TextRenderInfo ri) {
            this.text = string;
            this.startLocation = startLocation;
            this.endLocation = endLocation;
            this.charSpaceWidth = charSpaceWidth;

            this.info = ri;

            Vector oVector = endLocation.subtract(startLocation);
            if (oVector.length() == 0) {
                oVector = new Vector(1, 0, 0);
            }
            this.orientationVector = oVector.normalize();
            this.orientationMagnitude =
                    (int) (Math.atan2(this.orientationVector.get(Vector.I2), this.orientationVector.get(Vector.I1)) * 1000);

            final Vector origin = new Vector(0, 0, 1);
            this.distPerpendicular = (int) startLocation.subtract(origin).cross(this.orientationVector).get(Vector.I3);

            this.distParallelStart = this.orientationVector.dot(startLocation);
            this.distParallelEnd = this.orientationVector.dot(endLocation);
        }

        public Vector getStartLocation() {
            return this.startLocation;
        }


        public Vector getEndLocation() {
            return this.endLocation;
        }


        public String getText() {
            return this.text;
        }

        public float getCharSpaceWidth() {
            return this.charSpaceWidth;
        }

        private void printDiagnostics() {
            System.out.println("Text (@" + this.startLocation + " -> " + this.endLocation + "): " + this.text);
            System.out.println("orientationMagnitude: " + this.orientationMagnitude);
            System.out.println("distPerpendicular: " + this.distPerpendicular);
            System.out.println("distParallel: " + this.distParallelStart);
        }


        public boolean sameLine(final TextChunk as) {
            if (this.orientationMagnitude != as.orientationMagnitude) {
                return false;
            }
            if (this.distPerpendicular != as.distPerpendicular) {
                return false;
            }
            return true;
        }


        public float distanceFromEndOf(final TextChunk other) {
            final float distance = this.distParallelStart - other.distParallelEnd;
            return distance;
        }

        public float myDistanceFromEndOf(final TextChunk other) {
            final float distance = this.distParallelStart - other.distParallelEnd;
            return distance;
        }


        @Override
        public int compareTo(final TextChunk rhs) {
            if (this == rhs) {
                return 0; // not really needed, but just in case
            }

            int rslt;
            rslt = compareInts(this.orientationMagnitude, rhs.orientationMagnitude);
            if (rslt != 0) {
                return rslt;
            }

            rslt = compareInts(this.distPerpendicular, rhs.distPerpendicular);
            if (rslt != 0) {
                return rslt;
            }

            return Float.compare(this.distParallelStart, rhs.distParallelStart);
        }

        private static int compareInts(final int int1, final int int2) {
            return int1 == int2 ? 0 : int1 < int2 ? -1 : 1;
        }


        public TextRenderInfo getInfo() {
            return this.info;
        }

    }


    @Override
    public void renderImage(final ImageRenderInfo renderInfo) {
        // do nothing
    }


    public static interface TextChunkFilter {

        public boolean accept(TextChunk textChunk);
    }


}

As you can see, most of it is the same as the original class. I only added this:

                final Float dist = chunk.distanceFromEndOf(lastChunk)/3;
                for(int i = 0; i<Math.round(dist); i++) {
                    sb.append(' ');
                }

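to the getResultantText method to expand the gaps with spaces. But here is the problem:

The distances seem to be inaccurate or imprecise. The result looks like this: (screenshot not preserved)

Does anyone have an idea how to calculate a better value, or values, for the distance? I think the cause is that the original font is ArialMT while my editor uses Courier. To work with this table, though, I need to be able to split it at the correct positions to get at my data, and that is difficult because of the floating-point start and end positions of the values, etc.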

:-/

Best answer

The problem with your approach of inserting spaces like this

            final Float dist = chunk.distanceFromEndOf(lastChunk)/3;
            for(int i = 0; i<Math.round(dist); i++) {
                sb.append(' ');
            }

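is that it assumes that the current position in the StringBuffer corresponds exactly to the end of lastChunk, given a character width of 3 user space units. This need not be the case; generally each addition of characters destroys any such earlier correspondence. E.g. these two lines have different widths when set in a proportional font: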

ililili

MWMWMWM

while in a StringBuffer they occupy the same length.
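To make the mismatch concrete, here is a minimal sketch that compares the character count of the two sample lines with an estimated rendered width. The class name, the helper methods and the glyph widths (roughly Helvetica-like values in 1/1000 of the font size) are assumptions chosen for illustration only; a real PDF font supplies its own metrics.

```java
/** Illustrative only: string length versus estimated width in a proportional font. */
final class ProportionalWidthDemo {

    /**
     * Assumed advance width of a glyph in 1/1000 of the font size.
     * The values are rough, Helvetica-like numbers picked for this sketch.
     */
    static int glyphWidth(char c) {
        switch (c) {
            case 'i':
            case 'l':
                return 222;
            case 'M':
                return 833;
            case 'W':
                return 944;
            default:
                return 500; // arbitrary fallback for characters not listed above
        }
    }

    /** Estimated width of the text in user space units at the given font size. */
    static float renderedWidth(String text, float fontSize) {
        int thousandths = 0;
        for (int i = 0; i < text.length(); i++) {
            thousandths += glyphWidth(text.charAt(i));
        }
        return thousandths * fontSize / 1000f;
    }

    public static void main(String[] args) {
        String narrow = "ililili";
        String wide = "MWMWMWM";
        // Both strings have the same length but very different estimated widths.
        System.out.println(narrow.length() + " chars, width " + renderedWidth(narrow, 10f));
        System.out.println(wide.length() + " chars, width " + renderedWidth(wide, 10f));
    }
}
```

Because the two strings differ so much in width, a gap measured on the page cannot be turned into a number of spaces relative to the current buffer position; it has to be measured from a fixed reference such as the left page border.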

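Thus, you have to look at where the chunk starts relative to the left page border and add spaces to the buffer accordingly.

Furthermore, your code completely ignores free space at the start of a line.

Your results should improve if you replace the original method getResultantText(TextChunkFilter) with this code: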

public String getResultantText(TextChunkFilter chunkFilter){
    if (DUMP_STATE) dumpState();
    
    List<TextChunk> filteredTextChunks = filterTextChunks(locationalResult, chunkFilter);
    Collections.sort(filteredTextChunks);

    int startOfLinePosition = 0;
    StringBuffer sb = new StringBuffer();
    TextChunk lastChunk = null;
    for (TextChunk chunk : filteredTextChunks) {

        if (lastChunk == null){
            insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, false);
            sb.append(chunk.text);
        } else {
            if (chunk.sameLine(lastChunk))
            {
                if (isChunkAtWordBoundary(chunk, lastChunk))
                {
                    insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, !startsWithSpace(chunk.text) && !endsWithSpace(lastChunk.text));
                }
                
                sb.append(chunk.text);
            } else {
                sb.append('\n');
                startOfLinePosition = sb.length();
                insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, false);
                sb.append(chunk.text);
            }
        }
        lastChunk = chunk;
    }

    return sb.toString();       
}

void insertSpaces(StringBuffer sb, int startOfLinePosition, float chunkStart, boolean spaceRequired)
{
    int indexNow = sb.length() - startOfLinePosition;
    int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
    int spacesToInsert = indexToBe - indexNow;
    if (spacesToInsert < 1 && spaceRequired)
        spacesToInsert = 1;
    for (; spacesToInsert > 0; spacesToInsert--)
    {
        sb.append(' ');
    }
}

public float pageLeft = 0;
public float fixedCharWidth = 6;

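pageLeft is the coordinate of the left page border. The strategy does not know it and therefore has to be told explicitly; in many cases, though, 0 is the correct value.

Alternatively, one could use the minimum distParallelStart value of all chunks. This would cut off the left margin but would not require you to inject the exact left page border value.

fixedCharWidth is the assumed character width. Depending on the writing in the PDF in question, a different value may be more appropriate. In your case a value of 3 seems to work better than my 6.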
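As a rough, library-free illustration of these two ideas (a fixed character grid, and the minimum start coordinate standing in for pageLeft), consider the sketch below. The Chunk class, minStart and layout are hypothetical stand-ins, not iText types; the sketch assumes the chunks are already grouped into lines and sorted by start coordinate, and it leaves out the spaceRequired handling of insertSpaces.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: places chunks on a character grid of fixed cell width. */
final class FixedGridLayoutSketch {

    /** Hypothetical stand-in for a text chunk: its text and its start coordinate. */
    static final class Chunk {
        final String text;
        final float startX;

        Chunk(String text, float startX) {
            this.text = text;
            this.startX = startX;
        }
    }

    /** Smallest start coordinate of all chunks, used instead of the real left border. */
    static float minStart(List<List<Chunk>> lines) {
        float min = Float.MAX_VALUE;
        for (List<Chunk> line : lines) {
            for (Chunk chunk : line) {
                min = Math.min(min, chunk.startX);
            }
        }
        return min == Float.MAX_VALUE ? 0f : min;
    }

    /** Pads each chunk with spaces so that it starts at its grid column. */
    static String layout(List<List<Chunk>> lines, float fixedCharWidth) {
        float pageLeft = minStart(lines);
        StringBuilder sb = new StringBuilder();
        for (List<Chunk> line : lines) {
            int startOfLinePosition = sb.length();
            for (Chunk chunk : line) {
                int indexNow = sb.length() - startOfLinePosition;
                int indexToBe = (int) ((chunk.startX - pageLeft) / fixedCharWidth);
                for (int i = indexNow; i < indexToBe; i++) {
                    sb.append(' ');
                }
                sb.append(chunk.text);
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<List<Chunk>> lines = Arrays.asList(
                Arrays.asList(new Chunk("ililili", 72f), new Chunk("42", 180f)),
                Arrays.asList(new Chunk("MWMWMWM", 72f), new Chunk("7", 180f)));
        // Both second-column chunks end up in the same text column.
        System.out.print(layout(lines, 6f));
    }
}
```

If a chunk is wider in characters than the grid allows, the following chunk is simply appended without padding, so columns can still drift; choosing a smaller fixedCharWidth reduces that risk at the cost of wider output.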

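There is still much room for improvement in this code. E.g.

- It assumes that there are no text chunks spanning multiple table columns. This assumption is very often correct, but I have seen weird PDFs in which the normal inter-word spacing was implemented using separate text chunks at some offset, while the inter-column spacing was represented by a single space character inside a single chunk (spanning the end of one column and the start of the next)! The width of that space character had been manipulated by the word spacing setting of the PDF graphics state.
- It ignores different amounts of vertical space.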
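For the last point, one possible direction is to emit more than one line break when the vertical distance between two consecutive lines is larger than a normal line step. The following is only a sketch under assumptions: lineBreaksBetween and assumedLineHeight are hypothetical names (the latter plays the same role vertically that fixedCharWidth plays horizontally), and the inputs are plain baseline y coordinates of upright text, with y growing upwards as is usual in PDF user space.

```java
/** Illustrative only: turns vertical gaps between lines into blank lines. */
final class VerticalGapSketch {

    /**
     * Number of '\n' characters to emit between two consecutive lines.
     * At least one break is always emitted; larger gaps yield extra blank lines.
     */
    static int lineBreaksBetween(float previousBaselineY, float baselineY, float assumedLineHeight) {
        int steps = Math.round((previousBaselineY - baselineY) / assumedLineHeight);
        return Math.max(1, steps);
    }

    public static void main(String[] args) {
        float[] baselines = {700f, 688f, 652f}; // sample values, top to bottom
        StringBuilder sb = new StringBuilder("line 0");
        for (int i = 1; i < baselines.length; i++) {
            int breaks = lineBreaksBetween(baselines[i - 1], baselines[i], 12f);
            for (int b = 0; b < breaks; b++) {
                sb.append('\n');
            }
            sb.append("line ").append(i);
        }
        System.out.println(sb);
    }
}
```

In the strategy itself, the else branch that currently appends a single '\n' would be the natural place for such a calculation, using the difference between the distPerpendicular values of chunk and lastChunk as the vertical distance.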

Regarding "java - IText reading PDF like pdftotext -layout?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24887784/
