java - 使用 iText 替换 PDF 文件中的文本-6ren

java - 使用 iText 替换 PDF 文件中的文本

转载作者：塔克拉玛干更新时间：2023-11-03 05:23:45

我正在使用 iText(5.5.13) 库读取 .PDF 并替换文件中的模式。问题在于未找到该模式，因为在库读取 pdf 时不知何故出现了一些奇怪的字符。

例如，在句子中:

"This is a test in order to see if the"

当我试图阅读它时变成了这个:

[(This is a )9(te)-3(st)9( in o)-4(rd)15(er )-2(t)9(o)-5( s)8(ee)7( if t)-3(h)3(e )]

因此，如果我尝试查找并替换 "test"，则不会在 pdf 中找到 "test" 单词，并且不会被替换

这是我使用的代码:

public void processPDF(String src, String dest) {

    try {

      PdfReader reader = new PdfReader(src);
      PdfArray refs = null;
      PRIndirectReference reference = null;

      int nPages = reader.getNumberOfPages();

      for (int i = 1; i <= nPages; i++) {
        PdfDictionary dict = reader.getPageN(i);
        PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
        if (object.isArray()) {
          refs = dict.getAsArray(PdfName.CONTENTS);
          ArrayList<PdfObject> references = refs.getArrayList();

          for (PdfObject r : references) {

            reference = (PRIndirectReference) r;
            PRStream stream = (PRStream) PdfReader.getPdfObject(reference);
            byte[] data = PdfReader.getStreamBytes(stream);
            String dd = new String(data, "UTF-8");

            dd = dd.replaceAll("@pattern_1234", "trueValue");
            dd = dd.replaceAll("test", "tested");

            stream.setData(dd.getBytes());
          }

        }
        if (object instanceof PRStream) {
          PRStream stream = (PRStream) object;

          byte[] data = PdfReader.getStreamBytes(stream);
          String dd = new String(data, "UTF-8");
          System.out.println("content---->" + dd);
          dd = dd.replaceAll("@pattern_1234", "trueValue");
          dd = dd.replaceAll("This", "FIRST");

          stream.setData(dd.getBytes(StandardCharsets.UTF_8));
        }
      }
      PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
      stamper.close();
      reader.close();
    }

    catch (Exception e) {
    }
  }

最佳答案

正如评论和答案中已经提到的，PDF 不是一种用于文本编辑 的格式。它是最终格式，有关文本流、布局甚至到 Unicode 的映射的信息都是可选的。

因此，即使假设存在关于将字形映射到 Unicode 的可选信息，使用 iText 完成此任务的方法可能看起来有点不令人满意:首先使用自定义文本提取策略确定相关文本的位置，然后继续使用 PdfCleanUpProcessor 删除该位置所有内容的当前内容，最后将替换文本绘制到间隙中。

在这个答案中，我将提供一个辅助类，允许结合前两个步骤，查找和删除现有文本，其优点是确实只删除了文本，不还有任何背景图形等，如 PdfCleanUpProcessor 编辑。助手还返回已删除文本的位置，允许在其上标记替换。

帮助类基于 PdfContentStreamEditor this earlier answer .请使用the version of this class on github ，不过，因为原始类自构想以来已经得到了一些增强。

SimpleTextRemover helper 类说明了从 PDF 中正确删除文本的必要条件。实际上它有几个方面的局限性:

它只替换实际页面内容流中的文本。
要同时替换嵌入式 XObject 中的文本，必须递归地遍历相关页面的 XObject 资源，并对它们应用编辑器。
它与 SimpleTextExtractionStrategy 的“简单”方式相同:它假定显示说明的文本按阅读顺序出现在内容中。
同时处理顺序不同且指令必须排序的内容流，这意味着所有传入指令和相关呈现信息必须缓存到页面末尾，而不仅仅是一次几个指令.然后可以对渲染信息进行排序，可以在排序后的渲染信息中识别要删除的部分，可以操作相关指令，最终可以存储指令。
它不会尝试识别在视觉上代表空白的字形之间的间隙，而实际上根本没有字形。
要识别间隙，必须扩展代码以检查两个连续的字形是否完全相继出现，或者是否存在间隙或跳行。
在计算去除字形后留下的间隙时，它还没有考虑字符和单词的间距。
要改进这一点，必须改进字形宽度计算。

不过，考虑到您的内容流中的示例摘录，这些限制可能不会妨碍您。

public class SimpleTextRemover extends PdfContentStreamEditor {
    public SimpleTextRemover() {
        super (new SimpleTextRemoverListener());
        ((SimpleTextRemoverListener)getRenderListener()).simpleTextRemover = this;
    }

    /**
     * <p>Removes the string to remove from the given page of the
     * document in the PDF reader the given PDF stamper works on.</p>
     * <p>The result is a list of glyph lists each of which represents
     * a match can can be queried for position information.</p>
     */
    public List<List<Glyph>> remove(PdfStamper pdfStamper, int pageNum, String toRemove) throws IOException {
        if (toRemove.length()  == 0)
            return Collections.emptyList();

        this.toRemove = toRemove;
        cachedOperations.clear();
        elementNumber = -1;
        pendingMatch.clear();
        matches.clear();
        allMatches.clear();
        editPage(pdfStamper, pageNum);
        return allMatches;
    }

    /**
     * Adds the given operation to the cached operations and checks
     * whether some cached operations can meanwhile be processed and
     * written to the result content stream.
     */
    @Override
    protected void write(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        cachedOperations.add(new ArrayList<>(operands));

        while (process(processor)) {
            cachedOperations.remove(0);
        }
    }

    /**
     * Removes any started match and sends all remaining cached
     * operations for processing.
     */
    @Override
    public void finalizeContent() {
        pendingMatch.clear();
        try {
            while (!cachedOperations.isEmpty()) {
                if (!process(this)) {
                    // TODO: Should not happen, so warn
                    System.err.printf("Failure flushing operation %s; dropping.\n", cachedOperations.get(0));
                }
                cachedOperations.remove(0);
            }
        } catch (IOException e) {
            throw new ExceptionConverter(e);
        }
    }

    /**
     * Tries to process the first cached operation. Returns whether
     * it could be processed.
     */
    boolean process(PdfContentStreamProcessor processor) throws IOException {
        if (cachedOperations.isEmpty())
            return false;

        List<PdfObject> operands = cachedOperations.get(0);
        PdfLiteral operator = (PdfLiteral) operands.get(operands.size() - 1);
        String operatorString = operator.toString();

        if (TEXT_SHOWING_OPERATORS.contains(operatorString))
            return processTextShowingOp(processor, operator, operands);

        super.write(processor, operator, operands);
        return true;
    }

    /**
     * Tries to processes a text showing operation. Unless a match
     * is pending and starts before the end of the argument of this
     * instruction, it can be processed. If the instructions contains
     * a part of a match, it is transformed to a TJ operation and
     * the glyphs in question are replaced by text position adjustments.
     * If the original operation had a side effect (jump to next line
     * or spacing adjustment), this side effect is explicitly added.
     */
    boolean processTextShowingOp(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        PdfObject object = operands.get(operands.size() - 2);
        boolean isArray = object instanceof PdfArray;
        PdfArray array = isArray ? (PdfArray) object : new PdfArray(object);
        int elementCount = countStrings(object);

        // Currently pending glyph intersects parameter of this operation -> cannot yet process
        if (!pendingMatch.isEmpty() && pendingMatch.get(0).elementNumber < processedElements + elementCount)
            return false;

        // The parameter of this operation is subject to a match -> copy as is
        if (matches.size() == 0 || processedElements + elementCount <= matches.get(0).get(0).elementNumber || elementCount == 0) {
            super.write(processor, operator, operands);
            processedElements += elementCount;
            return true;
        }

        // The parameter of this operation contains glyphs of a match -> manipulate 
        PdfArray newArray = new PdfArray();
        for (int arrayIndex = 0; arrayIndex < array.size(); arrayIndex++) {
            PdfObject entry = array.getPdfObject(arrayIndex);
            if (!(entry instanceof PdfString)) {
                newArray.add(entry);
            } else {
                PdfString entryString = (PdfString) entry;
                byte[] entryBytes = entryString.getBytes();
                for (int index = 0; index < entryBytes.length; ) {
                    List<Glyph> match = matches.size() == 0 ? null : matches.get(0);
                    Glyph glyph = match == null ? null : match.get(0);
                    if (glyph == null || processedElements < glyph.elementNumber) {
                        newArray.add(new PdfString(Arrays.copyOfRange(entryBytes, index, entryBytes.length)));
                        break;
                    }
                    if (index < glyph.index) {
                        newArray.add(new PdfString(Arrays.copyOfRange(entryBytes, index, glyph.index)));
                        index = glyph.index;
                        continue;
                    }
                    newArray.add(new PdfNumber(-glyph.width));
                    index++;
                    match.remove(0);
                    if (match.isEmpty())
                        matches.remove(0);
                }
                processedElements++;
            }
        }
        writeSideEffect(processor, operator, operands);
        writeTJ(processor, newArray);

        return true;
    }

    /**
     * Counts the strings in the given argument, itself a string or
     * an array containing strings and non-strings.
     */
    int countStrings(PdfObject textArgument) {
        if (textArgument instanceof PdfArray) {
            int result = 0;
            for (PdfObject object : (PdfArray)textArgument) {
                if (object instanceof PdfString)
                    result++;
            }
            return result;
        } else 
            return textArgument instanceof PdfString ? 1 : 0;
    }

    /**
     * Writes side effects of a text showing operation which is going to be
     * replaced by a TJ operation. Side effects are line jumps and changes
     * of character or word spacing.
     */
    void writeSideEffect(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        switch (operator.toString()) {
        case "\"":
            super.write(processor, OPERATOR_Tw, Arrays.asList(operands.get(0), OPERATOR_Tw));
            super.write(processor, OPERATOR_Tc, Arrays.asList(operands.get(1), OPERATOR_Tc));
        case "'":
            super.write(processor, OPERATOR_Tasterisk, Collections.singletonList(OPERATOR_Tasterisk));
        }
    }

    /**
     * Writes a TJ operation with the given array unless array is empty.
     */
    void writeTJ(PdfContentStreamProcessor processor, PdfArray array) throws IOException {
        if (!array.isEmpty()) {
            List<PdfObject> operands = Arrays.asList(array, OPERATOR_TJ);
            super.write(processor, OPERATOR_TJ, operands);
        }
    }

    /**
     * Analyzes the given text render info whether it starts a new match or
     * finishes / continues / breaks a pending match. This method is called
     * by the {@link SimpleTextRemoverListener} registered as render listener
     * of the underlying content stream processor.
     */
    void renderText(TextRenderInfo renderInfo) {
        elementNumber++;
        int index = 0;
        for (TextRenderInfo info : renderInfo.getCharacterRenderInfos()) {
            int matchPosition = pendingMatch.size();
            pendingMatch.add(new Glyph(info, elementNumber, index));
            if (!toRemove.substring(matchPosition, matchPosition + info.getText().length()).equals(info.getText())) {
                reduceToPartialMatch();
            }
            if (pendingMatch.size() == toRemove.length()) {
                matches.add(new ArrayList<>(pendingMatch));
                allMatches.add(new ArrayList<>(pendingMatch));
                pendingMatch.clear();
            }
            index++;
        }
    }

    /**
     * Reduces the current pending match to an actual (partial) match
     * after the addition of the next glyph has invalidated it as a
     * whole match.
     */
    void reduceToPartialMatch() {
        outer:
        while (!pendingMatch.isEmpty()) {
            pendingMatch.remove(0);
            int index = 0;
            for (Glyph glyph : pendingMatch) {
                if (!toRemove.substring(index, index + glyph.text.length()).equals(glyph.text)) {
                    continue outer;
                }
                index++;
            }
            break;
        }
    }

    String toRemove = null;
    final List<List<PdfObject>> cachedOperations = new LinkedList<>();

    int elementNumber = -1;
    int processedElements = 0;
    final List<Glyph> pendingMatch = new ArrayList<>();
    final List<List<Glyph>> matches = new ArrayList<>();
    final List<List<Glyph>> allMatches = new ArrayList<>();

    /**
     * Render listener class used by {@link SimpleTextRemover} as listener
     * of its content stream processor ancestor. Essentially it forwards
     * {@link TextRenderInfo} events and ignores all else.
     */
    static class SimpleTextRemoverListener implements RenderListener {
        @Override
        public void beginTextBlock() { }

        @Override
        public void renderText(TextRenderInfo renderInfo) {
            simpleTextRemover.renderText(renderInfo);
        }

        @Override
        public void endTextBlock() { }

        @Override
        public void renderImage(ImageRenderInfo renderInfo) { }

        SimpleTextRemover simpleTextRemover = null;
    }

    /**
     * Value class representing a glyph with information on
     * the displayed text and its position, the overall number
     * of the string argument of a text showing instruction
     * it is in and the index at which it can be found therein,
     * and the width to use as text position adjustment when
     * replacing it. Beware, the width does not yet consider
     * character and word spacing!
     */
    public static class Glyph {
        public Glyph(TextRenderInfo info, int elementNumber, int index) {
            text = info.getText();
            ascent = info.getAscentLine();
            base = info.getBaseline();
            descent = info.getDescentLine();
            this.elementNumber = elementNumber;
            this.index = index;
            this.width = info.getFont().getWidth(text);
        }

        public final String text;
        public final LineSegment ascent;
        public final LineSegment base;
        public final LineSegment descent;
        final int elementNumber;
        final int index;
        final float width;
    }

    final PdfLiteral OPERATOR_Tasterisk = new PdfLiteral("T*");
    final PdfLiteral OPERATOR_Tc = new PdfLiteral("Tc");
    final PdfLiteral OPERATOR_Tw = new PdfLiteral("Tw");
    final PdfLiteral OPERATOR_Tj = new PdfLiteral("Tj");
    final PdfLiteral OPERATOR_TJ = new PdfLiteral("TJ");
    final static List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    final static Glyph[] EMPTY_GLYPH_ARRAY = new Glyph[0];
}

( SimpleTextRemover 辅助类)

你可以这样使用它:

PdfReader pdfReader = new PdfReader(SOURCE);
PdfStamper pdfStamper = new PdfStamper(pdfReader, RESULT_STREAM);
SimpleTextRemover remover = new SimpleTextRemover();

System.out.printf("\ntest.pdf - Test\n");
for (int i = 1; i <= pdfReader.getNumberOfPages(); i++)
{
    System.out.printf("Page %d:\n", i);
    List<List<Glyph>> matches = remover.remove(pdfStamper, i, "Test");
    for (List<Glyph> match : matches) {
        Glyph first = match.get(0);
        Vector baseStart = first.base.getStartPoint();
        Glyph last = match.get(match.size()-1);
        Vector baseEnd = last.base.getEndPoint();
        System.out.printf("  Match from (%3.1f %3.1f) to (%3.1f %3.1f)\n", baseStart.get(I1), baseStart.get(I2), baseEnd.get(I1), baseEnd.get(I2));
    }
}

pdfStamper.close();

( RemovePageTextContent 测试 testRemoveTestFromTest)

我的测试文件的控制台输出如下:

test.pdf - Test
Page 1:
  Match from (134,8 666,9) to (177,8 666,9)
  Match from (134,8 642,0) to (153,4 642,0)
  Match from (172,8 642,0) to (191,4 642,0)

以及输出 PDF 中这些位置缺少“测试”的情况。

您可以使用它们在相关位置绘制替换文本，而不是输出匹配坐标。

关于java - 使用 iText 替换 PDF 文件中的文本，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/57308588/

文章推荐： excel - 使用 Excel 进行二维搜索

文章推荐： java - ReadWriteLock 需要 ConcurrentHashMap 吗？

文章推荐： java - 如何将一个类注入(inject)到java.lang包中

itext - iText 包含哪些默认字体？
iText 文档指出它只包含特定的字体子集，但从未说明它们是什么。有没有人知道 iText 中默认包含哪些字体？ (我在网上搜索过，在任何地方都找不到这个字体列表!) 最佳答案它可能指的是PDF S
itext - IText 7 表格中的列宽问题
我使用固定列宽创建了下表，如下所示， Table headerTable = new Table(new float[]{5,5,5}); headerTable.setWidthPercent(
itext - 缩放图像以使用 iText 填充多个页面
我正在尝试使用 iText 缩放图像(在新的 PDF 文档上)以使其填充页面宽度而不拉伸(stretch)，这样它可能需要几页。我找到了很多解决方案，但它们都非常复杂，而且我真的不喜欢那样编码。到目
itext - Flying Saucer/iText
我正在使用 Flying Saucer/iText 生成报告。现在报告有一个条件，如果特定条件发生，报告应该移动到 pdf 的下一页，并在 PDF 上添加数据等等。问候帕万最佳答案您必须使用 c
itext - Flying Saucer/iText
我正在使用 Flying Saucer/iText 生成报告。现在报告有一个条件，如果特定条件发生，报告应该移动到 pdf 的下一页，并在 PDF 上添加数据等等。问候帕万最佳答案您必须使用 c
itext - 使用 iText 获取行位置
如何使用 iText 找到文档中的行的位置？假设我有一个 PDF 文档中的表格，并且想要阅读其中的内容；我想找到细胞的确切位置。为了做到这一点，我想我可能会找到线条的交点。最佳答案我认为您使用
itext - 使具有带有 itext 的滚动条的表的可编辑单元格只读
请找到下面的代码。 public class MakingFieldReadOnly implements PdfPCellEvent { /** The resulting PDF. */
itext - 在 iText 7 中编写文档时如何获得垂直光标位置？
在 iText 5 中有一个名为 getVerticalPosition() 的方法，它给出了下一个写入对象在页面上的位置。作为回答这个问题 How to find out the current c
itext - 在 TextField IText 中调整文本
抱歉，如果有类似我的帖子，但我是这个论坛的新手，我还没有找到它。我有动态调整 TextField 大小取决于文本大小的问题。我填写现有的 PDF - 在 AcroForm 中填写字段: form.s
itext - 要知道它是否是 ITEXT pdf 中的新页面
我正在使用 itext 生成 pdf。因此，当页面内容超出时，它会自动创建一个新页面。我想知道它是否创建了一个新页面。如果是，我想在页面顶部添加一些图像。 List paylist =new List
itext - 删除表格 iText java 的左右边距
我的有问题固定表格左侧和右侧的边距。我想删除该边距并使用没有边距或填充的所有工作表。我该怎么办？我刚刚试过这个，但对我不起作用: cell.setPaddingLeft(0); cell.se
itext - 如何使用 Itext 对齐段落(对齐)？
我有 2 行，我想对齐(证明)它们。我有这个代码: Paragraph p=new Paragraph(ANC,fontFootData); p.setLeading(1, 1);
itext - 使用外部服务和 iText 签署 PDF
我有这样的场景。我有一个生成 PDF 的应用程序，需要对其进行签名。我们没有用于签署文档的证书，因为它们位于 HSM 中，而我们使用证书的唯一方法是使用 Web 服务。此网络服务提供两个选项，发
itext - 如何实现 itext 7 表中列之间的空间？
我需要实现一个看起来像图片中的表格，列之间有空间。我试过: cell.setPaddingLeft(10); cell.setMarginLeft(10); extractio
itext - 如何实现 itext 7 表中列之间的空间？
我需要实现一个看起来像图片中的表格，列之间有空间。我试过: cell.setPaddingLeft(10); cell.setMarginLeft(10); extractio
itext - 使用 iText 将复选框添加到 PDF 文档
我需要使用 Java 的 iText 库创建一个 PDF 文档。我还需要包括一些复选框，这些复选框根据某些类变量的值打开/关闭。我找到了一些关于交互式表单的示例，但我不需要这种复杂程度:只需将一些复选
itext - 如何使用 iText PdfStamper 将内容添加到 PDF
我正在开发一个系统，我必须在其中将一些图像添加到现有的 PDF 文档中。这适用于 iText 5.1.3，但由于某种原因，在包含扫描图像的 PDF 中，它不会添加任何图像。这是 PDF Docum
itext - 使用 iText 提取 PDF 文本
我们正在研究信息提取，我们想使用iText。我们正在探索 iText。根据我们查阅过的文献，iText 是最好的工具。是否可以从 iText 中每行的 pdf 中提取文本？我在与我的相关的 stac
itext - 使用 iText 填充现有的 pdf 文本字段
我已经创建了一个带有一些文本字段的 pdf 文档。我可以使用 Adobe 阅读器填充这些文本字段并将这些值保存在该文件中。我的问题是，我可以使用 iText 以编程方式执行此操作吗？如果可能，请
itext - 如何摆脱 PdfPCell、iText 5 中的顶部填充
我正在使用 iText 5 表创建标签(如 Avery 标签)。标签元素的定位需要一些非常严格的公差，以便适合标签上的所有内容。我的问题是标签上有多个区域为 PdfPCells。我需要将文本放入这些区

塔克拉玛干

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

java - 使用 iText 替换 PDF 文件中的文本