python - PyPDF2: Copying a PDF produces blank pages


Reposted by: 行者123, updated: 2023-12-03 19:30:04


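I am using PyPDF2 to modify a PDF document (adding bookmarks). So I need to read in the entire source PDF and write it back out, keeping as much of the data intact as possible. Simply writing each page into a new PDF object may not be enough to preserve the document metadata.
PdfFileWriter() does have several methods for copying a whole file: cloneDocumentFromReader, appendPagesFromReader and cloneReaderDocumentRoot. However, all of them have problems.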
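For comparison, the page-by-page route mentioned above can be sketched as follows. This is only a sketch under assumptions: it uses the PyPDF2 1.x method names (getNumPages, getPage, getDocumentInfo, addPage, addMetadata), the helper name copy_pages_with_info is made up for illustration, and it carries over only the string entries of the info dictionary, so anything else stored in the document catalog is not copied.

```python
def copy_pages_with_info(reader, writer):
    """Copy every page, then the string entries of the info dictionary.

    `reader` and `writer` are expected to behave like PyPDF2 1.x
    PdfFileReader / PdfFileWriter objects (assumption).
    """
    for index in range(reader.getNumPages()):
        writer.addPage(reader.getPage(index))

    info = reader.getDocumentInfo()  # may be None when the PDF has no /Info
    if info:
        # Keep only plain string values; keys look like '/Title', '/Author'.
        text_entries = {str(key): str(value)
                        for key, value in info.items()
                        if isinstance(value, str)}
        writer.addMetadata(text_entries)
    return writer


# Possible usage with PyPDF2 1.x (not executed here):
#   from PyPDF2 import PdfFileReader, PdfFileWriter
#   with open("in.pdf", "rb") as src, open("out.pdf", "wb") as dst:
#       copy_pages_with_info(PdfFileReader(src), PdfFileWriter()).write(dst)
```

Outlines, page labels and similar catalog-level data are not handled by this sketch, which is the limitation the question is worried about.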

If I use cloneDocumentFromReader or appendPagesFromReader, I get a valid PDF file with the correct number of pages, but all of the pages are blank.

If I use cloneReaderDocumentRoot, I get a minimal valid PDF file, but with no pages or data.
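A rough way to inspect such an output file from Python, rather than by eye, is to check whether each page dictionary still has a /Contents entry. This is a sketch assuming the PyPDF2 1.x reader methods getNumPages and getPage; pages_with_contents is a hypothetical helper name, and a page can still render blank even when /Contents is present (for example if the stream is empty).

```python
def pages_with_contents(reader):
    """Return, per page, whether the page dictionary has a /Contents entry."""
    return ["/Contents" in reader.getPage(index)
            for index in range(reader.getNumPages())]


# Possible usage with PyPDF2 1.x (not executed here):
#   from PyPDF2 import PdfFileReader
#   with open("out.pdf", "rb") as f:
#       print(pages_with_contents(PdfFileReader(f)))
```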

This has been asked before, but without a successful answer.
Another question asks about Blank pages in PyPDF2, but I could not apply the answers given there.

Here is my code:

from PyPDF2 import PdfFileReader, PdfFileWriter  # PyPDF2 1.x class names


def bookmark(incomingFile):
    fileObj = open(incomingFile, 'rb')
    output = PdfFileWriter()
    input = PdfFileReader(fileObj)

    # Copy the pages from the reader into the writer
    output.appendPagesFromReader(input)
    #output.cloneDocumentFromReader(input)
    myTableOfContents = [
            ('Page 1', 0),
            ('Page 2', 1),
            ('Page 3', 2)
            ]
    # output.addBookmark(title, pagenum, parent=None, color=None, bold=False, italic=False, fit='/Fit')
    for title, pagenum in myTableOfContents:
        output.addBookmark(title, pagenum, parent=None)

    # Show the outline (bookmark) panel when the document is opened
    output.setPageMode("/UseOutlines")

    # Note: this reopens the input path for writing while fileObj is still open for reading
    outputStream = open(incomingFile, "wb")
    output.write(outputStream)
    outputStream.close()
    fileObj.close()

I tend to get errors when PyPDF2 cannot add a bookmark to the PdfFileWriter object, because it does not have any pages, or something similar.
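One thing worth ruling out in the function above: it opens incomingFile with "wb" while fileObj is still open for reading on the same path, and opening a file with "wb" truncates it. If the writer still has to pull page data from the reader at write time (I believe PyPDF2 resolves objects lazily, but treat that as an assumption), the pages could come out empty. A minimal sketch of writing to a temporary file first and only then swapping it over the original; write_then_replace is a hypothetical helper, not part of PyPDF2:

```python
import os
import tempfile


def write_then_replace(target_path, write_func, close_input=None):
    """Write via write_func(stream) to a temp file, then move it over target_path."""
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf", dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp_stream:
            write_func(tmp_stream)
        if close_input is not None:
            close_input()  # release the source handle before replacing the file
        os.replace(tmp_path, target_path)
    except Exception:
        # Do not leave a half-written temp file behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Possible usage inside bookmark() (not executed here):
#   write_then_replace(incomingFile, output.write, close_input=fileObj.close)
```

The temp file is created in the same directory as the target so that the final os.replace stays on one filesystem.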

Best answer

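I also struggled with this for a long time, and finally found that PyPDF2 has this issue.
Basically, I copied the code from this answer into C:\ProgramData\Anaconda3\lib\site-packages\PyPDF2\pdf.py (the path will depend on your distribution), around line 382, in the cloneDocumentFromReader function.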

After that, I was able to append the reader's pages to the writer with writer.cloneDocumentFromReader(pdf) and, in my case, update the PDF metadata (subject, keywords, etc.).
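The metadata update mentioned here could look roughly like this. It is a sketch assuming PyPDF2 1.x, where PdfFileWriter.addMetadata takes a dictionary with '/'-prefixed keys such as /Subject and /Keywords; build_info is a hypothetical convenience helper and the sample values are made up.

```python
def build_info(**fields):
    """Turn subject='x' style arguments into {'/Subject': 'x'}, skipping None values."""
    return {"/" + name[:1].upper() + name[1:]: str(value)
            for name, value in fields.items()
            if value is not None}


# Possible usage with a PyPDF2 1.x writer (not executed here):
#   writer.cloneDocumentFromReader(pdf)
#   writer.addMetadata(build_info(subject="Quarterly report", keywords="finance, 2019"))
```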

Hope this helps.

def cloneDocumentFromReader(self, reader, after_page_append=None):
    '''
    Create a copy (clone) of a document from a PDF file reader

    :param reader: PDF file reader instance from which the clone
        should be created.
    :callback after_page_append (function): Callback function that is invoked after
        each page is appended to the writer. Signature includes a reference to the
        appended page (delegates to appendPagesFromReader). Callback signature:

        :param writer_pageref (PDF page reference): Reference to the page just
            appended to the document.
    '''
    debug = False
    if debug:
        print("Number of Objects: %d" % len(self._objects))
        for obj in self._objects:
            print("\tObject is %r" % obj)
            if hasattr(obj, "indirectRef") and obj.indirectRef is not None:
                print("\t\tObject's reference is %r %r, at PDF %r" % (obj.indirectRef.idnum, obj.indirectRef.generation, obj.indirectRef.pdf))

    # Variables used for after cloning the root to
    # improve pre- and post- cloning experience

    mustAddTogether = False
    newInfoRef = self._info
    oldPagesRef = self._pages
    oldPages = self.getObject(self._pages)

    # If there have already been any number of pages added

    if oldPages[NameObject("/Count")] > 0:

        # Keep them

        mustAddTogether = True
    else:

        # Throw the page object out

        if oldPages in self._objects:
            newInfoRef = self._pages
            self._objects.remove(oldPages)

    # Clone the reader's root document

    self.cloneReaderDocumentRoot(reader)
    if not self._root:
        self._root = self._addObject(self._root_object)

    # Sweep for all indirect references

    externalReferenceMap = {}
    self.stack = []
    newRootRef = self._sweepIndirectReferences(externalReferenceMap, self._root)

    # Delete the stack to reset

    del self.stack

    #Clean-Up Time!!!

    # Get the new root of the PDF

    realRoot = self.getObject(newRootRef)

    # Get the new pages tree root and its ID Number

    tmpPages = realRoot[NameObject("/Pages")]
    newIdNumForPages = 1 + self._objects.index(tmpPages)

    # Make an IndirectObject just for the new Pages

    self._pages = IndirectObject(newIdNumForPages, 0, self)

    # If there are any pages to add back in

    if mustAddTogether:

        # Set the new page's root's parent to the old
        # page's root's reference

        tmpPages[NameObject("/Parent")] = oldPagesRef

        # Add the reference to the new page's root in
        # the old page's kids array

        newPagesRef = self._pages
        oldPages[NameObject("/Kids")].append(newPagesRef)

        # Set all references to the root of the old/new
        # page's root

        self._pages = oldPagesRef
        realRoot[NameObject("/Pages")] = oldPagesRef

        # Update the count attribute of the page's root

        oldPages[NameObject("/Count")] = NumberObject(oldPages[NameObject("/Count")] + tmpPages[NameObject("/Count")])

    else:

        # Bump up the info's reference b/c the old
        # page's tree was bumped off

        self._info = newInfoRef
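Editing pdf.py inside site-packages gets overwritten on the next upgrade or reinstall. An alternative, sketched here as an assumption rather than something the answer did, is to keep the patched function in your own module and attach it to the class at runtime. install_method is a hypothetical helper; the patched function would also need NameObject, NumberObject and IndirectObject imported from PyPDF2.generic in the module where it is defined.

```python
def install_method(cls, name, func):
    """Replace cls.<name> with func and return the previous attribute (or None)."""
    original = getattr(cls, name, None)
    setattr(cls, name, func)
    return original


# Possible usage with PyPDF2 1.x (not executed here):
#   from PyPDF2 import PdfFileWriter
#   original = install_method(PdfFileWriter, "cloneDocumentFromReader",
#                             cloneDocumentFromReader)  # the function above
#   ...
#   install_method(PdfFileWriter, "cloneDocumentFromReader", original)  # undo
```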

Regarding python - PyPDF2: Copying a PDF produces blank pages, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55784897/
