python - PyPDF2: Copying a PDF produces blank pages

Reposted. Author: 行者123. Updated: 2023-12-03 19:30:04

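I am using PyPDF2 to alter a PDF document (adding bookmarks). So I need to read in the entire source PDF and write it out again, keeping as much of the data intact as possible. Merely writing each page into a new PDF object may not be enough to preserve the document metadata.
PdfFileWriter() does have a number of methods for copying an entire file: cloneDocumentFromReader, appendPagesFromReader, and cloneReaderDocumentRoot. However, they all have problems.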

If I use cloneDocumentFromReader or appendPagesFromReader, I get a valid PDF file with the correct number of pages, but all of the pages are blank.

If I use cloneReaderDocumentRoot, I get a minimal valid PDF file, but with no pages or data.

This has been asked before, but with no successful answers.
Other questions have asked about blank pages in PyPDF2, but I can't apply the answers given there.

Here is my code:

from PyPDF2 import PdfFileReader, PdfFileWriter

def bookmark(incomingFile):
    fileObj = open(incomingFile, 'rb')
    output = PdfFileWriter()
    input = PdfFileReader(fileObj)

    output.appendPagesFromReader(input)
    #output.cloneDocumentFromReader(input)
    myTableOfContents = [
        ('Page 1', 0),
        ('Page 2', 1),
        ('Page 3', 2)
    ]
    # output.addBookmark(title, pagenum, parent=None, color=None, bold=False, italic=False, fit='/Fit')
    for title, pagenum in myTableOfContents:
        output.addBookmark(title, pagenum, parent=None)

    output.setPageMode("/UseOutlines")

    # Note: this reopens the same path that fileObj is still reading from.
    outputStream = open(incomingFile, "wb")
    output.write(outputStream)
    outputStream.close()
    fileObj.close()

I tend to get errors when PyPDF2 can't add a bookmark to the PdfFileWriter object because it doesn't have any pages, or something similar.

Best answer

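I also struggled with this for a long time, and finally found that PyPDF2 has this issue.
Basically, I copied this answer's code into C:\ProgramData\Anaconda3\lib\site-packages\PyPDF2\pdf.py (the path will depend on your distribution), around line 382, in the cloneDocumentFromReader function.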

After that, I was able to append the reader's pages to the writer with writer.cloneDocumentFromReader(pdf) and, in my case, update the PDF metadata (subject, keywords, and so on).

Hope this helps you.
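A minimal sketch of how the patched method might be used, assuming the legacy PyPDF2 1.x API (PdfFileReader, PdfFileWriter) and that the patch from this answer has been applied to pdf.py. The names derive_output_path and clone_with_metadata, the "_bookmarked" suffix, and the metadata keys in the comment are hypothetical, chosen only for illustration. The sketch writes to a separate file rather than overwriting the source, because the reader pulls objects from the open input stream on demand.

```python
import os


def derive_output_path(source_path, suffix="_bookmarked"):
    """Return a sibling path, e.g. 'a/b/doc_bookmarked.pdf' for 'a/b/doc.pdf'."""
    root, ext = os.path.splitext(source_path)
    return root + suffix + ext


def clone_with_metadata(source_path, metadata):
    """Clone source_path into a new file and add document metadata.

    Assumes PyPDF2 1.x with the patched cloneDocumentFromReader.
    The import is local so the path helper above works without PyPDF2.
    """
    from PyPDF2 import PdfFileReader, PdfFileWriter  # legacy 1.x names

    target_path = derive_output_path(source_path)
    with open(source_path, "rb") as source:
        reader = PdfFileReader(source)
        writer = PdfFileWriter()
        writer.cloneDocumentFromReader(reader)
        # metadata is a dict such as {"/Subject": "...", "/Keywords": "..."}
        writer.addMetadata(metadata)
        # Write while the source is still open, and to a different path,
        # so the input is not truncated while it is still being read.
        with open(target_path, "wb") as target:
            writer.write(target)
    return target_path
```

Calling clone_with_metadata("report.pdf", {"/Subject": "Quarterly report"}) would, under these assumptions, leave report.pdf untouched and write report_bookmarked.pdf next to it.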

def cloneDocumentFromReader(self, reader, after_page_append=None):
    '''
    Create a copy (clone) of a document from a PDF file reader

    :param reader: PDF file reader instance from which the clone
        should be created.
    :callback after_page_append (function): Callback function that is invoked after
        each page is appended to the writer. Signature includes a reference to the
        appended page (delegates to appendPagesFromReader). Callback signature:

        :param writer_pageref (PDF page reference): Reference to the page just
            appended to the document.
    '''
    debug = False
    if debug:
        print("Number of Objects: %d" % len(self._objects))
        for obj in self._objects:
            print("\tObject is %r" % obj)
            if hasattr(obj, "indirectRef") and obj.indirectRef is not None:
                print("\t\tObject's reference is %r %r, at PDF %r" % (obj.indirectRef.idnum, obj.indirectRef.generation, obj.indirectRef.pdf))

    # Variables used after cloning the root to
    # improve the pre- and post-cloning experience
    mustAddTogether = False
    newInfoRef = self._info
    oldPagesRef = self._pages
    oldPages = self.getObject(self._pages)

    # If any pages have already been added
    if oldPages[NameObject("/Count")] > 0:
        # Keep them
        mustAddTogether = True
    else:
        # Throw the page object out
        if oldPages in self._objects:
            newInfoRef = self._pages
            self._objects.remove(oldPages)

    # Clone the reader's root document
    self.cloneReaderDocumentRoot(reader)
    if not self._root:
        self._root = self._addObject(self._root_object)

    # Sweep for all indirect references
    externalReferenceMap = {}
    self.stack = []
    newRootRef = self._sweepIndirectReferences(externalReferenceMap, self._root)

    # Delete the stack to reset
    del self.stack

    # Clean-up time

    # Get the new root of the PDF
    realRoot = self.getObject(newRootRef)

    # Get the new pages tree root and its ID number
    tmpPages = realRoot[NameObject("/Pages")]
    newIdNumForPages = 1 + self._objects.index(tmpPages)

    # Make an IndirectObject just for the new Pages
    self._pages = IndirectObject(newIdNumForPages, 0, self)

    # If there are any pages to add back in
    if mustAddTogether:
        # Set the new page tree root's parent to the old
        # page tree root's reference
        tmpPages[NameObject("/Parent")] = oldPagesRef

        # Add the reference to the new page tree root to
        # the old page tree root's kids array
        newPagesRef = self._pages
        oldPages[NameObject("/Kids")].append(newPagesRef)

        # Point all page tree root references at the
        # old (now combined) page tree root
        self._pages = oldPagesRef
        realRoot[NameObject("/Pages")] = oldPagesRef

        # Update the count attribute of the page tree root
        oldPages[NameObject("/Count")] = NumberObject(oldPages[NameObject("/Count")] + tmpPages[NameObject("/Count")])
    else:
        # Bump up the info's reference because the old
        # page tree was bumped off
        self._info = newInfoRef
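The merge branch of the patch (the mustAddTogether case) can be pictured with plain dictionaries. This is a toy model, not PyPDF2 code: merge_page_trees and the dict layout are hypothetical stand-ins for /Pages nodes with their /Kids, /Count, and /Parent entries, and the real code stores indirect references rather than the objects themselves.

```python
def merge_page_trees(old_pages, new_pages):
    """Nest new_pages under old_pages, mirroring the patch's merge branch.

    Both arguments are plain dicts standing in for PDF /Pages nodes.
    """
    # The cloned tree becomes a child of the writer's existing tree.
    new_pages["/Parent"] = old_pages
    old_pages["/Kids"].append(new_pages)
    # /Count on a /Pages node is the number of leaf pages beneath it,
    # so the parent's count grows by the child's count.
    old_pages["/Count"] = old_pages["/Count"] + new_pages["/Count"]
    return old_pages


# One page already in the writer, two pages cloned from the reader.
existing = {"/Type": "/Pages", "/Kids": ["page A"], "/Count": 1}
cloned = {"/Type": "/Pages", "/Kids": ["page B", "page C"], "/Count": 2}

merged = merge_page_trees(existing, cloned)
print(merged["/Count"])  # prints 3
```

In the other branch, where the writer had no pages yet, nothing is merged: the cloned tree simply becomes the writer's page tree.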

Regarding python - PyPDF2: Copying a PDF produces blank pages, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55784897/
