
python - Read, highlight, and save a PDF programmatically

Reposted. Author: IT王子. Updated: 2023-10-29 00:01:43

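I'd like to write a small script (to run on a headless Linux server) that reads a PDF, highlights any text matching one of the strings in an array I pass in, and then saves the modified PDF. I expect I'll end up using something like the Python bindings to poppler, but unfortunately they have next to no documentation, and I have next to no experience with Python.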

If anyone can point me to a tutorial, an example, or some helpful documentation to get me started, I'd greatly appreciate it!

Best answer

Yes, this is possible by combining pdfminer (pip install pdfminer.six) with PyPDF2.

First, find the coordinates of the text (e.g. this). Then highlight it:

#!/usr/bin/env python

"""Create sample highlight in a PDF file."""

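# Written against the older PyPDF2 API (PdfFileReader, PdfFileWriter,
# getPage, addPage); later releases deprecated these camelCase names.
from PyPDF2 import PdfFileWriter, PdfFileReader

from PyPDF2.generic import (
    DictionaryObject,
    NumberObject,
    FloatObject,
    NameObject,
    TextStringObject,
    ArrayObject
)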


def create_highlight(x1, y1, x2, y2, meta, color=(0, 1, 0)):
    """
    Create a highlight for a PDF.

    Parameters
    ----------
    x1, y1 : float
        bottom left corner
    x2, y2 : float
        top right corner
    meta : dict
        keys are "author" and "contents"
    color : iterable
        Three elements, (r,g,b)
    """
    new_highlight = DictionaryObject()

    new_highlight.update({
        # Annotation flag 4 = Print (the highlight appears when printing)
        NameObject("/F"): NumberObject(4),
        NameObject("/Type"): NameObject("/Annot"),
        NameObject("/Subtype"): NameObject("/Highlight"),

        NameObject("/T"): TextStringObject(meta["author"]),
        NameObject("/Contents"): TextStringObject(meta["contents"]),

        NameObject("/C"): ArrayObject([FloatObject(c) for c in color]),
        NameObject("/Rect"): ArrayObject([
            FloatObject(x1),
            FloatObject(y1),
            FloatObject(x2),
            FloatObject(y2)
        ]),
        # Corner order: top-left, top-right, bottom-left, bottom-right
        NameObject("/QuadPoints"): ArrayObject([
            FloatObject(x1),
            FloatObject(y2),
            FloatObject(x2),
            FloatObject(y2),
            FloatObject(x1),
            FloatObject(y1),
            FloatObject(x2),
            FloatObject(y1)
        ]),
    })

    return new_highlight


def add_highlight_to_page(highlight, page, output):
    """
    Add a highlight to a PDF page.

    Parameters
    ----------
    highlight : Highlight object
    page : PDF page object
    output : PdfFileWriter object
    """
    highlight_ref = output._addObject(highlight)

    if "/Annots" in page:
        page[NameObject("/Annots")].append(highlight_ref)
    else:
        page[NameObject("/Annots")] = ArrayObject([highlight_ref])


def main():
    # Keep the input file open until the output has been written:
    # the reader loads page data from it lazily.
    with open("samples/test3.pdf", "rb") as input_stream:
        pdf_input = PdfFileReader(input_stream)
        pdf_output = PdfFileWriter()

        page1 = pdf_input.getPage(0)

        highlight = create_highlight(89.9206, 573.1283, 376.849, 591.3563, {
            "author": "John Doe",
            "contents": "Lorem ipsum"
        })

        add_highlight_to_page(highlight, page1, pdf_output)

        pdf_output.addPage(page1)

        with open("output.pdf", "wb") as output_stream:
            pdf_output.write(output_stream)


if __name__ == '__main__':
    main()
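
The answer leaves the first step, finding the coordinates, to a linked example. As a rough sketch of how that step might look with pdfminer.six: the helper names iter_text_lines and find_matches below are hypothetical and not part of the original answer. The sketch matches whole text lines rather than the exact substring, and it assumes the page's coordinate origin is at the bottom-left corner, so the bounding boxes can be used as PDF coordinates directly.

```python
"""Find bounding boxes of text lines that contain any of the given strings."""


def iter_text_lines(pdf_path):
    """Yield (page_index, text, bbox) for every text line in the PDF.

    pdfminer.six is imported here, inside the function, so that the
    matching helper below can be used without it being installed.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    for page_index, page_layout in enumerate(extract_pages(pdf_path)):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if isinstance(line, LTTextLine):
                    # bbox is (x0, y0, x1, y1): bottom-left and top-right corners
                    yield page_index, line.get_text(), line.bbox


def find_matches(lines, needles):
    """Return [(page_index, (x0, y0, x1, y1)), ...] for lines containing a needle.

    `lines` is any iterable of (page_index, text, bbox) tuples, for example
    the output of iter_text_lines().
    """
    matches = []
    for page_index, text, bbox in lines:
        if any(needle in text for needle in needles):
            matches.append((page_index, tuple(bbox)))
    return matches
```

Each match could then be passed on as create_highlight(x0, y0, x1, y1, meta) and attached to the page at page_index. Highlighting only the matched words instead of the whole line would require walking the individual characters in each line, which this sketch does not attempt.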

Regarding python - Read, highlight, and save a PDF programmatically, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/7605577/
