pdf - Convert PDF text to outlines?

Reposted by: 行者123. Updated: 2023-12-02 11:11:11

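Does anyone know how to vectorize the text in a PDF document? That is, I want each letter to be a shape/outline, with no text content at all. I'm on Linux, so open-source or non-Windows solutions are preferred.

Context: I'm trying to edit some old PDFs for which I no longer have the fonts. I'd like to do this in Inkscape, but it replaces all the fonts with generic ones, which makes the result barely readable. I have also converted back and forth with pdf2ps and ps2pdf, but the font information survives the round trip, so the file still looks terrible when I load it into Inkscape.

Any ideas? Thanks.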

Best answer

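To achieve this, you have to:

1. Split the PDF into individual pages;
2. Convert the PDF pages to SVG;
3. Edit the pages you want;
4. Recombine the pages.

This answer omits step 3, since it cannot be done programmatically.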

Splitting the PDF

If you don't want to split the document programmatically, the modern way is to use stapler. In your favorite shell:

stapler burst file.pdf

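This produces {file_1.pdf,...,file_N.pdf}, where 1...N are the PDF pages. stapler itself uses PyPDF2, and the code that splits a PDF file is not complicated. The following function splits the file and saves the individual pages in the current directory (shamelessly copied from the commands.py file).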

import os
# Legacy PyPDF2 API: these names were removed in PyPDF2 3.0
from PyPDF2 import PdfFileWriter, PdfFileReader

def split(filename):
    # PDFs are binary files, so open the input in 'rb' mode
    with open(filename, 'rb') as inputfp:
        inputpdf = PdfFileReader(inputfp)
        num_pages = inputpdf.getNumPages()

        base, ext = os.path.splitext(os.path.basename(filename))

        # Prefix the output template with zeros so that ordering is preserved
        # (page 10 after page 09). The width is the number of digits in the
        # highest page number.
        output_template = ''.join([
            base,
            '_',
            '%0',
            str(len(str(num_pages))),
            'd',
            ext
        ])

        for page in range(num_pages):
            # One writer per page, so each output file holds a single page
            outputpdf = PdfFileWriter()
            outputpdf.addPage(inputpdf.getPage(page))

            outputname = output_template % (page + 1)

            with open(outputname, 'wb') as fp:
                outputpdf.write(fp)
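The zero-padding width is the one subtle part of the function above. Below is a minimal sketch of that naming logic on its own, using only the standard library. `page_template` is a hypothetical helper name, not part of stapler or PyPDF2. It counts digits with `len(str(n))` rather than using a logarithm, so the padding stays correct when the page count is an exact power of ten.

```python
import os


def page_template(filename, num_pages):
    """Return a %-style template such as 'file_%02d.pdf'.

    Hypothetical helper for illustration only. The width is the number of
    digits in the highest page number, so the names sort in page order.
    """
    base, ext = os.path.splitext(os.path.basename(filename))
    width = len(str(num_pages))
    # '%%' yields a literal '%', so the result is itself a format template
    return '%s_%%0%dd%s' % (base, width, ext)


template = page_template('docs/file.pdf', 12)
names = [template % page for page in (1, 9, 10, 12)]
print(names)
```

With names built this way, a plain alphabetical sort of the output files matches the page order.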

Converting the individual pages to SVG

Now, to convert the PDF into an editable file, I would probably use pdf2svg.

pdf2svg input.pdf output.svg

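If we look at the pdf2svg.c file, we can see that the code is, in principle, not that complicated. Below is a minimal working example in Python (the input file name is passed as inputname and the output file name as outputname). It requires the pycairo and pypoppler libraries: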

import os

import cairo
import poppler

def convert(inputname, outputname):
    # Convert the input file name to an URI to please poppler
    uri = 'file://' + os.path.abspath(inputname)

    pdffile = poppler.document_new_from_file(uri, None)

    # We only have one page, since we split prior to converting. Get the page
    page = pdffile.get_page(0)

    # Get the page dimensions
    width, height = page.get_size()

    # Open the SVG file to write on
    surface = cairo.SVGSurface(outputname, width, height)
    context = cairo.Context(surface)

    # Now we finally can render the PDF to SVG
    page.render_for_printing(context)
    context.show_page()

    # Flush and close the SVG file
    surface.finish()

At this point you should have an SVG in which all text has been converted to paths, and you can edit it in Inkscape without rendering problems.
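One way to sanity-check that result is to count the `<text>` elements left in the SVG. This is a minimal sketch using only the standard library, and `count_text_elements` is a hypothetical helper name. It assumes that any remaining live text shows up as SVG `<text>` elements, which is a heuristic, not a guarantee about every converter.

```python
import xml.etree.ElementTree as ET

# ElementTree reports tags with their namespace in braces
SVG_NS = '{http://www.w3.org/2000/svg}'


def count_text_elements(svg_source):
    """Count <text> elements in an SVG document given as a string."""
    root = ET.fromstring(svg_source)
    return sum(1 for _ in root.iter(SVG_NS + 'text'))


# Two tiny hand-written documents for illustration
outlined = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<path d="M0 0 L10 0 L10 10 Z"/>'
    '</svg>'
)
with_text = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<text x="0" y="10">Hello</text>'
    '</svg>'
)
print(count_text_elements(outlined), count_text_elements(with_text))
```

For a file on disk, read its contents and pass them in. A count of zero suggests the letters are stored as shapes.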

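Combining steps 1 and 2

You can call pdf2svg in a for loop to do this, but you need to know the number of pages beforehand. The code below determines the page count and performs the conversion in a single step. It requires only pycairo and pypoppler: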

import os, math

import cairo
import poppler

def convert(inputname, base=None):
    '''Converts a multi-page PDF to multiple SVG files.

    :param inputname: Name of the PDF to be converted
    :param base: Base name for the SVG files (optional)
    '''
    if base is None:
        base, ext = os.path.splitext(os.path.basename(inputname))

    # Convert the input file name to an URI to please poppler
    uri = 'file://' + os.path.abspath(inputname)

    pdffile = poppler.document_new_from_file(uri, None)

    pages = pdffile.get_n_pages()

    # Prefix the output template with zeros so that ordering is preserved
    # (page 10 after page 09)
    output_template = ''.join([
        base,
        '_',
        '%0',
        str(len(str(pages))),  # digits in the highest page number
        'd',
        '.svg'
    ])

    # Iterate over all pages
    for nthpage in range(pages):
        page = pdffile.get_page(nthpage)

        # Output file name based on template
        outputname = output_template % (nthpage + 1)

        # Get the page dimensions
        width, height = page.get_size()

        # Open the SVG file to write on
        surface = cairo.SVGSurface(outputname, width, height)
        context = cairo.Context(surface)

        # Now we finally can render the PDF to SVG
        page.render_for_printing(context)
        context.show_page()

        # Free some memory
        surface.finish()
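The for-loop alternative mentioned at the start of this section could be sketched as follows, with the page count supplied by the caller. It assumes the pdf2svg command accepts an optional 1-based page number as its third argument; check `pdf2svg` usage on your system before relying on it. `pdf2svg_commands` and `run_all` are hypothetical helper names.

```python
import os
import subprocess


def pdf2svg_commands(inputname, pages, base=None):
    """Build one pdf2svg argument list per page (pages are 1-based)."""
    if base is None:
        base = os.path.splitext(os.path.basename(inputname))[0]
    # Zero-pad to the number of digits in the highest page number
    width = len(str(pages))
    commands = []
    for page in range(1, pages + 1):
        outputname = '%s_%0*d.svg' % (base, width, page)
        # Assumption: the third argument selects the page to convert
        commands.append(['pdf2svg', inputname, outputname, str(page)])
    return commands


def run_all(commands):
    """Run each command in turn; requires pdf2svg on the PATH."""
    for command in commands:
        subprocess.check_call(command)


for command in pdf2svg_commands('file.pdf', 3):
    print(' '.join(command))
```

Building the argument lists separately from running them makes it easy to inspect the commands first. Call `run_all(pdf2svg_commands('file.pdf', 3))` to run the conversion.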

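Assembling the SVGs into a single PDF

To reassemble, you can convert the files manually with the Inkscape/stapler pair. But writing code that does this is not hard. The code below uses rsvg and cairo to convert from SVG and merge everything into a single PDF: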

import rsvg
import cairo

def convert_merge(inputfiles, outputname):
    # We have to create a PDF surface and inform a size. The size is
    # irrelevant, though, as we will define the sizes of each page
    # individually.
    outputsurface = cairo.PDFSurface(outputname, 1, 1)
    outputcontext = cairo.Context(outputsurface)

    for inputfile in inputfiles:
        # Open the SVG
        svg = rsvg.Handle(file=inputfile)

        # Set the size of the page itself
        outputsurface.set_size(svg.props.width, svg.props.height)

        # Draw on the PDF
        svg.render_cairo(outputcontext)

        # Finish the page and start a new one
        outputcontext.show_page()

    # Free some memory
    outputsurface.finish()
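The pages end up in the merged PDF in the order the file names are passed in, so the list should be sorted by page number first. Here is a minimal sketch of such a sort, assuming the page number is the last run of digits in each file name. `page_order_key` and `in_page_order` are hypothetical helper names.

```python
import re


def page_order_key(filename):
    """Sort key: the last run of digits in the name, as an integer."""
    numbers = re.findall(r'\d+', filename)
    # Names without digits sort first; ties fall back to the name itself
    return (int(numbers[-1]) if numbers else 0, filename)


def in_page_order(filenames):
    """Return the file names sorted by page number."""
    return sorted(filenames, key=page_order_key)


files = ['file_10.svg', 'file_2.svg', 'file_1.svg']
print(in_page_order(files))
```

A call such as `convert_merge(in_page_order(glob.glob('file_*.svg')), 'merged.pdf')` would then merge the pages in order, even when the file names are not zero-padded.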

PS: It should be possible to use the pdftocairo command, but it does not seem to call render_for_printing(), so the output SVG keeps the font information.

A similar question to "pdf - Convert PDF text to outlines?" can be found on Stack Overflow: https://stackoverflow.com/questions/26855026/
