pdf - Using Ghostscript to handle (remap) missing or problematic (CID/CJK) fonts in a PDF?

Reposted by: 行者123 · Updated: 2023-12-04 12:11:16

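In short, I'm dealing with a problematic PDF which:

cannot be fully rendered in a document viewer such as evince, because font information is missing;

yet ghostscript can fully render the same PDF.

So, whatever ghostscript uses to fill in the blanks (possibly fallback glyphs, or a different method of accessing the fonts), I'd like to use ghostscript to produce ("distill") an output PDF with almost no changes apart from the added font information, so that evince can render the document the same way ghostscript does.

My question is: is this possible, and if so, what command line would achieve it?

Many thanks in advance for any answers,
Cheers!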

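Details:

I'm actually on an older Ubuntu 10.04, and what I'm hitting may not be a bug as such but an installation problem for evince (a missing poppler-data package), as described in Bug #386008 "Some fonts fail to display due to "Unknown font tag..."" : Bugs : "poppler" package : Ubuntu.

However, that is exactly the situation I want to handle, so I'll use the fontspec.pdf attached to that report ("PDF triggering the bug.") to demonstrate the problem.

evince
First, I open page 3 of this PDF in evince, and evince complains: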

$ evince --page-label=3 fontspec.pdf

Error: Missing language pack for 'Adobe-Japan1' mapping
Error: Unknown font tag 'F5.1'
Error (7597): No font in show
Error: Unknown font tag 'F5.1'
Error (7630): No font in show
Error: Unknown font tag 'F5.1'
Error (7660): No font in show
Error: Unknown font tag 'F5.1'
...
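
The "Missing language pack for 'Adobe-Japan1' mapping" message suggests that poppler cannot find its CMap data for that character collection. Below is a minimal sketch of a check. It assumes poppler-data installs its CMaps under /usr/share/poppler/cMap/<collection>, which is the usual Debian/Ubuntu layout but is an assumption here. has_cmap_collection is a hypothetical helper name, not part of poppler.

```shell
#!/bin/sh
# has_cmap_collection [DATA_DIR] [COLLECTION]
# Succeeds if DATA_DIR/cMap/COLLECTION exists and contains at least one file.
# DATA_DIR defaults to /usr/share/poppler (assumed install location).
has_cmap_collection() {
    data_dir=${1:-/usr/share/poppler}
    collection=${2:-Adobe-Japan1}
    cmap_dir="$data_dir/cMap/$collection"
    [ -d "$cmap_dir" ] && [ -n "$(ls -A "$cmap_dir" 2>/dev/null)" ]
}

# Report the result for the default location and collection.
if has_cmap_collection; then
    echo "Adobe-Japan1 CMaps found"
else
    echo "Adobe-Japan1 CMaps missing; installing poppler-data may help"
fi
```

If the directory is missing, installing the poppler-data package mentioned in the bug report is the first thing to try before reprocessing the PDF.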

The rendering looks like this:

...and clearly, some glyph shapes are missing.

Adobe acroread
Just a note on how Adobe's Acrobat Reader for Linux behaves; the following command line:

$ ./Adobe/Reader9/bin/acroread /a "page=3" fontspec.pdf

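... produces no output at all to the terminal (for more on the /a switch, see Man page acroread), and the program displays the fonts with absolutely no problem.

Also, although I'd like to avoid a round trip through PostScript, note that acroread itself can be used to convert a PDF to PostScript: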

$ ./Adobe/Reader9/bin/acroread -v
9.5.1

$ ./Adobe/Reader9/bin/acroread -toPostScript \
-rotateAndCenter -choosePaperByPDFPageSize \
-start 3 -end 3 \
-level3 -transQuality 5 \
-optimizeForSpeed -saveVM \
fontspec.pdf ./

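Again, the command line above produces no output to the terminal; -optimizeForSpeed -saveVM are there because they apparently deal with fonts; the final argument ./ is the output directory (the output file is automatically named fontspec.ps).

Now evince can display the previously missing fonts in fontspec.ps, but it complains again: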

$ evince fontspec.ps 
GPL Ghostscript 9.02: Error: Font Renderer Plugin ( FreeType ) return code = -1
GPL Ghostscript 9.02: Error: Font Renderer Plugin ( FreeType ) return code = -1
...

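...and furthermore, all text appears to be flattened to curves in the PostScript, so the text in the .ps file can no longer be selected in evince (note that .ps files cannot be opened in acroread). However, you can convert this .ps back to a .pdf: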

$ pstopdf fontspec.ps   # note, `pstopdf` has no output filename option;
                        # it will automatically choose 'fontspec.pdf',
                        # and overwrite previous 'fontspec.pdf' in 
                        # the same directory

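...and now the text in the pstopdf output is selectable in evince, all the fonts are there, and evince no longer complains. But, as I noted, I'd like to avoid the round trip through PostScript files altogether.

display (from imagemagick)

We can also view the same page of the document with imagemagick's display (note that image panning from the commandline using 'display' apparently still isn't available, so I used -crop below to adjust the viewport):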

$ display -density 150 -crop 740x450+280+200 fontspec.pdf[2]
   **** Warning: considering '0000000000 00000 n' as a free entry.
...
   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> Mac OS X 10.5.4 Quartz PDFContext <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

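...which produces some ghostscript-ish errors, and this result:

... and clearly the fonts that evince failed to render now show up properly here with imagemagick's display.

ghostscript
Finally, we can use ghostscript as x11 viewer itself, to look at the same page of the same document: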

$ gs -sDEVICE=x11 -g740x450 -r150x150 -dFirstPage=3 \
-c '<</PageOffset [-120 520]>> setpagedevice' \
-f fontspec.pdf

GPL Ghostscript 9.02 (2011-03-30)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
Processing pages 3 through 74.
Page 3
>>showpage, press <return> to continue<<
^C

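...and this is the resulting output:

In summary: ghostscript (and apparently, by extension, imagemagick) seems able to find the missing font (or at least some replacement for it) and render the page with it, even though evince fails on the same document.

So I'd simply like to export a PDF version from ghostscript that only embeds the missing fonts, with no other processing; so I tried this: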

$ gs -dBATCH -dNOPAUSE -dSAFER  \
-dEmbedAllFonts -dSubsetFonts=true -dMaxSubsetPct=99 \
-dAutoFilterMonoImages=false \
-dAutoFilterGrayImages=false \
-dAutoFilterColorImages=false \
-dDownsampleColorImages=false \
-dDownsampleGrayImages=false \
-dDownsampleMonoImages=false \
-sDEVICE=pdfwrite \
-dFirstPage=3 -dLastPage=3 \
-sOutputFile=mypg3out.pdf -f fontspec.pdf

GPL Ghostscript 9.02 (2011-03-30)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
Processing pages 3 through 3.
Page 3

   **** This file had errors that were repaired or ignored.
   **** The file was produced by:
   **** >>>> Mac OS X 10.5.4 Quartz PDFContext <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

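...but it doesn't work: the output file mypg3out.pdf has exactly the same problem in evince as before.

Note: although I'd like to avoid the PostScript round trip, there is a good example of a gs command line for pdf to ps with font embedding here: (#277826) pdf - How to make GhostScript PS2PDF stop subsetting fonts; however, the same command-line switches applied to a .pdf to .pdf conversion seem to have no effect on the problem above.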

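Best answer

OK, I now understand this a bit better (though not completely), so I'll post a partial answer/comment here.

Essentially, this is not a problem of font embedding in the PDF; it is a problem of font mapping.

To demonstrate that, let's analyze mypg3out.pdf, which was extracted by gs in the OP (from page 3 of the fontspec.pdf document):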

$ pdffonts mypg3out.pdf 
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Error: Missing language pack for 'Adobe-Japan1' mapping
CAAAAA+Osaka-Mono-Identity-H         CID TrueType      yes yes yes     19  0
GBWBYF+CMMI9                         Type 1C           yes yes yes     28  0
FDFZUN+Skia-Regular_wght13333_wdth11999 TrueType          yes yes yes     16  0
ZRLTKK+Optima-Regular                TrueType          yes yes yes     30  0
ZFQZLD+FPLNeu-Bold                   Type 1C           yes yes yes      8  0
DDRFOG+FPLNeu-Italic                 Type 1C           yes yes no      22  0
HMZJAO+FPLNeu-Regular                Type 1C           yes yes yes     10  0
RDNKXT+FPLNeu-Regular                Type 1C           yes yes yes     32  0
GBWBYF+Skia-Regular_wght13333_wdth11999 TrueType          yes yes no      26  0

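As the output shows, all fonts are indeed embedded, so something else is the problem. (This would be harder to observe in the full fontspec.pdf, since it has a great many fonts and a great many error messages.)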
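
For a larger file, reading the pdffonts table by eye gets tedious. The following is a minimal sketch, not part of poppler, that filters pdffonts output for the two things of interest here: fonts that are not embedded, and fonts whose type starts with "CID". The function names are hypothetical, and the parsing assumes font names contain no spaces and that the emb column is the fifth field from the end of each row, as in the listing above.

```shell
#!/bin/sh
# Both functions read `pdffonts` output on stdin.

# list_cid_fonts: print the names of fonts whose type begins with "CID"
# (e.g. "CID TrueType"); for those rows the second field is "CID".
list_cid_fonts() {
    awk '$2 == "CID" { print $1 }'
}

# list_unembedded_fonts: print the names of fonts whose "emb" column is "no".
# Counting from the end (emb sub uni object-ID generation) avoids having to
# know how many words the type column contains.
list_unembedded_fonts() {
    awk 'NF >= 7 && $(NF-4) == "no" { print $1 }'
}
```

A possible invocation is `pdffonts mypg3out.pdf 2>/dev/null | list_cid_fonts`; for the listing above this would be expected to print only CAAAAA+Osaka-Mono-Identity-H, while list_unembedded_fonts would print nothing.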

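The key points here (I think) are:

there is only one "Error: Missing language pack for 'Adobe-Japan1' mapping" message; and

there is only one CID TrueType font, namely CAAAAA+Osaka-Mono-Identity-H

There seems to be a clear relationship between the CID TrueType font and the 'Adobe-Japan1' mapping error; I finally got this clarified via CID fonts - How to use Ghostscript: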

CID fonts are PostScript resources containing a large number of glyphs (e.g. glyphs for Far East languages, Chinese, Japanese and Korean). Please refer to the PostScript Language Reference, third edition, for details.

CID font resources are a different kind of PostScript resource from fonts. In particular, they cannot be used as regular fonts. CID font resources must first be combined with a CMap resource, which defines specific codes for glyphs, before it can be used as a font. This allows the reuse of a collection of glyphs with different encodings.

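That's all well and good, except that here we are dealing with PDF fonts, not PostScript fonts; let's demonstrate that.

For example, 5.3. Using Ghostscript To Preview Fonts - Making Fonts Available To Ghostscript - Font HowTo describes how the prfont.ps script installed with Ghostscript can be used to render font tables.

However, here it is easier to just use the approach from Listing Ghostscript Fonts [gs-devel], and to query a specific font with the resourcestatus operator; no special .ps script is needed: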

$ gs -o /dev/null -dNODISPLAY -f mypg3out.pdf \
-c 'currentpagedevice (*) {=} 100 string /Font resourceforall'
...
Processing pages 1 through 1.
Page 1
URWAntiquaT-RegularCondensed
Palatino-Italic
Hershey-Gothic-Italian
...

$ gs -o /dev/null -dNODISPLAY -f mypg3out.pdf \
-c '/TimesNewRoman findfont pop [/TimesNewRoman /Font resourcestatus]'
....
Processing pages 1 through 1.
Page 1
Can't find (or can't open) font file /usr/share/ghostscript/9.02/Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Can't find (or can't open) font file /usr/share/ghostscript/9.02/Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Querying operating system for font files...
Loading TimesNewRomanPSMT font from /usr/share/fonts/truetype/msttcorefonts/times.ttf... 2549340 1142090 3496416 1237949 1 done.

We get a list of fonts; however, these are the system fonts available to ghostscript, not the fonts embedded in the PDF!

(Basically,

gs -o /dev/null -dNODISPLAY -f mypg3out.pdf -c 'currentpagedevice (*) {=} 100 string /Font resourceforall' | grep -i osaka

returns nothing, and

-c '/CAAAAA+Osaka-Mono-Identity-H findfont pop [/CAAAAA+Osaka-Mono-Identity-H /Font resourcestatus]' ends with "Didn't find this font on the system! Substituting font Courier for CAAAAA+Osaka-Mono-Identity-H".)

To list the fonts in the PDF, the pdf_info.ps script file from Ghostscript can be used (it is not installed; it is in the source tree):

$ wget "http://git.ghostscript.com/?p=ghostpdl.git;a=blob_plain;f=gs/toolbin/pdf_info.ps" -O pdf_info.ps

$ gs -dNODISPLAY -q -sFile=mypg3out.pdf -dDumpFontsNeeded pdf_info.ps
...
No system fonts are needed.

$ gs -dNODISPLAY -q -sFile=mypg3out.pdf -dDumpFontsUsed -dShowEmbeddedFonts pdf_info.ps
...
Font or CIDFont resources used:
CAAAAA+Osaka-Mono
DDRFOG+FPLNeu-Italic
FDFZUN+Skia-Regular_wght13333_wdth11999
GBWBYF+CMMI9
GBWBYF+Skia-Regular_wght13333_wdth11999
GTIIKZ+Osaka-Mono
HMZJAO+FPLNeu-Regular
RDNKXT+FPLNeu-Regular
ZFQZLD+FPLNeu-Bold
ZRLTKK+Optima-Regular

So in the end we can observe CAAAAA+Osaka-Mono from within Ghostscript, though I don't know how to query more specific information about it from ghostscript.
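
Note that the list contains both CAAAAA+Osaka-Mono and GTIIKZ+Osaka-Mono. The six uppercase letters and the "+" are the subset tag that PDF producers prepend to the name of a subsetted font, so both entries refer to subsets of the same base font. Here is a minimal sketch for collapsing such a list to base names; base_font_names is a hypothetical helper name.

```shell
#!/bin/sh
# base_font_names: read font names on stdin (one per line), strip a leading
# six-uppercase-letter subset tag such as "CAAAAA+", and print each distinct
# base name once. LC_ALL=C keeps [A-Z] and the sort order locale-independent.
base_font_names() {
    LC_ALL=C sed 's/^[A-Z]\{6\}+//' | LC_ALL=C sort -u
}
```

Fed the font names from the pdf_info.ps listing above, this reduces the ten entries to seven distinct base fonts, with Osaka-Mono appearing once.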

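In the end, I guess my question boils down to: how could ghostscript be used to map the glyphs from an embedded CID font to a font with a different "encoding" (or "character map"?), one that doesn't require additional language files?

Addendum

I also tried these approaches:

with this, pdffonts no longer lists Osaka-Mono in its output, but it still complains "Error: Missing language pack for 'Adobe-Japan1' mapping":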

$ wget http://whalepdfviewer.googlecode.com/svn/trunk/cmaps/japanese/Adobe-Japan1-UCS2
$ gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH -f mypg3out.pdf Adobe-Japan1-UCS2

same as previously - this (via Ghostscript's "Use.htm") also makes Osaka-Mono disappear from pdffonts list:

gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH \
-c '/CIDSystemInfo << /Registry (Adobe) /Ordering (Unicode) /Supplement 1 >>' \
-f mypg3out.pdf

this crashes with Error: /undefinedresource in findresource:

gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH \
-c '/Osaka-Mono-Identity-H /H /CMap findresource [/Osaka-Mono-Identity /CIDFont findresource] == ' \
-f mypg3out.pdf

Finally, note that ghostscript may automatically use some of the .ps scripts it installs; for instance, you can locate gs_ttf.ps:

$ locate gs_ttf.ps
/usr/share/ghostscript/9.02/Resource/Init/gs_ttf.ps

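...and then, using

sudo nano $(locate gs_ttf.ps), you can add the statement (Hello from gs_ttf.ps\n) print at the start of the code; then, whenever one of the gs commands above is invoked, the printout will be visible on standard output.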

References
 Adding your own fonts - Fonts and font facilities supplied with Ghostscript 
 About "CIDFnmap" of Ghostscript - Features to support CJK CID-keyed in Ghostscript 
 Bug 689538 – GhostScript can not handle an embedded TrueType CID-Font 
 Bug 692589 – "Error CIDSystemInfo and CMap dict not compatible" when converting merged file to PDF/A - #1522 
 Adobe Forums: CMap resources versus PDF mapping resources:
    Please keep in mind that a CMap resource unidirectionally maps character codes to CIDs. Those other resources that Acrobat uses are best referred to as PDF mapping resources. Among them, there is a special category called ToUnicode mapping resources that unidirectionally map CIDs to UTF-16BE character codes
 Adobe CIDs and glyphs in CJK TrueType font 
 Ghostscript and Japanese TrueType font 
 Installation guide: GS and CID font 
 Debian -- Filelist of package poppler-data/sid/all

 
  
Regarding "pdf - Using Ghostscript to handle (remap) missing or problematic (CID/CJK) fonts in a PDF?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11093051/
