
pdf - Handling (remapping) missing/problematic (CID/CJK) fonts in a PDF with ghostscript?

Reposted. Author: 行者123. Updated: 2023-12-04 12:11:16

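In brief, I am dealing with a problematic PDF, which:

  • cannot be fully rendered in a document viewer like evince, because font information is missing;
  • can nevertheless be fully rendered by ghostscript.

    Thus, regardless of what ghostscript uses to fill in the blanks (maybe fallback glyphs, or a different method of accessing fonts), I would like to use ghostscript to produce ("extract") an output PDF in which almost nothing is changed except for the added font information, so that evince can render the document the same way ghostscript does.

    My question is this: is this possible at all? And if so, what command line would achieve it?

    Many thanks in advance for any answers,
    Cheers!

    Details:

    I am actually on an older Ubuntu 10.04, and I may be experiencing not a bug but an installation problem with evince (the poppler-data package is missing), as noted in Bug #386008 "Some fonts fail to display due to "Unknown font tag..."" : Bugs : "poppler" package : Ubuntu.

    However, that is exactly the situation I want to handle, so I will use the fontspec.pdf attached to that report ("PDF triggering the bug.") to demonstrate the problem.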
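The suspected installation problem (missing poppler-data) can be checked directly by looking for the CMap files of the character collection. The following is only a minimal sketch: the function name `has_cmap_collection` is made up for illustration, and the default base path `/usr/share/poppler` is an assumption based on the usual Debian/Ubuntu layout of poppler-data.

```shell
# Report whether a poppler-data style directory holds CMap files for a
# given character collection (e.g. Adobe-Japan1).
# ASSUMPTION: the default base path below is the usual Debian/Ubuntu
# location; pass a different base directory as the second argument if needed.
has_cmap_collection() {
    collection=$1
    base=${2:-/usr/share/poppler}
    dir="$base/cMap/$collection"
    # True only if the directory exists and is not empty.
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

if has_cmap_collection Adobe-Japan1; then
    echo "Adobe-Japan1 CMaps found"
else
    echo "Adobe-Japan1 CMaps missing (poppler-data may not be installed)"
fi
```

If the directory is missing, the "Missing language pack for 'Adobe-Japan1' mapping" message shown further down would be the expected symptom.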
    evince
    First, I open page 3 of this pdf in evince, and evince complains:
    $ evince --page-label=3 fontspec.pdf

    Error: Missing language pack for 'Adobe-Japan1' mapping
    Error: Unknown font tag 'F5.1'
    Error (7597): No font in show
    Error: Unknown font tag 'F5.1'
    Error (7630): No font in show
    Error: Unknown font tag 'F5.1'
    Error (7660): No font in show
    Error: Unknown font tag 'F5.1'
    ...

    The rendering looks like this:

    evince-pdf-missfont-render.png

    ...and it is obvious that some font shapes are missing.
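Because the same few messages repeat many times, it can help to condense the log before reading it. This is a minimal sketch, not part of the original post: the function name `summarize_errors` is made up, and it assumes the messages have the shape shown above and are written to stderr.

```shell
# Condense a poppler/evince error log read from stdin:
#  - strip leading whitespace,
#  - drop the numeric position in "Error (7597): ..." so repeats collapse,
#  - print "count<TAB>message", most frequent first.
summarize_errors() {
    sed -E 's/^[[:space:]]*//; s/^Error \([0-9]+\):/Error:/' |
        grep '^Error' | sort | uniq -c | sort -rn |
        awk '{ n = $1; $1 = ""; sub(/^ /, ""); print n "\t" $0 }'
}

# Possible usage (assumes evince writes these messages to stderr):
#   evince --page-label=3 fontspec.pdf 2>&1 | summarize_errors
```

On the excerpt above this would reduce the output to three distinct messages, making it easier to see that the 'Adobe-Japan1' message occurs once while the 'F5.1' messages repeat.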

    Adobe acroread
    Just a note on how Adobe's Acrobat Reader for Linux behaves; the following command line:
    $ ./Adobe/Reader9/bin/acroread /a "page=3" fontspec.pdf

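    ...generates no output to the terminal at all (see Man page acroread for more on the /a switch), and the program displays the fonts without any problem.

    Also, although I would like to avoid a round trip through PostScript, note that acroread itself can be used to convert a PDF to PostScript: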
    $ ./Adobe/Reader9/bin/acroread -v
    9.5.1

    $ ./Adobe/Reader9/bin/acroread -toPostScript \
    -rotateAndCenter -choosePaperByPDFPageSize \
    -start 3 -end 3 \
    -level3 -transQuality 5 \
    -optimizeForSpeed -saveVM \
    fontspec.pdf ./

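    Again, the command line above generates no output to the terminal; -optimizeForSpeed and -saveVM are there because they apparently deal with fonts; the last argument ./ is the output directory (the output file is automatically named fontspec.ps).

    Now evince can display the previously missing fonts in the fontspec.ps output, but again it complains: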
    $ evince fontspec.ps 
    GPL Ghostscript 9.02: Error: Font Renderer Plugin ( FreeType ) return code = -1
    GPL Ghostscript 9.02: Error: Font Renderer Plugin ( FreeType ) return code = -1
    ...

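    ...and additionally, all text appears to be flattened to curves in the PostScript, so text in the .ps file can no longer be selected in evince (note that the .ps file cannot be opened in acroread). However, you can convert this .ps back to .pdf: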
    $ pstopdf fontspec.ps   # note, `pstopdf` has no output filename option;
    # it will automatically choose 'fontspec.pdf',
    # and overwrite previous 'fontspec.pdf' in
    # the same directory

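    ...and now text in the pstopdf output can be selected in evince, all fonts are there, and evince no longer complains. However, as noted, I would like to avoid a round trip through PostScript files altogether.
    display (from imagemagick)

    We can also view a page of the same document with imagemagick's display (note that image panning from the commandline using 'display' is apparently still not available, so I used -crop below to adjust the viewport):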
    $ display -density 150 -crop 740x450+280+200 fontspec.pdf[2]
    **** Warning: considering '0000000000 00000 n' as a free entry.
    ...
    **** This file had errors that were repaired or ignored.
    **** The file was produced by:
    **** >>>> Mac OS X 10.5.4 Quartz PDFContext <<<<
    **** Please notify the author of the software that produced this
    **** file that it does not conform to Adobe's published PDF
    **** specification.

    ...which produces some ghostscript-ish errors, and a result like this:

    imagemagick-display-pdf.png

    ...where it is obvious that the fonts evince failed to render are now shown properly by imagemagick's display.
    ghostscript
    Finally, we can use ghostscript as x11 viewer itself, to view the same page of the same document:
    $ gs -sDEVICE=x11 -g740x450 -r150x150 -dFirstPage=3 \
    -c '<</PageOffset [-120 520]>> setpagedevice' \
    -f fontspec.pdf

    GPL Ghostscript 9.02 (2011-03-30)
    Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
    This software comes with NO WARRANTY: see the file PUBLIC for details.
    **** Warning: considering '0000000000 00000 n' as a free entry.
    **** Warning: considering '0000000000 00000 n' as a free entry.
    **** Warning: considering '0000000000 00000 n' as a free entry.
    **** Warning: considering '0000000000 00000 n' as a free entry.
    **** Warning: considering '0000000000 00000 n' as a free entry.
    **** Warning: considering '0000000000 00000 n' as a free entry.
    **** Warning: considering '0000000000 00000 n' as a free entry.
    Processing pages 3 through 74.
    Page 3
    >>showpage, press <return> to continue<<
    ^C

    ...and the result of this output is:

    ghostscript-pdf-view.png

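    In conclusion: ghostscript (and apparently, by extension, imagemagick) seems able to find the missing font (or at least some substitute for it) and render the page with it, even though evince fails on the same document.

    I would therefore like to simply export a PDF version from ghostscript that only has the missing fonts embedded, with no other processing; so I try this: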
    $ gs -dBATCH -dNOPAUSE -dSAFER  \
    -dEmbedAllFonts -dSubsetFonts=true -dMaxSubsetPct=99 \
    -dAutoFilterMonoImages=false \
    -dAutoFilterGrayImages=false \
    -dAutoFilterColorImages=false \
    -dDownsampleColorImages=false \
    -dDownsampleGrayImages=false \
    -dDownsampleMonoImages=false \
    -sDEVICE=pdfwrite \
    -dFirstPage=3 -dLastPage=3 \
    -sOutputFile=mypg3out.pdf -f fontspec.pdf

    GPL Ghostscript 9.02 (2011-03-30)
    Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
    This software comes with NO WARRANTY: see the file PUBLIC for details.
    **** Warning: considering '0000000000 00000 n' as a free entry.
    **** Warning: considering '0000000000 00000 n' as a free entry.
    **** Warning: considering '0000000000 00000 n' as a free entry.
    **** Warning: considering '0000000000 00000 n' as a free entry.
    **** Warning: considering '0000000000 00000 n' as a free entry.
    **** Warning: considering '0000000000 00000 n' as a free entry.
    **** Warning: considering '0000000000 00000 n' as a free entry.
    Processing pages 3 through 3.
    Page 3

    **** This file had errors that were repaired or ignored.
    **** The file was produced by:
    **** >>>> Mac OS X 10.5.4 Quartz PDFContext <<<<
    **** Please notify the author of the software that produced this
    **** file that it does not conform to Adobe's published PDF
    **** specification.

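    ...but it does not work: the output file mypg3out.pdf suffers from exactly the same problem in evince as before.

    Note: although I would like to avoid the PostScript round trip, a good example of a gs command line for pdf to ps with font embedding is here: (#277826) pdf - How to make GhostScript PS2PDF stop subsetting fonts; but the same command-line switches, applied to a .pdf to .pdf conversion, seem to have no effect on the problem above.

    Best Answer

    OK, I have a somewhat better (but not complete) understanding of this now, so I will post a partial answer/comment here.

    Essentially, this is not a problem of font embedding in the PDF; it is a problem of font mapping.

    To demonstrate this, let's analyze mypg3out.pdf, which was extracted by gs in the OP (page 3 of the fontspec.pdf document):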

    $ pdffonts mypg3out.pdf 
    name                                 type              emb sub uni object ID
    ------------------------------------ ----------------- --- --- --- ---------
    Error: Missing language pack for 'Adobe-Japan1' mapping
    CAAAAA+Osaka-Mono-Identity-H         CID TrueType      yes yes yes     19  0
    GBWBYF+CMMI9                         Type 1C           yes yes yes     28  0
    FDFZUN+Skia-Regular_wght13333_wdth11999 TrueType          yes yes yes     16  0
    ZRLTKK+Optima-Regular                TrueType          yes yes yes     30  0
    ZFQZLD+FPLNeu-Bold                   Type 1C           yes yes yes      8  0
    DDRFOG+FPLNeu-Italic                 Type 1C           yes yes no      22  0
    HMZJAO+FPLNeu-Regular                Type 1C           yes yes yes     10  0
    RDNKXT+FPLNeu-Regular                Type 1C           yes yes yes     32  0
    GBWBYF+Skia-Regular_wght13333_wdth11999 TrueType          yes yes no      26  0

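    As the output shows, all fonts are indeed embedded, so something else is the problem. (This would be harder to see in the full fontspec.pdf, since it contains a large number of fonts and produces a large number of error messages.)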
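To pick the CID-keyed fonts out of a longer pdffonts listing, the "type" column can be filtered. A minimal sketch: `list_cid_fonts` is a made-up helper name, and it assumes font names contain no spaces, so that the type is the second whitespace-separated field.

```shell
# Print the names of fonts whose pdffonts type starts with "CID"
# (e.g. "CID TrueType", "CID Type 0C"). Reads pdffonts output on stdin.
# ASSUMPTION: font names contain no spaces. Header, rule and "Error:"
# lines never have "CID" as their second field, so they are skipped.
list_cid_fonts() {
    awk '$2 == "CID" { print $1 }'
}

# Demonstration on three lines copied from the listing above:
list_cid_fonts <<'EOF'
Error: Missing language pack for 'Adobe-Japan1' mapping
CAAAAA+Osaka-Mono-Identity-H CID TrueType yes yes yes 19 0
GBWBYF+CMMI9 Type 1C yes yes yes 28 0
EOF
# prints: CAAAAA+Osaka-Mono-Identity-H
```

With a real file this would be used as `pdffonts mypg3out.pdf 2>/dev/null | list_cid_fonts`.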

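    The key points here (I think) are:
  • there is only one "Error: Missing language pack for 'Adobe-Japan1' mapping" message; and
  • there is only one CID TrueType font, namely CAAAAA+Osaka-Mono-Identity-H;
  • there seems to be a clear relationship between CID TrueType and the "Adobe-Japan1" mapping error; I finally got this clarified through CID fonts - How to use Ghostscript: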

    CID fonts are PostScript resources containing a large number of glyphs (e.g. glyphs for Far East languages, Chinese, Japanese and Korean). Please refer to the PostScript Language Reference, third edition, for details.

    CID font resources are a different kind of PostScript resource from fonts. In particular, they cannot be used as regular fonts. CID font resources must first be combined with a CMap resource, which defines specific codes for glyphs, before it can be used as a font. This allows the reuse of a collection of glyphs with different encodings.



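    All well and good, except that here we are dealing with PDF fonts, not PostScript fonts; let's demonstrate that.

    For example, 5.3. Using Ghostscript To Preview Fonts - Making Fonts Available To Ghostscript - Font HowTo describes how the prfont.ps script installed with Ghostscript can be used to render a font table.

    Here, however, it is easier to just follow Listing Ghostscript Fonts [gs-devel], and to query a specific font with the resourcestatus operator; no special .ps script is needed: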
    $ gs -o /dev/null -dNODISPLAY -f mypg3out.pdf \
    -c 'currentpagedevice (*) {=} 100 string /Font resourceforall'
    ...
    Processing pages 1 through 1.
    Page 1
    URWAntiquaT-RegularCondensed
    Palatino-Italic
    Hershey-Gothic-Italian
    ...

    $ gs -o /dev/null -dNODISPLAY -f mypg3out.pdf \
    -c '/TimesNewRoman findfont pop [/TimesNewRoman /Font resourcestatus]'
    ....
    Processing pages 1 through 1.
    Page 1
    Can't find (or can't open) font file /usr/share/ghostscript/9.02/Resource/Font/TimesNewRomanPSMT.
    Can't find (or can't open) font file TimesNewRomanPSMT.
    Can't find (or can't open) font file /usr/share/ghostscript/9.02/Resource/Font/TimesNewRomanPSMT.
    Can't find (or can't open) font file TimesNewRomanPSMT.
    Querying operating system for font files...
    Loading TimesNewRomanPSMT font from /usr/share/fonts/truetype/msttcorefonts/times.ttf... 2549340 1142090 3496416 1237949 1 done.

    We get a font listing; however, these are the system fonts available to ghostscript, not the fonts embedded in the PDF!

    (Basically,
  • gs -o /dev/null -dNODISPLAY -f mypg3out.pdf -c 'currentpagedevice (*) {=} 100 string /Font resourceforall' | grep -i osaka will return nothing, and
  • -c '/CAAAAA+Osaka-Mono-Identity-H findfont pop [/CAAAAA+Osaka-Mono-Identity-H /Font resourcestatus]' will end with "Didn't find this font on the system! Substituting font Courier for CAAAAA+Osaka-Mono-Identity-H".)

    To list the fonts in a PDF, the pdf_info.ps script file from Ghostscript (not installed; it is in the sources) can be used:
    $ wget "http://git.ghostscript.com/?p=ghostpdl.git;a=blob_plain;f=gs/toolbin/pdf_info.ps" -O pdf_info.ps

    $ gs -dNODISPLAY -q -sFile=mypg3out.pdf -dDumpFontsNeeded pdf_info.ps
    ...
    No system fonts are needed.

    $ gs -dNODISPLAY -q -sFile=mypg3out.pdf -dDumpFontsUsed -dShowEmbeddedFonts pdf_info.ps
    ...
    Font or CIDFont resources used:
    CAAAAA+Osaka-Mono
    DDRFOG+FPLNeu-Italic
    FDFZUN+Skia-Regular_wght13333_wdth11999
    GBWBYF+CMMI9
    GBWBYF+Skia-Regular_wght13333_wdth11999
    GTIIKZ+Osaka-Mono
    HMZJAO+FPLNeu-Regular
    RDNKXT+FPLNeu-Regular
    ZFQZLD+FPLNeu-Bold
    ZRLTKK+Optima-Regular

    So finally we can observe CAAAAA+Osaka-Mono from within Ghostscript, although I have no idea how to query more specific information about it from ghostscript.
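The six capital letters and "+" in front of each name are the subset tag that marks an embedded font subset. Stripping the tag shows which base fonts occur more than once. This is a minimal sketch, with `base_font_names` as a made-up helper name; it assumes one font name per input line, as in the pdf_info.ps listing above.

```shell
# Strip the subset tag (six uppercase letters followed by "+") from each
# font name read on stdin, then print "base-name<TAB>number-of-subsets".
base_font_names() {
    sed -E 's/^[[:space:]]*[A-Z]{6}\+//' | sort | uniq -c |
        awk '{ print $2 "\t" $1 }'
}

# Demonstration on names taken from the listing above:
base_font_names <<'EOF'
CAAAAA+Osaka-Mono
GBWBYF+CMMI9
GTIIKZ+Osaka-Mono
HMZJAO+FPLNeu-Regular
RDNKXT+FPLNeu-Regular
EOF
```

In this sample Osaka-Mono and FPLNeu-Regular are each reported with a count of 2, which matches the two differently tagged subsets of each in the listing.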

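    Finally, I guess my question boils down to this: how can ghostscript be used to map the glyphs of an embedded CID font to a font with a different "encoding" (or "character map"?), one that requires no additional language files?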

    Addendum

    I have also tried these approaches:
  • with this, pdffonts no longer lists Osaka-Mono in its output, but it still complains "Error: Missing language pack for 'Adobe-Japan1' mapping":
    $ wget http://whalepdfviewer.googlecode.com/svn/trunk/cmaps/japanese/Adobe-Japan1-UCS2
    $ gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH -f mypg3out.pdf Adobe-Japan1-UCS2
  • same result as the previous one: this (via Ghostscript's "Use.htm") also makes Osaka-Mono disappear from the pdffonts list:
    gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH \
    -c '/CIDSystemInfo << /Registry (Adobe) /Ordering (Unicode) /Supplement 1 >>' \
    -f mypg3out.pdf
  • this crashes with Error: /undefinedresource in findresource:
    gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH \
    -c '/Osaka-Mono-Identity-H /H /CMap findresource [/Osaka-Mono-Identity /CIDFont findresource] == ' \
    -f mypg3out.pdf
  • Finally, note that ghostscript may automatically use some of the .ps scripts it installs; for instance, you can locate gs_ttf.ps:

    $ locate gs_ttf.ps
    /usr/share/ghostscript/9.02/Resource/Init/gs_ttf.ps

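    ...and then, using sudo nano $(locate gs_ttf.ps), you can add the statement (Hello from gs_ttf.ps\n) print at the beginning of the code; after that, whenever one of the gs commands above is invoked, the printout will be visible on stdout.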

    References
  • Adding your own fonts - Fonts and font facilities supplied with Ghostscript
  • About "CIDFnmap" of Ghostscript - Features to support CJK CID-keyed in Ghostscript
  • Bug 689538 – GhostScript can not handle an embedded TrueType CID-Font
  • Bug 692589 – "Error CIDSystemInfo and CMap dict not compatible" when converting merged file to PDF/A - #1522
  • Adobe Forums: CMap resources versus PDF mapping resources :
    Please keep in mind that a CMap resource unidirectionally maps character codes to CIDs. Those other resources that Acrobat uses are best referred to as PDF mapping resources. Among them, there is a special category called ToUnicode mapping resources that unidirectionally map CIDs to UTF-16BE character codes
  • Adobe CIDs and glyphs in CJK TrueType font
  • Ghostscript and Japanese TrueType font
  • Installation guide: GS and CID font
  • Debian -- Filelist of package poppler-data/sid/all
  • Regarding pdf - Handling (remapping) missing/problematic (CID/CJK) fonts in a PDF with ghostscript?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11093051/
