php - Reading a PDF: strange encoding with the TJ operator

Reposted by: 行者123. Updated: 2023-12-04 02:16:31


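I am currently trying to extract text from a PDF document, but I have run into some odd cases involving the Tj operator. Usually I handle cases like this: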

   Tc (SOME_TEXT) TJ

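Now I have run into a case like this:

which converts to the string '52249.64'. Now I have run into yet another strange case:

Td  (
        \t\004\007\020\007\016\016\026\020
    )
Tj

The only information I could find is that a string passed to Tj is always interpreted according to the font's encoding or CMap. (In this case, I would expect it to be a CIDFont with a CMap.)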

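I still don't understand it. Are these values indices that point to offsets in some kind of character array, or do I have to decode them? Thanks!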

Best answer

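As @Paulo already pointed out in his comment, you should first consult the PDF specification, currently ISO 32000-1, of which Adobe provides a free copy.

On the topic of text extraction, see section 9.10, "Extraction of Text Content", and in particular: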

9.10.2 Mapping Character Codes to Unicode Values

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):

If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity-H and Identity-V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:

a) Map the character code to a character identifier (CID) according to the font’s CMap.

b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.

c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry-ordering-UCS2 (for example, Adobe-Japan1-UCS2).

d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).

e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.

NOTE Type 0 fonts whose descendant CIDFonts use the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection (as specified in the CIDSystemInfo dictionary) shall have a supplement number corresponding to the version of PDF supported by the conforming reader. See Table 3 for a list of the character collections corresponding to a given PDF version. (Other supplements of these character collections can be used, but if the supplement is higher-numbered than the one corresponding to the supported PDF version, only the CIDs in the latter supplement are considered to be standard CIDs.)

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
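The first method in the quoted list, a ToUnicode CMap lookup, can be sketched in Python roughly as follows. This is only a minimal illustration under several assumptions: the CMap text is a made-up sample and is not taken from the PDF in the question, the names `parse_tounicode` and `to_text` are hypothetical, character codes are assumed to be one byte wide, and the array form of `bfrange` is not handled.

```python
import re

# Hypothetical ToUnicode CMap body, invented for illustration only.
SAMPLE_CMAP = """
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfchar
<01> <0048>
<02> <0069>
endbfchar
1 beginbfrange
<10> <12> <0030>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Build a {character code: Unicode string} table from bfchar/bfrange."""
    mapping = {}
    # bfchar: one source code maps to one UTF-16BE encoded destination.
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, section):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange: consecutive codes map to consecutive Unicode values.
    # Assumption: the destination is a single BMP code point.
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        triple = HEX + r"\s*" + HEX + r"\s*" + HEX
        for lo, hi, dst in re.findall(triple, section):
            start = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping


def to_text(codes, mapping):
    """Map one-byte character codes to text; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


table = parse_tounicode(SAMPLE_CMAP)
print(to_text(b"\x01\x02\x10\x12", table))  # Hi02
```

For a composite font the codes are usually wider than one byte, so a real extractor would first split the string according to the CMap's codespace ranges before doing this lookup.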

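If any of the terms used here are unfamiliar, read up on them in ISO 32000-1 or in the other specifications it references.

So, to get acceptable text extraction results, make your text extractor support the methods described in that section.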
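Before any of those methods can be applied, the string operand of Tj has to be turned into raw character codes. The backslash sequences in the question are ordinary PDF literal-string escapes (`\t` and up to three octal digits), not text in themselves. The sketch below decodes them in Python. It is an illustration only: the function name `decode_pdf_literal` is hypothetical, the input is assumed to hold only characters in the 0-255 range, the layout whitespace shown around the string in the question is left out, and normalisation of unescaped line breaks inside the string is not handled.

```python
def decode_pdf_literal(body):
    """Decode the text between the parentheses of a PDF literal string."""
    simple = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12,
              "(": 40, ")": 41, "\\": 92}
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ord(ch))
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break  # a trailing lone backslash is ignored
        ch = body[i]
        if ch in simple:
            out.append(simple[ch])
            i += 1
        elif ch in "01234567":
            # One to three octal digits; overflow beyond one byte is dropped.
            j = i
            while j < len(body) and j - i < 3 and body[j] in "01234567":
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif ch in "\r\n":
            # Backslash followed by end-of-line is a line continuation.
            i += 1
            if ch == "\r" and i < len(body) and body[i] == "\n":
                i += 1
        else:
            # Unknown escape: the backslash is dropped, the character kept.
            out.append(ord(ch))
            i += 1
    return bytes(out)


raw = r"\t\004\007\020\007\016\016\026\020"
codes = list(decode_pdf_literal(raw))
print(codes)  # [9, 4, 7, 16, 7, 14, 14, 22, 16]
```

These numbers are character codes in the current font. What they mean depends entirely on that font's encoding, ToUnicode CMap, or CMap, which is why the lookup methods quoted above are needed.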

This question, "php - Reading a PDF: strange encoding with the TJ operator", corresponds to a similar question on Stack Overflow: https://stackoverflow.com/questions/33412329/


