python - struct.error: unpack requires a string argument of length 16

Reposted by: 太空狗 · Updated: 2023-10-29 21:19:22

When processing a PDF file (2.pdf) with pdfminer (pdf2txt.py), I get the following error:

pdf2txt.py 2.pdf 

Traceback (most recent call last):
  File "/usr/local/bin/pdf2txt.py", line 115, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/usr/local/bin/pdf2txt.py", line 109, in main
    interpreter.process_page(page)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 843, in render_contents
    self.init_resources(resources)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 347, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 195, in get_font
    font = self.get_font(None, subspec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 186, in get_font
    font = PDFCIDFont(self, spec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 654, in __init__
    StringIO(self.fontfile.get_data()))
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
    (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16

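A similar file (1.pdf), however, causes no problems.

I can't find any information about this error. I opened an issue on the pdfminer GitHub repository, but it has not been answered yet. Can someone explain why this happens? What can I do to parse 2.pdf?

Update: after installing pdfminer directly from the GitHub repository, I get a similar error, with BytesIO in place of StringIO.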

    $ pdf2txt.py 2.pdf 
Traceback (most recent call last):
  File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 116, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 110, in main
    interpreter.process_page(page)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 839, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 850, in render_contents
    self.init_resources(resources)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 204, in get_font
    font = self.get_font(None, subspec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 195, in get_font
    font = PDFCIDFont(self, spec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 665, in __init__
    BytesIO(self.fontfile.get_data()))
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 386, in __init__
    (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16

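Best answer

TL;DR

Thanks to @mkl and @hynecker for the extra info... With that I can confirm this is a bug in pdfminer and in your PDF. Whenever pdfminer tries to get an embedded file stream (e.g. a font definition), it picks up the last one in the file before an endobj. Sadly, not all PDFs rigorously add the end tag, so pdfminer should be resilient to this.

Quick fix for this issue

I've created a patch, which has been submitted as a pull request on GitHub. See https://github.com/euske/pdfminer/pull/159 .

Detailed diagnosis

As mentioned in the other answers, the reason you are seeing this is that pdfminer does not get the expected number of bytes from the stream while it is unpacking the data. But why?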
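To make the failure mode concrete, here is a minimal sketch of what happens when `struct.unpack` is handed fewer bytes than its format requires. Only the format string `'>4sLLL'` comes from the traceback; the helper name `read_record` and the sample values are made up for illustration. The wording of the exception message differs between Python versions, so the sketch only checks the exception type.

```python
import struct
from io import BytesIO

RECORD_FORMAT = '>4sLLL'  # format string taken from the traceback

# Big-endian, standard sizes: a 4-byte tag plus three 4-byte unsigned ints.
record_size = struct.calcsize(RECORD_FORMAT)


def read_record(fp):
    """Read one record (hypothetical helper); raises struct.error on a short read."""
    return struct.unpack(RECORD_FORMAT, fp.read(record_size))


# A well-formed 16-byte record unpacks cleanly (the values are made up).
good = BytesIO(struct.pack(RECORD_FORMAT, b'cmap', 0, 28, 100))
tag, checksum, offset, length = read_record(good)

# Only 10 bytes are available: read() quietly returns a short result,
# and it is unpack() that raises.
short = BytesIO(b'0123456789')
try:
    read_record(short)
    short_read_failed = False
except struct.error:
    short_read_failed = True
```

Note that `fp.read(16)` itself never fails at the end of the data; it just returns whatever is left, which is why the error only surfaces inside `struct.unpack`.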

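As you can see in your stack trace, pdfminer (rightly) spots that it has a CID font to process. It then goes on to process the embedded font file as a TrueType font (in pdffont.py). It tries to parse the associated stream (stream ID 18) by reading a set of binary tables.

This doesn't work for 2.pdf because that stream is a text stream. You can see this by running dumppdf -b -i 18 2.pdf. Here is the start of it: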

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0
>> def /CMapName /Adobe-Identity-UCS def
...

So, garbage in, garbage out... Is this a bug in your file or in pdfminer? Well, the fact that other readers can handle the file made me suspicious.
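The "garbage in" step can be reproduced without pdfminer. The sketch below is a simplified reconstruction, not pdfminer's actual code: it assumes the standard sfnt table directory layout (a 4-byte version tag, four 16-bit header fields, then one 16-byte record per table), and the function name `read_table_directory` is hypothetical. It feeds the first lines of the text stream quoted above to that parser.

```python
import struct
from io import BytesIO


def read_table_directory(fp):
    """Parse an sfnt (TrueType) table directory from a binary stream."""
    fp.read(4)  # sfnt version tag, ignored in this sketch
    (ntables, _search, _selector, _shift) = struct.unpack('>HHHH', fp.read(8))
    tables = {}
    for _ in range(ntables):
        # Same 16-byte record format as in the traceback.
        (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
        tables[name] = (offset, length)
    return tables


# First lines of the stream dump quoted above, as bytes.
cmap_text = (b"/CIDInit /ProcSet findresource begin\n"
             b"12 dict begin\n"
             b"begincmap\n")

# Bytes 4-5 are the ASCII characters "In"; read as a big-endian uint16
# they become the table count: 0x496E = 18798.
bogus_ntables = struct.unpack('>H', cmap_text[4:6])[0]

try:
    read_table_directory(BytesIO(cmap_text))
    parse_failed = False
except struct.error:
    parse_failed = True
```

Because the text is misread as a table count in the thousands, the loop keeps consuming 16-byte records until the data runs out, at which point the short read triggers `struct.error`.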

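Digging around a bit more, I found that this stream is identical to stream ID 17, which is the cmap for the ToUnicode field. A quick look at the PDF spec shows that these cannot be the same.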
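One way to check for this kind of duplication programmatically is sketched below. The helper `streams_identical` is hypothetical; it only assumes a document object that exposes `getobj(objid)` returning stream objects with a `get_data()` method, which is how pdfminer's `PDFDocument` and `PDFStream` are understood to behave. The import paths shown in the comment are an assumption and vary between pdfminer releases.

```python
def streams_identical(doc, objid_a, objid_b):
    """Return True if two stream objects decode to exactly the same bytes."""
    data_a = doc.getobj(objid_a).get_data()
    data_b = doc.getobj(objid_b).get_data()
    return data_a == data_b


# Possible usage with pdfminer (import paths are an assumption):
#   from pdfminer.pdfparser import PDFParser
#   from pdfminer.pdfdocument import PDFDocument
#   with open('2.pdf', 'rb') as fp:
#       doc = PDFDocument(PDFParser(fp))
#       print(streams_identical(doc, 17, 18))
```

If a font file stream and a ToUnicode cmap stream compare equal, as reported here for IDs 18 and 17, that points at the stream lookup rather than at the font parser.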

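Digging into the code further, I found that all streams are getting the same data. Oops! This is the bug. The cause appears to be related to the fact that this PDF is missing some end tags, as noted by @hynecker.

The fix is to return the right data for each stream. Any other fix that just swallows the error will result in bad data being used for all streams, resulting in, for example, incorrect font definitions.

I believe the attached patch will fix your problem and should be safe to use in general.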

Regarding "python - struct.error: unpack requires a string argument of length 16", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40158637/
