Python and zlib: Terribly slow decompressing concatenated streams

Reposted by: 太空宇宙 · Updated: 2023-11-04 09:09:57

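I've been supplied with a compressed file that contains multiple individual streams of compressed XML. The compressed file is 833 MB.

If I try to decompress it as a single object, I only get the first stream (about 19 kB).

I've modified the following code, supplied as an answer to an older question, to decompress each stream and write it to a file: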

import zlib

# Binary mode: decompress() returns bytes
outfile = open('output.xml', 'wb')

def zipstreams(filename):
    """Decompress every zlib stream found in the file and append it to outfile."""
    with open(filename, 'rb') as fh:
        data = fh.read()
    i = 0
    print("got it")
    while i < len(data):
        try:
            zo = zlib.decompressobj()
            # data[i:] copies the whole remainder of the file on every pass
            dat = zo.decompress(data[i:])
            outfile.write(dat)
            zo.flush()
            i += len(data[i:]) - len(zo.unused_data)
        except zlib.error:
            # Not the start of a stream: move on one byte and try again
            i += 1
    outfile.close()

zipstreams('payload')

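This code runs and produces the desired result (all the XML data decompressed into a single file). The problem is that it takes several days to finish!

Even though there are tens of thousands of streams in the compressed file, it still seems like this should be a much faster process. Roughly 8 days to decompress 833 MB (an estimated 3 GB raw) suggests that I'm doing something very wrong.

Is there another way to do this more efficiently, or is the slow speed the result of a read-decompress-write bottleneck that I'm stuck with?

Thanks for any pointers or suggestions!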

Best answer

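It's hard to say very much without more specific knowledge of the file format you're actually dealing with, but it's clear that your algorithm's handling of substrings is quadratic: every pass through the loop slices `data[i:]`, which copies the entire remainder of the file. That is not a good thing when you've got tens of thousands of streams. So let's see what we know:

You say that the vendor states that they are

using the standard zlib compression library. These are the same compression routines on which the gzip utilities are built.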

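From this we can conclude that the component streams are in raw zlib format and are not encapsulated in a gzip wrapper (or a PKZIP archive, or anything else). The authoritative documentation on the ZLIB format is here: https://www.rfc-editor.org/rfc/rfc1950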
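
For reference, RFC 1950 puts a two-byte header (CMF, then FLG) at the front of every ZLIB stream, whereas gzip members start with the magic bytes 1f 8b. A minimal sketch of a plausibility check for such a header follows. The helper name `looks_like_zlib_header` is made up for illustration, and passing the check does not prove that a valid stream actually follows.

```python
import zlib

def looks_like_zlib_header(two_bytes):
    """Return True if `two_bytes` is a plausible CMF/FLG pair per RFC 1950."""
    if len(two_bytes) != 2:
        return False
    cmf, flg = two_bytes[0], two_bytes[1]
    compression_method = cmf & 0x0F  # low nibble of CMF; 8 means "deflate"
    cinfo = cmf >> 4                 # window size code; values above 7 are not allowed
    # RFC 1950: CMF*256 + FLG must be a multiple of 31
    return compression_method == 8 and cinfo <= 7 and (cmf * 256 + flg) % 31 == 0

# The first two bytes of anything produced by zlib.compress() should pass
print(looks_like_zlib_header(zlib.compress(b"hello")[:2]))
```

A check like this could help when hunting for the header bytes in an unknown file, but a two-byte pattern can also occur by chance inside compressed data.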

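So let's assume that your file is exactly as you describe it: a 32-byte header, followed by raw ZLIB streams concatenated together, without anything else in between. (Edit: that's not the case, after all.)

Python's zlib documentation provides a Decompress class that is actually pretty well suited to churning through your file. It includes an attribute unused_data whose documentation states clearly that: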

The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.

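So, this is what you can do: write a loop that reads in data, say, one block at a time (there's no need to even read the entire 800 MB file into memory). Push each block to the Decompress object and check the unused_data attribute. When it becomes non-empty, you've got a complete object. Write it to disk, create a new decompression object and initialize it with the unused_data from the last one. This just might work (untested, so check for correctness).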
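
The loop described above could look roughly like the following sketch. It assumes Python 3.3 or later (for the `eof` attribute of decompression objects) and streams that sit back to back with nothing in between. The names `iter_zlib_streams` and `chunk_size` are illustrative and not part of the original answer.

```python
import io
import zlib

def iter_zlib_streams(source, chunk_size=4096):
    """Yield the decompressed bytes of each zlib stream read from `source`.

    Assumption: streams are concatenated with no other data between them.
    """
    buf = b""
    while True:
        if not buf:
            buf = source.read(chunk_size)
            if not buf:
                return  # clean end of input between two streams
        decomp = zlib.decompressobj()
        parts = []
        while True:
            parts.append(decomp.decompress(buf))
            if decomp.eof:
                # Bytes past the end of this stream belong to the next one
                buf = decomp.unused_data
                break
            buf = source.read(chunk_size)
            if not buf:
                raise EOFError("input ended in the middle of a stream")
        yield b"".join(parts)

# Small self-check on in-memory data
payloads = [b"<a>1</a>" * 100, b"<b>two</b>" * 5000]
blob = b"".join(zlib.compress(p) for p in payloads)
print(list(iter_zlib_streams(io.BytesIO(blob))) == payloads)
```

Testing `eof` instead of waiting for unused_data to become non-empty also covers a stream that ends exactly on a block boundary. Each block is handed to zlib once, so the work grows with the file size rather than with its square.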

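Edit: Since you do have other data in your data stream, I've added a routine that aligns to the start of the next ZLIB stream. You'll need to find and fill in the two-byte sequence that identifies a ZLIB stream in your data. (Feel free to use your old code to discover it.) While there's no fixed ZLIB header in general, it should be the same for each stream, since it consists of protocol options and flags, which are presumably the same for the entire run.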

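import zlib

# FILL IN: ZHEAD is the two bytes (CMF, then FLG) that start each ZLIB stream
# in the input. CMF and FLG are placeholders you must define yourself; for
# example, zlib's default settings produce b'\x78\x9c'.
ZHEAD = CMF+FLG

def findstart(header, buf, source):
    """Find `header` in bytes `buf`, reading more from `source` if necessary.

    Returns `buf` trimmed to start at `header`, or b'' if EOF is reached first.
    """
    while buf.find(header) == -1:
        more = source.read(2**12)
        if len(more) == 0:  # EOF without finding the header
            return b''
        buf += more

    offset = buf.find(header)
    return buf[offset:]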

You can then advance to the start of the next stream. I've added a try/except pair since the same byte sequence might occur outside a stream:

datafile = 'payload'                # input file name, as in the question
outfile = open('output.xml', 'wb')  # binary mode: decompress() returns bytes

source = open(datafile, 'rb')
skip_ = source.read(32)  # Skip non-zlib header

buf = b''
while True:
    decomp = zlib.decompressobj()
    # Find the start of the next stream
    buf = findstart(ZHEAD, buf, source)
    try:
        stream = decomp.decompress(buf)
    except zlib.error:
        print("Spurious match(?) at output offset %d. Skipping 2 bytes"
              % outfile.tell())
        buf = buf[2:]
        continue

    # Read until zlib decides it's seen a complete stream
    block = buf  # keeps the EOF test below valid if the loop body never runs
    while decomp.unused_data == b'':
        block = source.read(2**12)
        if len(block) > 0:
            stream += decomp.decompress(block)
        else:
            break  # We've reached EOF

    outfile.write(stream)
    buf = decomp.unused_data  # Save for the next stream
    if len(block) == 0:
        break  # EOF

outfile.close()
source.close()

PS 1. If I were you, I'd write each XML stream into a separate file.

PS 2. You can test whatever you do on the first MB of your file, until you get adequate performance.
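
Both postscripts can be sketched with two small helpers. The function names, the `stream_NNNNNN.xml` naming scheme and the sample file path are assumptions made for illustration, not something the original answer specifies.

```python
import os

def copy_head(src_path, dst_path, nbytes=2**20):
    """Copy the first `nbytes` bytes (1 MB by default) of src_path to dst_path.

    Returns the number of bytes actually copied.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        head = src.read(nbytes)
        dst.write(head)
    return len(head)

def write_streams(streams, out_dir, prefix="stream"):
    """Write each item of `streams` (bytes) to its own numbered .xml file.

    Returns the list of paths written, in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for n, payload in enumerate(streams, start=1):
        path = os.path.join(out_dir, "%s_%06d.xml" % (prefix, n))
        with open(path, "wb") as fh:
            fh.write(payload)
        paths.append(path)
    return paths
```

For example, `copy_head('payload', 'payload.sample')` would produce a 1 MB test file to time against. The sample will most likely end in the middle of a stream, so expect the last stream in it to be incomplete.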

Regarding "Python and zlib: Terribly slow decompressing concatenated streams", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/16506590/
