
python-2.7 - How do I create an archive in Python that keeps the same MD5 hash for identical content?

Reposted. Author: 行者123. Updated: 2023-12-04 11:56:13

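As explained in this article, https://medium.com/@mpreziuso/is-gzip-deterministic-26c81bfd0a49, two .tar.gz files that compress exactly the same set of files can have different MD5 hashes. One reason is that gzip writes a timestamp into the header of the compressed file.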
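A small sketch of that effect, using only Python's standard `gzip` module: the same payload compressed with two different `mtime` values yields different bytes, while a fixed `mtime` yields identical output. The helper name `gzip_bytes` is made up for this illustration.

```python
import gzip
import hashlib
import io


def gzip_bytes(payload, mtime):
    """Compress payload in memory, writing the given mtime into the gzip header."""
    buf = io.BytesIO()
    # filename="" keeps a file name out of the header, similar to `gzip -n`
    with gzip.GzipFile(filename="", mode="wb", fileobj=buf, mtime=mtime) as gz:
        gz.write(payload)
    return buf.getvalue()


payload = b"test file"
a = gzip_bytes(payload, mtime=1000)
b = gzip_bytes(payload, mtime=2000)
c = gzip_bytes(payload, mtime=0)
d = gzip_bytes(payload, mtime=0)

# Bytes 4-7 of a gzip stream hold the header timestamp (little-endian).
print("timestamps differ:", a[4:8] != b[4:8])
print("rest of stream identical:", a[:4] == b[:4] and a[8:] == b[8:])
print("md5 equal with different mtime:",
      hashlib.md5(a).hexdigest() == hashlib.md5(b).hexdigest())
print("md5 equal with fixed mtime:",
      hashlib.md5(c).hexdigest() == hashlib.md5(d).hexdigest())
```

Only the four timestamp bytes differ between `a` and `b`, which is enough to change the MD5 of the whole file.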

The article proposes 3 solutions, and I would like to use the first one:

We can use the -n flag in gzip which will make gzip omit the timestamp and the file name from the file header;



This solution works well:
tar -c ./bin |gzip -n >one.tar.gz
tar -c ./bin |gzip -n >two.tar.gz
md5sum one.tar.gz two.tar.gz

However, I don't know what a good way to do this in Python would be.
Is there a way to do it with tarfile (https://docs.python.org/2/library/tarfile.html)?

Best answer

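Martin's answer is correct, but in my case I also wanted to ignore the last-modified date of each file in the tar, so that the archive keeps the same hash even if a file was "modified" without its content actually changing.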

When creating the tar, I can overwrite the values I don't care about so that they are always the same.

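In this example I show that with a plain tar.bz2, recreating my source file with a new timestamp changes the hash (1 and 2 are identical; after the file is recreated, 4 differs). However, if I set the time to Unix epoch 0 (or any other arbitrary fixed time), the hashes of my archives are identical (3, 5 and 6).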

To do this, pass a filter function to tar.add(DIR, filter=tarInfoStripFileAttrs) that strips the unwanted fields, as in the following example:

import tarfile, time, os

def createTestFile():
    with open(DIR + "/someFile.txt", "w") as file:
        file.write("test file")

# Takes in a TarInfo and returns the modified TarInfo:
# https://docs.python.org/3/library/tarfile.html#tarinfo-objects
# intended to be passed as a filter to tarfile.add
# https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.add
def tarInfoStripFileAttrs(tarInfo):
    # set time to epoch timestamp 0, aka 00:00:00 UTC on 1 January 1970
    # note that when extracting this tarfile, this time will be shown as the modified date
    tarInfo.mtime = 0

    # file permissions, probably don't want to remove this, but for some use cases you could
    # tarInfo.mode = 0

    # user/group info
    tarInfo.uid = 0
    tarInfo.uname = ''
    tarInfo.gid = 0
    tarInfo.gname = ''

    # stripping paxheaders may not be required
    # see https://stackoverflow.com/questions/34688392/paxheaders-in-tarball
    tarInfo.pax_headers = {}

    return tarInfo


# "gz" does not work even with the filter: the gzip layer writes its own
# timestamp into the gzip header, independent of the tar metadata
# COMPRESSION_TYPE = "gz"
COMPRESSION_TYPE = "bz2"
DIR = "toTar"
if not os.path.exists(DIR):
    os.mkdir(DIR)

createTestFile()

# 1 and 2: plain archives of the unchanged file
tar1 = tarfile.open("one.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar1.add(DIR)
tar1.close()

tar2 = tarfile.open("two.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar2.add(DIR)
tar2.close()

# 3: same file, metadata stripped by the filter
tar3 = tarfile.open("three.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar3.add(DIR, filter=tarInfoStripFileAttrs)
tar3.close()

# Overwrite the file with the same content, but an updated time
time.sleep(1)
createTestFile()

# 4: plain archive after the timestamp changed
tar4 = tarfile.open("four.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar4.add(DIR)
tar4.close()

# 5 and 6: filtered archives after the timestamp changed
tar5 = tarfile.open("five.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar5.add(DIR, filter=tarInfoStripFileAttrs)
tar5.close()

tar6 = tarfile.open("six.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar6.add(DIR, filter=tarInfoStripFileAttrs)
tar6.close()

$ md5sum one.tar.bz2 two.tar.bz2 three.tar.bz2 four.tar.bz2 five.tar.bz2 six.tar.bz2
0e51c97a8810e45b78baeb1677c3f946 one.tar.bz2 # same as 2
0e51c97a8810e45b78baeb1677c3f946 two.tar.bz2 # same as 1
54a38d35d48d4aa1bd68e12cf7aee511 three.tar.bz2 # same as 5/6
22cf1161897377eefaa5ba89e3fa6acd four.tar.bz2 # would be same as 1/2, but timestamp has changed
54a38d35d48d4aa1bd68e12cf7aee511 five.tar.bz2 # same as 3, even though timestamp has changed
54a38d35d48d4aa1bd68e12cf7aee511 six.tar.bz2 # same as 3, even though timestamp has changed
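
For the `gz` case that the script marks as not working, the remaining difference is most likely the timestamp that the gzip layer writes into its own header. A sketch of one possible workaround, assuming it is acceptable to create the gzip stream yourself: open a `gzip.GzipFile` with `mtime=0` and an empty `filename`, then let `tarfile` write an uncompressed tar into it. The names `strip_attrs`, `make_deterministic_tgz` and `md5_of` are made up for this illustration, and this is not necessarily what Martin's answer does.

```python
import gzip
import hashlib
import os
import tarfile
import tempfile


def strip_attrs(tarinfo):
    """Same idea as tarInfoStripFileAttrs: blank out metadata that may vary."""
    tarinfo.mtime = 0
    tarinfo.uid = tarinfo.gid = 0
    tarinfo.uname = tarinfo.gname = ""
    tarinfo.pax_headers = {}
    return tarinfo


def make_deterministic_tgz(src_dir, out_path):
    """Write src_dir to out_path as a .tar.gz with a fixed gzip header."""
    with open(out_path, "wb") as raw:
        # mtime=0 and filename="" play the role of `gzip -n`
        with gzip.GzipFile(filename="", mode="wb", fileobj=raw, mtime=0) as gz:
            # mode "w" = uncompressed tar; compression is done by gz
            with tarfile.open(fileobj=gz, mode="w") as tar:
                tar.add(src_dir, arcname=os.path.basename(src_dir),
                        filter=strip_attrs)


def md5_of(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


# Demo in a temporary directory (assumed layout, mirroring the example above)
work = tempfile.mkdtemp()
src = os.path.join(work, "toTar")
os.mkdir(src)
target = os.path.join(src, "someFile.txt")

with open(target, "w") as f:
    f.write("test file")
os.utime(target, (1000000000, 1000000000))
first = os.path.join(work, "one.tar.gz")
make_deterministic_tgz(src, first)

# Same content, different modification time
with open(target, "w") as f:
    f.write("test file")
os.utime(target, (1500000000, 1500000000))
second = os.path.join(work, "two.tar.gz")
make_deterministic_tgz(src, second)

print("same md5:", md5_of(first) == md5_of(second))
```

The design choice here is to keep the tar layer and the compression layer separate, so each one's timestamp can be pinned independently.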

Depending on your use case, you may want to adjust which attributes your filter function modifies and how it modifies them.
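
One adjustment worth considering is the order of entries. As far as I know, `TarFile.add` only recurses in sorted order since Python 3.7; on older versions, including 2.7, the order follows `os.listdir`, which can differ between machines or file systems. A sketch that fixes the order explicitly by walking the tree and adding each path with `recursive=False`. The names `normalize`, `add_sorted` and `build` are made up for this illustration, and the fixed `mode` values are only an assumption about what a use case might want.

```python
import io
import os
import tarfile
import tempfile


def normalize(tarinfo):
    """Filter variant: also pins permissions (an assumption, adjust as needed)."""
    tarinfo.mtime = 0
    tarinfo.uid = tarinfo.gid = 0
    tarinfo.uname = tarinfo.gname = ""
    tarinfo.pax_headers = {}
    tarinfo.mode = 0o755 if tarinfo.isdir() else 0o644
    return tarinfo


def add_sorted(tar, root):
    """Add root and everything below it in a fixed, sorted order."""
    arc_root = os.path.basename(os.path.normpath(root))
    tar.add(root, arcname=arc_root, recursive=False, filter=normalize)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort controls the order os.walk descends in
        for name in sorted(dirnames + filenames):
            full = os.path.join(dirpath, name)
            arcname = os.path.join(arc_root, os.path.relpath(full, root))
            tar.add(full, arcname=arcname, recursive=False, filter=normalize)


def build(root):
    """Return the bytes of an uncompressed tar of root."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        add_sorted(tar, root)
    return buf.getvalue()


# Demo tree in a temporary directory
root = os.path.join(tempfile.mkdtemp(), "toTar")
os.makedirs(os.path.join(root, "sub"))
for rel in ("b.txt", "a.txt", os.path.join("sub", "c.txt")):
    with open(os.path.join(root, rel), "w") as f:
        f.write("test file")

first = build(root)
os.utime(os.path.join(root, "a.txt"), (1, 1))  # change only the timestamp
second = build(root)

print("identical archives:", first == second)
with tarfile.open(fileobj=io.BytesIO(first)) as tar:
    print(tar.getnames())
```

The resulting bytes can then be passed through a compressor without a header timestamp, such as bz2, or a gzip stream with a fixed `mtime`.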

Regarding "python-2.7 - How do I create an archive in Python that keeps the same MD5 hash for identical content?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45035782/
