
java - How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?


I have a (huge) set of similar data files, and the set keeps growing. A single file is about 10K in size, and each file has to be compressed individually. The compression is done with the zlib library, as used by the java.util.zip.Deflater class. When I pass a dictionary to the Deflate algorithm via setDictionary, I can improve the compression ratio.

Is there a way (an algorithm) to find the "optimal" dictionary, i.e. a dictionary that gives the best overall compression ratio?

See the zlib manual.
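For reference, this is roughly how a preset dictionary is wired into java.util.zip; the decompressing side has to be given the exact same bytes. The class name and string contents below are only illustrative:

import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionaryDemo {
    public static void main(String[] args) throws Exception {
        byte[] dictionary = "strings likely to appear in the data to be compressed"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] input = "the data to be compressed shares strings with the dictionary"
                .getBytes(StandardCharsets.ISO_8859_1);

        // Compress: the dictionary must be set before any input is deflated.
        Deflater deflater = new Deflater();
        deflater.setDictionary(dictionary);
        deflater.setInput(input);
        deflater.finish();
        byte[] compressed = new byte[1024];
        int compressedLength = deflater.deflate(compressed);
        deflater.end();

        // Decompress: the Inflater reports that it needs the same dictionary.
        Inflater inflater = new Inflater();
        inflater.setInput(compressed, 0, compressedLength);
        byte[] restored = new byte[input.length];
        int produced = inflater.inflate(restored);
        if (produced == 0 && inflater.needsDictionary()) {
            inflater.setDictionary(dictionary); // must be byte-for-byte identical
            produced = inflater.inflate(restored);
        }
        inflater.end();

        System.out.println("original=" + input.length
                + " compressed=" + compressedLength
                + " restored=" + produced);
    }
}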

Best Answer

John Reiser explained on comp.compression:

For the dictionary: make a histogram of short substrings, sort by payoff (number of occurrences times number of bits saved when compressed) and put the highest-payoff substrings into the dictionary. For example, if k is the length of the shortest substring that can be compressed (usually 3==k or 2==k), then make a histogram of all the substrings of lengths k, 1+k, 2+k, and 3+k. Of course there is some art to placing those substrings into the dictionary, taking advantage of substrings, overlapping, short strings nearer to the high-address end, etc.
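A rough Java sketch of that histogram idea, under some simplifying assumptions: k = 3, payoff approximated as (occurrences - 1) * length (the real bit savings depend on the encoder), no merging of overlapping substrings, and the dictionary capped by the caller (e.g. zlib's 32 KB window):

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SubstringHistogram {
    // Rank short substrings (lengths k .. k+3) of the sample files by an
    // approximate payoff and pack the best ones into a dictionary, with the
    // highest-payoff strings nearest the end (the high-address side).
    static byte[] buildDictionary(List<byte[]> samples, int k, int maxDictSize) {
        Map<String, Integer> counts = new HashMap<>();
        for (byte[] data : samples) {
            String s = new String(data, StandardCharsets.ISO_8859_1);
            for (int len = k; len <= k + 3; len++) {
                for (int i = 0; i + len <= s.length(); i++) {
                    counts.merge(s.substring(i, i + len), 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((a, b) -> Long.compare(payoff(b), payoff(a))); // best first

        StringBuilder dict = new StringBuilder();
        for (Map.Entry<String, Integer> e : ranked) {
            if (payoff(e) <= 0) break;                              // no savings left
            if (dict.length() + e.getKey().length() > maxDictSize) break;
            dict.insert(0, e.getKey()); // prepend so the best strings end up last
        }
        return dict.toString().getBytes(StandardCharsets.ISO_8859_1);
    }

    // Crude stand-in for "occurrences times bits saved": every repeat beyond
    // the first is assumed to save roughly the substring's length.
    static long payoff(Map.Entry<String, Integer> e) {
        return (long) (e.getValue() - 1) * e.getKey().length();
    }
}

Calling buildDictionary(samples, 3, 32 * 1024) on a representative subset of the file set would yield a candidate dictionary to pass to Deflater.setDictionary.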

The Linux kernel uses a similar technique to compress names of symbols that are used for printing backtraces of the subroutine calling stack. See the file scripts/kallsyms.c. For instance, https://code.woboq.org/linux/linux/scripts/kallsyms.c.html

The zlib manual recommends putting the most frequently used strings towards the end of the dictionary:

The dictionary should consist of strings (byte sequences) that are likely to be encountered later in the data to be compressed, with the most commonly used strings preferably put towards the end of the dictionary. Using a dictionary is most useful when the data to be compressed is short and can be predicted with good accuracy; the data can then be compressed better than with the default empty dictionary.

This is because LZ77 is a sliding-window algorithm, so substrings placed near the end of the dictionary sit closer to the data being compressed and remain reachable further into your data stream than the ones at the beginning.

I would try generating the dictionary with a high-level language that has good string support. A crude JavaScript example:

var str = "The dictionary should consist of strings (byte sequences) that"
+ " are likely to be encountered later in the data to be compressed,"
+ " with the most commonly used strings preferably put towards the "
+ "end of the dictionary. Using a dictionary is most useful when the"
+ " data to be compressed is short and can be predicted with good"
+ " accuracy; the data can then be compressed better than with the "
+ "default empty dictionary.";
// Extract words, remove punctuation (extra: replace(/\s/g, " "))
var words = str.replace(/[,\;.:\(\)]/g, "").split(" ").sort();
var wcnt = [], w = "", cnt = 0; // pairs, current word, current word count
for (var i = 0, cnt = 0, w = ""; i < words.length; i++) {
    if (words[i] === w) {
        cnt++; // another match
    } else {
        if (w !== "")
            wcnt.push([cnt, w]); // Push a pair (count, word)
        cnt = 1; // Start counting for this word
        w = words[i]; // Start counting again
    }
}
if (w !== "")
    wcnt.push([cnt, w]); // Push last word
wcnt.sort(); // Greater matches at the end
for (var i in wcnt)
    wcnt[i] = wcnt[i][1]; // Just take the words
var dict = wcnt.join("").slice(-70); // Join the words, take last 70 chars

Then dict is a 70-character string:

rdsusedusefulwhencanismostofstringscompresseddatatowithdictionarybethe

You can try it by copying, pasting, and running the snippet yourself (add a console.log(dict) at the end to print the result).

This uses only whole words, not substrings. There are also ways to overlap common substrings in order to save space in the dictionary.
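Since the question asks for the dictionary with the best overall compression ratio, the most direct check is empirical: deflate every sample file (or a representative subset) with each candidate dictionary and keep the candidate with the smallest total output. A minimal sketch, with an illustrative method name:

import java.util.List;
import java.util.zip.Deflater;

public class DictionaryScorer {
    // Total deflated size of all sample files when compressed with the given
    // dictionary (pass null to measure the baseline without a dictionary).
    static long totalCompressedSize(List<byte[]> samples, byte[] dictionary) {
        long total = 0;
        byte[] buffer = new byte[64 * 1024]; // comfortably larger than a ~10K file
        for (byte[] sample : samples) {
            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            if (dictionary != null) {
                deflater.setDictionary(dictionary);
            }
            deflater.setInput(sample);
            deflater.finish();
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            deflater.end();
        }
        return total;
    }
}

Comparing totalCompressedSize(samples, candidate) against totalCompressedSize(samples, null) also shows how much a given candidate dictionary actually saves on ~10K files.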

Regarding "java - How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/2011653/
