
algorithm - How does the smaz compression library work?

Reposted. Author: 可可西里. Updated: 2023-11-01 11:13:24

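I am currently working on a short-text compression project for my own language. I am a beginner and know a few basic compression algorithms, such as LZW, but I still don't understand how smaz works. I have two questions:

  1. How does smaz work?
  2. How are the codebook and the reverse codebook built?

Can anyone explain this to me?

Thank you very much.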

Best answer

Trying to answer your questions.

How does smaz work? According to [1],

Smaz has a hard-wired constant built-in codebook of 254 common English words, word fragments, bigrams, and the lowercase letters (except j, k, q). The inner loop of the Smaz decoder is very simple:

  • Fetch the next byte X from the compressed file.
    1. Is X == 254? Single byte literal: fetch the next byte L, and pass it straight through to the decoded text.
    2. Is X == 255? Literal string: fetch the next byte L, then pass the following L+1 bytes straight through to the decoded text.
    3. Any other value of X: lookup the X'th "word" in the codebook (that "word" can be from 1 to 5 letters), and copy that word to the decoded text.
  • Repeat until there are no more compressed bytes left in the compressed file.

Because the codebook is constant, the Smaz decoder is unable to "learn" new words and compress them, no matter how often they appear in the original text.
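The decoder loop quoted above can be sketched in Python as follows. This is only an illustration of the three cases (254, 255, and a codebook index), not the library's actual C code. The function name `smaz_style_decode` and the tiny `TOY_CODEBOOK` are hypothetical and stand in for the real 254-entry table, and the sketch assumes well-formed input with no bounds checking.

```python
def smaz_style_decode(data: bytes, codebook: list[bytes]) -> bytes:
    """Decode a byte string using the scheme described in the quote above."""
    out = bytearray()
    i = 0
    while i < len(data):
        x = data[i]
        if x == 254:
            # Single-byte literal: the next byte is copied through unchanged.
            out.append(data[i + 1])
            i += 2
        elif x == 255:
            # Literal string: next byte is L, followed by L + 1 raw bytes.
            length = data[i + 1] + 1
            out += data[i + 2 : i + 2 + length]
            i += 2 + length
        else:
            # Any other value is an index into the fixed codebook.
            out += codebook[x]
            i += 1
    return bytes(out)


# Hypothetical toy codebook for illustration; NOT the real Smaz table.
TOY_CODEBOOK = [b" ", b"the", b"e", b"t", b"a", b"of", b"o", b"and"]

compressed = bytes(
    [1, 0, 254, ord("c"), 4, 3, 0, 255, 2, ord("X"), ord("Y"), ord("Z")]
)
print(smaz_style_decode(compressed, TOY_CODEBOOK))  # b'the cat XYZ'
```

In this example "c" is not in the toy codebook, so it costs two bytes (254 plus the literal), while "the" costs a single byte. That is the trade-off a fixed codebook makes: text that matches the table shrinks, and text that does not can grow.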

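This page may help in understanding the code.

How are the codebook and the reverse codebook built? According to the TODO file in the repository and the author's comments on Reddit, the dictionary was generated by an unreleased Ruby script. The author also explains: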

btw what the Ruby program does is to consider all the possible substrings, and even all the possible separated words, and build a table of frequencies, than adjust the weight based on the string length, and finally hand tuning the table to compress specific things very well. I added by hand the "http://" and ".com" token for example, removing the final two entries.
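Since the Ruby script was never released, its exact behaviour is unknown. The sketch below only illustrates the general idea from the quote: count every substring, weight the counts by string length, and keep the best entries. The weighting used here (frequency times length), the 5-character limit, and the names `build_codebook` and `build_reverse_codebook` are all assumptions made for illustration, and the hand-tuning step the author mentions is not reproduced.

```python
from collections import Counter


def build_codebook(corpus, max_len=5, size=254):
    """Pick the `size` highest-scoring substrings of up to `max_len` characters."""
    counts = Counter()
    for text in corpus:
        for start in range(len(text)):
            for n in range(1, max_len + 1):
                if start + n <= len(text):
                    counts[text[start : start + n]] += 1
    # Assumed weighting: frequency * length, which approximates the number of
    # characters an entry would replace. Ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1] * len(kv[0]), kv[0]))
    return [s for s, _ in ranked[:size]]


def build_reverse_codebook(codebook):
    """Map each codebook string back to its index, for lookups while encoding."""
    return {entry: index for index, entry in enumerate(codebook)}


codebook = build_codebook(["abab"], max_len=2, size=2)
print(codebook)                          # ['ab', 'a']
print(build_reverse_codebook(codebook))  # {'ab': 0, 'a': 1}
```

For a language other than English, the same procedure could be run over a representative corpus in that language. Which of the two tables smaz itself calls the "codebook" and which the "reverse codebook" is best checked against the library source; in this sketch the list serves index-to-string lookups and the dictionary serves string-to-index lookups.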

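An alternative for your project could be the shoco library, which supports generating a custom compression model based on your language.

Regarding "algorithm - How does the smaz compression library work?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33331552/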
