algorithm - Lossless compression of small blocks with a precomputed dictionary

Reposted · Author: 塔克拉玛干 · Updated: 2023-11-03 02:20:11

I have an application where I read and write small blocks of data (a few hundred bytes) hundreds of millions of times. I would like to generate a compression dictionary from a sample data file and use that dictionary forever as I read and write the small blocks. I'm leaning toward the LZW compression algorithm. The Wikipedia page ( http://en.wikipedia.org/wiki/Lempel-Ziv-Welch ) lists pseudocode for both compression and decompression. It looks fairly straightforward to modify it so that dictionary creation is a separate block of code. So I have two questions:

  1. Am I on the right track, or is there a better way?
  2. Why does the LZW algorithm add to the dictionary during the decompression step? Can I omit that, or would my dictionary lose efficiency?

Thanks.

Update: Now I'm thinking the ideal case would be to find a library that lets me store the dictionary separately from the compressed data. Does anything like that exist?

Update: I ended up taking the code at http://www.enusbaum.com/blog/2009/05/22/example-huffman-compression-routine-in-c and adapting it. I'm Chris in the comments on that page. I emailed my modifications back to the blog author but haven't heard back yet. The compression ratios I'm seeing with that code are not at all impressive. Maybe that is due to the 8-bit tree size.

Update: I converted it to 16 bits and the compression is better. It's also much faster than the original code.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace Book.Core
{
    public class Huffman16
    {
        private readonly double log2 = Math.Log(2);

        private List<Node> HuffmanTree = new List<Node>();

        internal class Node
        {
            public long Frequency { get; set; }
            public byte Uncoded0 { get; set; }
            public byte Uncoded1 { get; set; }
            public uint Coded { get; set; }
            public int CodeLength { get; set; }
            public Node Left { get; set; }
            public Node Right { get; set; }

            public bool IsLeaf
            {
                get { return Left == null; }
            }

            public override string ToString()
            {
                var coded = "00000000" + Convert.ToString(Coded, 2);
                return string.Format("Uncoded={0}, Coded={1}, Frequency={2}", (Uncoded1 << 8) | Uncoded0, coded.Substring(coded.Length - CodeLength), Frequency);
            }
        }

        public Huffman16(long[] frequencies)
        {
            if (frequencies.Length != ushort.MaxValue + 1)
            {
                throw new ArgumentException("frequencies.Length must equal " + (ushort.MaxValue + 1));
            }
            BuildTree(frequencies);
            EncodeTree(HuffmanTree[HuffmanTree.Count - 1], 0, 0);
        }

        public static long[] GetFrequencies(byte[] sampleData, bool safe)
        {
            if (sampleData.Length % 2 != 0)
            {
                throw new ArgumentException("sampleData.Length must be a multiple of 2.");
            }
            var histogram = new long[ushort.MaxValue + 1];
            if (safe)
            {
                //Give every symbol a nonzero count so any later input can be encoded
                for (int i = 0; i <= ushort.MaxValue; i++)
                {
                    histogram[i] = 1;
                }
            }
            for (int i = 0; i < sampleData.Length; i += 2)
            {
                histogram[(sampleData[i] << 8) | sampleData[i + 1]] += 1000;
            }
            return histogram;
        }

        public byte[] Encode(byte[] plainData)
        {
            if (plainData.Length % 2 != 0)
            {
                throw new ArgumentException("plainData.Length must be a multiple of 2.");
            }

            Int64 iBuffer = 0;
            int iBufferCount = 0;

            using (MemoryStream msEncodedOutput = new MemoryStream())
            {
                //Write Final Output Size 1st
                msEncodedOutput.Write(BitConverter.GetBytes(plainData.Length), 0, 4);

                //Begin Writing Encoded Data Stream
                iBuffer = 0;
                iBufferCount = 0;
                for (int i = 0; i < plainData.Length; i += 2)
                {
                    Node FoundLeaf = HuffmanTree[(plainData[i] << 8) | plainData[i + 1]];

                    //How many bits are we adding?
                    iBufferCount += FoundLeaf.CodeLength;

                    //Shift the buffer
                    iBuffer = (iBuffer << FoundLeaf.CodeLength) | FoundLeaf.Coded;

                    //Are there at least 8 bits in the buffer?
                    while (iBufferCount > 7)
                    {
                        //Write to output
                        int iBufferOutput = (int)(iBuffer >> (iBufferCount - 8));
                        msEncodedOutput.WriteByte((byte)iBufferOutput);
                        iBufferCount = iBufferCount - 8;
                        iBufferOutput <<= iBufferCount;
                        iBuffer ^= iBufferOutput;
                    }
                }

                //Write remaining bits in buffer
                if (iBufferCount > 0)
                {
                    iBuffer = iBuffer << (8 - iBufferCount);
                    msEncodedOutput.WriteByte((byte)iBuffer);
                }
                return msEncodedOutput.ToArray();
            }
        }

        public byte[] Decode(byte[] bInput)
        {
            long iInputBuffer = 0;
            int iBytesWritten = 0;

            //Establish Output Buffer to write unencoded data to
            byte[] bDecodedOutput = new byte[BitConverter.ToInt32(bInput, 0)];

            var current = HuffmanTree[HuffmanTree.Count - 1];

            //Begin Looping through Input and Decoding
            iInputBuffer = 0;
            for (int i = 4; i < bInput.Length; i++)
            {
                iInputBuffer = bInput[i];

                for (int bit = 0; bit < 8; bit++)
                {
                    if ((iInputBuffer & 128) == 0)
                    {
                        current = current.Left;
                    }
                    else
                    {
                        current = current.Right;
                    }
                    if (current.IsLeaf)
                    {
                        bDecodedOutput[iBytesWritten++] = current.Uncoded1;
                        bDecodedOutput[iBytesWritten++] = current.Uncoded0;
                        if (iBytesWritten == bDecodedOutput.Length)
                        {
                            return bDecodedOutput;
                        }
                        current = HuffmanTree[HuffmanTree.Count - 1];
                    }
                    iInputBuffer <<= 1;
                }
            }
            throw new InvalidDataException("Ran out of input before decoding completed.");
        }

        private static void EncodeTree(Node node, int depth, uint value)
        {
            if (node != null)
            {
                if (node.IsLeaf)
                {
                    node.CodeLength = depth;
                    node.Coded = value;
                }
                else
                {
                    depth++;
                    value <<= 1;
                    EncodeTree(node.Left, depth, value);
                    EncodeTree(node.Right, depth, value | 1);
                }
            }
        }

        private void BuildTree(long[] frequencies)
        {
            //Add a tiny, strictly increasing fraction to each key so that
            //equal frequencies still produce unique SortedDictionary keys
            var tiny = 0.1 / ushort.MaxValue;
            var fraction = 0.0;

            SortedDictionary<double, Node> trees = new SortedDictionary<double, Node>();
            for (int i = 0; i <= ushort.MaxValue; i++)
            {
                var leaf = new Node()
                {
                    Uncoded1 = (byte)(i >> 8),
                    Uncoded0 = (byte)(i & 255),
                    Frequency = frequencies[i]
                };
                HuffmanTree.Add(leaf);
                if (leaf.Frequency > 0)
                {
                    trees.Add(leaf.Frequency + (fraction += tiny), leaf);
                }
            }

            while (trees.Count > 1)
            {
                var e = trees.GetEnumerator();
                e.MoveNext();
                var first = e.Current;
                e.MoveNext();
                var second = e.Current;

                //Join smallest two nodes
                var NewParent = new Node();
                NewParent.Frequency = first.Value.Frequency + second.Value.Frequency;
                NewParent.Left = first.Value;
                NewParent.Right = second.Value;

                HuffmanTree.Add(NewParent);

                //Remove the two that just got joined into one
                trees.Remove(first.Key);
                trees.Remove(second.Key);

                trees.Add(NewParent.Frequency + (fraction += tiny), NewParent);
            }
        }

    }

}

Usage example:

Create the dictionary from sample data:

var freqs = Huffman16.GetFrequencies(File.ReadAllBytes(@"D:\nodes"), true);

Initialize an encoder with the given dictionary:

var huff = new Huffman16(freqs);

And do some compression:

var encoded = huff.Encode(raw);

And decompression:

var raw = huff.Decode(encoded);
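Putting those snippets together, a minimal round-trip might look like the following. The sample and payload bytes here are made-up placeholders (in practice the sample would come from a representative file, as in the `File.ReadAllBytes(@"D:\nodes")` call above); note that both byte arrays must have even length, since the class works on 16-bit symbols.

```csharp
using System;
using System.Text;
using Book.Core;

class Program
{
    static void Main()
    {
        // Hypothetical sample data standing in for a real sample file.
        byte[] sample = Encoding.ASCII.GetBytes("abababababcdcdcdcdcdab");

        // safe=true seeds every 16-bit symbol with a count of 1, so any
        // later input can be encoded even if it never appeared in the sample.
        var freqs = Huffman16.GetFrequencies(sample, true);
        var huff = new Huffman16(freqs);

        byte[] raw = Encoding.ASCII.GetBytes("ababcdcdababcdcd");
        byte[] encoded = huff.Encode(raw);
        byte[] decoded = huff.Decode(encoded);

        // Round-trip check: decoded output should match the input.
        Console.WriteLine(Encoding.ASCII.GetString(decoded) == "ababcdcdababcdcd");
    }
}
```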

Best Answer

The hardest part, as I see it, is how you build your static dictionary. You don't want to use the LZW dictionary built from your sample data. LZW wastes a bunch of time learning, since it can't build the dictionary faster than the decompressor can (a token only gets used the second time the compressor sees it, so the decompressor can add it to its dictionary the first time it sees it). The flip side of this is that it adds things to the dictionary that may never get used, just in case the string shows up again. (For example, to have a token for "stackoverflow" you'll also have entries for "ac", "ko", "ve", "rf", etc...)

However, looking at the raw token stream from an LZ77 algorithm could work well. You will only see tokens for strings that occurred at least twice. You can then build a list of the most common tokens/strings to include in your dictionary.
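One crude way to approximate that analysis without a full LZ77 pass is to count repeated substrings of the sample directly and rank them by the bytes a dictionary reference would save. This is only a sketch of the idea; the scoring heuristic `(occurrences - 1) * length` is an assumption, not something the answer specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DictionaryCandidates
{
    // Count every substring of length 2..maxLen, keep those seen at least
    // twice (the ones an LZ77 pass could surface as back-references), and
    // rank them by a rough estimate of the bytes a dictionary entry saves.
    static IEnumerable<KeyValuePair<string, int>> TopCandidates(string sample, int maxLen, int top)
    {
        var counts = new Dictionary<string, int>();
        for (int len = 2; len <= maxLen; len++)
            for (int i = 0; i + len <= sample.Length; i++)
            {
                var s = sample.Substring(i, len);
                counts[s] = counts.TryGetValue(s, out var c) ? c + 1 : 1;
            }
        return counts.Where(kv => kv.Value >= 2)
                     .OrderByDescending(kv => (kv.Value - 1) * kv.Key.Length)
                     .Take(top);
    }

    static void Main()
    {
        foreach (var kv in TopCandidates("stackoverflow loves stackoverflow", 13, 5))
            Console.WriteLine($"{kv.Key} x{kv.Value}");
    }
}
```

On this toy input, the full repeated word "stackoverflow" ranks first, since it repeats and is the longest string that does.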

Once you have a static dictionary, using LZW sans the dictionary updates seems like an easy implementation, but to get the best compression I'd consider a static Huffman table instead of the traditional 12-bit fixed-size tokens (as George Phillips suggested). An LZW dictionary will burn tokens for all the substrings you may never actually encode (e.g., if you can encode "stackoverflow", there will be entries for "st", "sta", "stac", "stack", "stacko", etc.).
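That prefix build-up is easy to see by running just LZW's dictionary-update rule over repeated input. This sketch (not part of the original answer) prints every multi-character entry the compressor would create on the way to ever having "stackoverflow" as a single token:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class LzwDictionaryGrowth
{
    static void Main()
    {
        // Six copies of the word: each pass through it lengthens the
        // learned entries by roughly one character ("st", then "sta",
        // then "stac", ...), which is the waste described above.
        string input = string.Concat(Enumerable.Repeat("stackoverflow", 6));

        var dict = new HashSet<string>();
        for (char c = (char)0; c < (char)256; c++) dict.Add(c.ToString());

        string w = "";
        foreach (char c in input)
        {
            if (dict.Contains(w + c))
            {
                w += c; // keep extending the current match
            }
            else
            {
                dict.Add(w + c); // the entry a decompressor would also learn
                Console.WriteLine(w + c);
                w = c.ToString();
            }
        }
    }
}
```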

At that point it really isn't LZW anymore - what makes LZW clever is how the decompressor builds the same dictionary the compressor used while seeing only the compressed data stream, and that's something you won't be using. But all LZW implementations have a state where the dictionary is full and no longer updated; that's how you would use it with your static dictionary.

Regarding algorithm - lossless compression of small blocks with a precomputed dictionary, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/1072242/
