compression - Compressing a set of large integers


Reposted by: 行者123. Updated: 2023-12-03 22:30:16

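I have a set of integers for which I would like the most compact representation possible.
I have the following constraints/features:

It is a set, or in other words, a list of unique integers in which the order does not matter.

The size of the set, L, is relatively small (typically 1000 elements).

The integers follow a uniform distribution between 0 and N-1, where N is relatively large (say 2^32).

Access to the elements of the compressed set is random, but it is fine if the decompression procedure is not that fast.

The compression should be lossless, obviously.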

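I have tried a few things, but I am not satisfied with the results, and I am somehow convinced that a better solution exists:

Delta encoding (sort, then encode the differences), or alternatively sort, then encode the difference between the i-th element and i*N/L. Both give reasonable results, but not great ones, probably because of the typical sizes of N and L. Huffman coding the deltas does not help, because they are usually large.

Recursive range reduction ( http://ygdes.com/ddj-3r/ddj-3r_compact.html ). This looks clever, but it works best on exponentially decreasing integers, which is definitely not the case here.

A few discussions here on Stack Overflow are similar, but not quite equivalent to my problem ( C Library for compressing sequential positive integers , Compress sorted integers ).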
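
For reference, the two residual schemes from the first item can be sketched roughly as follows. This is only an illustration under assumptions: the function names are hypothetical, elements are taken to fit in 32 bits, and the residuals are left uncoded (how to store them compactly is exactly the open question).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain delta coding: out[0] = set[0], out[i] = set[i] - set[i-1].
   set[] must be sorted in strictly ascending order. */
void plain_deltas(const uint32_t *set, size_t len, uint32_t *out)
{
    for (size_t i = 0; i < len; i++) {
        assert(i == 0 || set[i] > set[i - 1]);
        out[i] = i == 0 ? set[0] : set[i] - set[i - 1];
    }
}

/* Residual of each element against the position i*n/len that a perfectly
   even spread over 0..n-1 would give it.  The result can be negative.
   Assumes i*n does not overflow 64 bits (true for n = 2^32, len ~ 1000). */
void expected_position_residuals(const uint32_t *set, size_t len,
                                 uint64_t n, int64_t *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = (int64_t)set[i] - (int64_t)((uint64_t)i * n / len);
}
```

For the sorted input {10, 20, 70} with n = 90, the plain deltas are 10, 10, 50 and the residuals against the expected positions 0, 30, 60 are 10, -10, 10.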

I would be glad to hear any ideas you might have. Thanks in advance!

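Update:

It turns out that delta encoding seems to approximate the optimal solution. This may be different for other distributions of the elements in the set.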

Best answer

You can get an idea of the best you can do by counting. (I wish stackoverflow allowed TeX equations like math.stackexchange. Anyway ...)

ceiling(log(Combination(2^32,1000)) / (8 * log(2))) = 2934

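So if, as you say, the choices are uniformly distributed, then the best compression you could hope for on average for that particular case is 2934 bytes. The best ratio is 73.35% of the unencoded representation of 4000 bytes.
Combination(2^32,1000) is simply the total number of possible inputs to the compression algorithm. If those are uniformly distributed, then the optimal coding is one giant integer that identifies each possible input by an index. Each giant integer value uniquely identifies one of the inputs. Imagine looking up the input by index in a giant table. ceiling(log(Combination(2^32,1000)) / log(2)) is how many bits you need for that index integer.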
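
The count can be checked without floating point. The sketch below is one way to do it, not the method used to get the figure above: it builds Combination(n,k) exactly in a small fixed-size bignum and measures its bit length. The name combination_index_bits and the LIMBS capacity are assumptions chosen for this illustration, and it only handles n up to 2^32.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LIMBS 1024   /* 32768 bits of room: an assumed capacity, ample for k = 1000 */

/* Return ceil(log2(C(n, k))): the number of bits needed to index every
   k-element subset of 0..n-1.  Requires k <= n <= 2^32. */
unsigned combination_index_bits(uint64_t n, unsigned k)
{
    static uint32_t c[LIMBS];       /* little-endian 32-bit limbs */
    int used = 1;

    assert(n <= ((uint64_t)1 << 32) && k <= n);
    memset(c, 0, sizeof c);
    c[0] = 1;
    for (unsigned i = 0; i < k; i++) {
        /* c *= n - i */
        uint64_t carry = 0, mul = n - i;
        for (int j = 0; j < used; j++) {
            uint64_t t = c[j] * mul + carry;
            c[j] = (uint32_t)t;
            carry = t >> 32;
        }
        if (carry) {
            assert(used < LIMBS);
            c[used++] = (uint32_t)carry;
        }
        /* c /= i + 1, which is exact because C(n, i + 1) is an integer */
        uint64_t rem = 0;
        for (int j = used - 1; j >= 0; j--) {
            uint64_t t = (rem << 32) | c[j];
            c[j] = (uint32_t)(t / (i + 1));
            rem = t % (i + 1);
        }
        assert(rem == 0);
        while (used > 1 && c[used - 1] == 0)
            used--;
    }

    /* c -= 1, so that the bit length of c is ceil(log2) of the count */
    for (int j = 0; j < used; j++) {
        if (c[j] != 0) {
            c[j]--;
            break;
        }
        c[j] = 0xffffffffu;         /* borrow from the next limb */
    }

    int top = used - 1;
    while (top > 0 && c[top] == 0)
        top--;
    unsigned bits = 32u * (unsigned)top;
    for (uint32_t v = c[top]; v != 0; v >>= 1)
        bits++;
    return bits;
}
```

Rounding the result for n = 2^32 and k = 1000 up to whole bytes, (bits + 7) / 8, gives the 2934 quoted above.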

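Update:

I found a way to get close to the theoretical best using off-the-shelf compression tools. I sort, apply delta coding, and subtract one from each delta (since the delta between successive distinct elements is at least one). Then the trick is that I write out all the high bytes, then the next most significant bytes, and so on. The high bytes of the deltas minus one tend to be zero, so this groups a lot of zeros together, which the standard compression utilities love. The next set of bytes also tends to be biased toward low values.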
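
A minimal sketch of that byte-plane layout, written from the description above rather than taken from the answer's own code. The function names are hypothetical, and two details are assumptions: the first element is stored as itself, and each delta occupies exactly four bytes. The output buffer would then be handed to gzip or xz.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the deltas-minus-one of the sorted set[0..n-1] into out[0..4n-1],
   most significant byte plane first: all high bytes, then the next, etc. */
void delta_planes_encode(const uint32_t *set, size_t n, uint8_t *out)
{
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t d;
        if (i == 0) {
            d = set[0];                 /* assumption: first value stored as is */
        } else {
            assert(set[i] > prev);      /* sorted, no duplicates */
            d = set[i] - prev - 1;
        }
        prev = set[i];
        for (int p = 0; p < 4; p++)     /* plane 0 holds the high bytes */
            out[(size_t)p * n + i] = (uint8_t)(d >> (24 - 8 * p));
    }
}

/* Inverse of delta_planes_encode(). */
void delta_planes_decode(const uint8_t *in, size_t n, uint32_t *set)
{
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t d = 0;
        for (int p = 0; p < 4; p++)
            d = (d << 8) | in[(size_t)p * n + i];
        set[i] = i == 0 ? d : prev + 1 + d;
        prev = set[i];
    }
}
```

With 1000 uniform samples from 0..2^32-1 the typical gap is around 2^22, which is why the first plane is almost entirely zeros and the second is biased toward small values.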

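For the example (1000 uniform and distinct samples from 0..2^32-1), I get an average of 3110 bytes when running that through gzip -9, and 3098 bytes through xz -9 (xz uses the same compression, LZMA, as 7zip). Those are pretty close to the theoretical best average of 2934. Also, gzip has an overhead of 18 bytes and xz has an overhead of 24 bytes, in both cases for headers and trailers. So a fairer comparison with the theoretical best would be 3092 for gzip -9 and 3074 for xz -9. That is around 5% larger than the theoretical best.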

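Update 2:

I implemented direct encoding of the permutations, and achieved an average of 2974 bytes, which is only a little more than 1% over the theoretical best. I used the GNU multiple precision arithmetic library to encode an index for each permutation as a giant integer. The actual code for the encoding and decoding is shown below. I added comments for the mpz_* functions where it might not be obvious from the name what arithmetic operation they perform.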

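#include <assert.h>
#include <gmp.h>

/* "local" marks functions that are private to this file */
#define local static

/* Recursively code the members in set[] between low and high (low and high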
   themselves have already been coded).  First code the middle member 'mid'.
   Then recursively code the members between low and mid, and then between mid
   and high. */
local void combination_encode_between(mpz_t pack, mpz_t base,
                                      const unsigned long *set,
                                      int low, int high)
{
    int mid;

    /* compute the middle position -- if there is nothing between low and high,
       then return immediately (also in that case, verify that set[] is sorted
       in ascending order) */
    mid = (low + high) >> 1;
    if (mid == low) {
        assert(set[low] < set[high]);
        return;
    }

    /* code set[mid] into pack, and update base with the number of possible
       set[mid] values between set[low] and set[high] for the next coded
       member */
        /* pack += base * (set[mid] - set[low] - 1) */
    mpz_addmul_ui(pack, base, set[mid] - set[low] - 1);
        /* base *= set[high] - set[low] - 1 */
    mpz_mul_ui(base, base, set[high] - set[low] - 1);

    /* code the rest between low and high */
    combination_encode_between(pack, base, set, low, mid);
    combination_encode_between(pack, base, set, mid, high);
}

/* Encode the set of integers set[0..num-1], where each element is a unique
   integer in the range 0..max.  No value appears more than once in set[]
   (hence the name "set").  The elements of set[] must be sorted in ascending
   order. */
local void combination_encode(mpz_t pack, const unsigned long *set, int num,
                              unsigned long max)
{
    mpz_t base;

    /* handle degenerate cases and verify last member <= max -- code set[0]
       into pack as simply itself and set base to the number of possible set[0]
       values for coding the next member */
    if (num < 1) {
            /* pack = 0 */
        mpz_set_ui(pack, 0);
        return;
    }
        /* pack = set[0] */
    mpz_set_ui(pack, set[0]);
    if (num < 2) {
        assert(set[0] <= max);
        return;
    }
    assert(set[num - 1] <= max);
        /* base = max - num + 2 */
    mpz_init_set_ui(base, max - num + 2);

    /* code the last member of the set and update base with the number of
       possible last member values */
        /* pack += base * (set[num - 1] - set[0] - 1) */
    mpz_addmul_ui(pack, base, set[num - 1] - set[0] - 1);
        /* base *= max - set[0] */
    mpz_mul_ui(base, base, max - set[0]);

    /* encode the members between 0 and num - 1 */
    combination_encode_between(pack, base, set, 0, num - 1);
    mpz_clear(base);
}

/* Recursively decode the members in set[] between low and high (low and high
   themselves have already been decoded).  First decode the middle member
   'mid'. Then recursively decode the members between low and mid, and then
   between mid and high. */
local void combination_decode_between(mpz_t unpack, unsigned long *set,
                                      int low, int high)
{
    int mid;
    unsigned long rem;

    /* compute the middle position -- if there is nothing between low and high,
       then return immediately */
    mid = (low + high) >> 1;
    if (mid == low)
        return;

    /* extract set[mid] as the remainder of dividing unpack by the number of
       possible set[mid] values, update unpack with the quotient */
        /* div = set[high] - set[low] - 1, rem = unpack % div, unpack /= div */
    rem = mpz_fdiv_q_ui(unpack, unpack, set[high] - set[low] - 1);
    set[mid] = set[low] + 1 + rem;

    /* decode the rest between low and high */
    combination_decode_between(unpack, set, low, mid);
    combination_decode_between(unpack, set, mid, high);
}

/* Decode from pack the set of integers encoded by combination_encode(),
   putting the result in set[0..num-1].  max must be the same value used when
   encoding. */
local void combination_decode(const mpz_t pack, unsigned long *set, int num,
                              unsigned long max)
{
    mpz_t unpack;
    unsigned long rem;

    /* handle degenerate cases, returning the value of pack as the only element
       for num == 1 */
    if (num < 1)
        return;
    if (num < 2) {
            /* set[0] = (unsigned long)pack */
        set[0] = mpz_get_ui(pack);
        return;
    }

    /* extract set[0] as the remainder after dividing pack by the number of
       possible set[0] values, set unpack to the quotient */
    mpz_init(unpack);
        /* div = max - num + 2, set[0] = pack % div, unpack = pack / div */
    set[0] = mpz_fdiv_q_ui(unpack, pack, max - num + 2);

    /* extract the last member as the remainder after dividing by the number
       of possible values, taking into account the first member -- update
       unpack with the quotient */
        /* rem = unpack % (max - set[0]), unpack /= (max - set[0]) */
    rem = mpz_fdiv_q_ui(unpack, unpack, max - set[0]);
    set[num - 1] = set[0] + 1 + rem;

    /* decode the members between 0 and num - 1 */
    combination_decode_between(unpack, set, 0, num - 1);
    mpz_clear(unpack);
}
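
To see the packing scheme at a scale that can be checked by hand, here is a cut-down sketch of the same idea with a uint64_t in place of mpz_t. It is not part of the code above: the small_* names are made up for this illustration, it requires num >= 2, and it is only valid while the product of all the radices fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the members strictly between positions low and high. */
static void small_encode_between(uint64_t *pack, uint64_t *base,
                                 const unsigned *set, int low, int high)
{
    int mid = (low + high) >> 1;
    if (mid == low)
        return;
    *pack += *base * (set[mid] - set[low] - 1);     /* digit for set[mid] */
    *base *= set[high] - set[low] - 1;              /* its number of choices */
    small_encode_between(pack, base, set, low, mid);
    small_encode_between(pack, base, set, mid, high);
}

/* Encode sorted, distinct set[0..num-1] with values in 0..max. */
uint64_t small_encode(const unsigned *set, int num, unsigned max)
{
    assert(num >= 2 && set[num - 1] <= max);
    uint64_t pack = set[0];
    uint64_t base = max - num + 2;                  /* choices for set[0] */
    pack += base * (set[num - 1] - set[0] - 1);
    base *= max - set[0];                           /* choices for the last */
    small_encode_between(&pack, &base, set, 0, num - 1);
    return pack;
}

/* Unpack the members strictly between positions low and high. */
static void small_decode_between(uint64_t *unpack, unsigned *set,
                                 int low, int high)
{
    int mid = (low + high) >> 1;
    if (mid == low)
        return;
    uint64_t div = set[high] - set[low] - 1;
    set[mid] = set[low] + 1 + (unsigned)(*unpack % div);
    *unpack /= div;
    small_decode_between(unpack, set, low, mid);
    small_decode_between(unpack, set, mid, high);
}

/* Inverse of small_encode(); num and max must match the encoder's. */
void small_decode(uint64_t pack, unsigned *set, int num, unsigned max)
{
    assert(num >= 2);
    uint64_t div = max - num + 2;
    set[0] = (unsigned)(pack % div);
    pack /= div;
    div = max - set[0];
    set[num - 1] = set[0] + 1 + (unsigned)(pack % div);
    pack /= div;
    small_decode_between(&pack, set, 0, num - 1);
}
```

For the set {3, 7, 8, 15} with max = 15, small_encode() returns 614: the digits 3, 11, 3 and 0 are weighted by the running bases 1, 13, 156 and 1716. small_decode() peels them off in the same order by taking the remainder and then dividing.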

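There are mpz_* functions for writing the number to a file and reading it back, or for exporting the number to a specified format in memory and importing it back.

Regarding "compression - Compressing a set of large integers", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/12157578/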
