algorithm - Compressing a vector of positive integers (int32) that have a specific order

Reposted by: 行者123  Updated: 2023-12-04 04:27:06

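I am trying to compress long vectors (their sizes range from 1 to 100 million elements). The vectors hold positive integers whose values range from 0 to 1 or 100 million (depending on the vector size). I therefore use 32-bit integers to accommodate the large numbers, but that consumes too much storage.
The vectors have the following characteristics: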

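1. All values are positive integers. Their range grows as the vector size grows.

2. The values are increasing, but smaller numbers do intervene frequently (see the figure below).

3. No value before a given index is larger than that index (indices start at zero). For example, none of the values that occur before index 6 is larger than 6. However, smaller values may repeat after that index. This holds for the entire array.

4. I usually deal with very long arrays. Hence, once the array passes 1 million elements, the upcoming numbers are mostly large numbers mixed with previously seen, recurring numbers. Smaller numbers usually recur more often than larger ones. New, larger numbers are added as you move through the array.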
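
The third characteristic is the one a compressor can rely on, so it may be worth checking it on real data first. The following is a minimal sketch, assuming the property is read as "the value at zero-based index i is at most i"; the function name `satisfies_index_bound` is made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if every value is non-negative and no larger than its
// zero-based index, i.e. data[i] <= i for all i.
// (Assumed reading of the third characteristic; the name is illustrative.)
bool satisfies_index_bound(const std::vector<int32_t>& data) {
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] < 0 || static_cast<std::size_t>(data[i]) > i)
            return false;
    }
    return true;
}
```

If this returns false for a vector, any scheme that sizes each value by its index would lose information on that vector.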

Here is a sample of the values in the array: {initial padding..., 0, 1, 2, 3, 4, 5, 6, 4, 7, 4, 8, 9, 1, 10, ... later ..., 1110, 11, 1597, 1545, 1392, 326, 1371, 1788, 541,...}
Here is a plot of part of the vector:

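What do I want?
Because I am using 32-bit integers, a lot of memory is wasted: smaller numbers that could be represented in fewer than 32 bits are repeated as well. I want to compress this vector as much as possible to save memory (ideally by a factor of 3, because only a reduction of that amount or more will meet our needs!). What is the best compression algorithm to achieve that? Or is there a way to exploit the characteristics above to reversibly convert the numbers in the array to 8-bit integers?
Things I have tried or considered: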

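Delta encoding: this does not work here because the vector is not always increasing.

Huffman coding: this does not seem to help here, since the range of unique numbers in the array is quite large, so the encoding table would be a large overhead.

Variable-length integer encoding, i.e. using 8-bit integers for the smaller numbers, 16-bit integers for the larger ones, and so on. This reduced the vector size to size*0.7 (not satisfactory, since it does not exploit the specific characteristics described above).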

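I am not quite sure whether the method described at the following link applies to my data: http://ygdes.com/ddj-3r/ddj-3r_compact.html
I do not fully understand the method, but it encouraged me to try something similar, because I think there is some order in the data that can be used to its advantage.
For example, I tried reassigning any number n larger than 255 to n-255 so that I could keep the integers in 8-bit territory, since I know that no number is larger than 255 before that index. However, I cannot tell the reassigned numbers apart from the repeated ones... so this idea does not work unless some further trick is used to reverse the reassignment...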

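Here is a link to the first 24,000 elements of the data for those interested:
data
Any advice or suggestions are greatly appreciated. Thanks a lot in advance.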
Edit 1:
Here is a plot of the data after delta encoding. As you can see, it does not reduce the range!

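Edit 2:
I was hoping to find a pattern in the data that would let me reversibly change the 32-bit vector into a single 8-bit vector, but that seems very unlikely.
I tried decomposing the 32-bit vector into 4 x 8-bit vectors, hoping that the decomposed vectors would lend themselves better to compression.
Below are plots of the 4 vectors. Their range is now 0-255.
What I did was to recursively divide each element of the vector by 255 and store the remainder in another vector. To reconstruct the original array, all I need to do is: ( ( ( (vec4*255) + vec3 )*255 + vec2 ) *255 + vec1...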
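
The decomposition described above can be sketched roughly as follows. This is only an illustration and not the asker's actual code: it assumes base 256 (one plane per byte of the value) rather than the divisor 255 quoted in the text, and the names `BytePlanes`, `split_planes` and `join_planes` are invented here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Four byte planes; planes[0] holds the least significant byte of each value.
using BytePlanes = std::array<std::vector<uint8_t>, 4>;

// Split every 32-bit value into its four bytes, one byte per plane.
BytePlanes split_planes(const std::vector<uint32_t>& data) {
    BytePlanes planes;
    for (auto& p : planes) p.reserve(data.size());
    for (uint32_t v : data)
        for (int b = 0; b < 4; ++b)
            planes[b].push_back(static_cast<uint8_t>(v >> (8 * b)));
    return planes;
}

// Rebuild the original values from the four planes.
std::vector<uint32_t> join_planes(const BytePlanes& planes) {
    std::vector<uint32_t> data(planes[0].size());
    for (std::size_t i = 0; i < data.size(); ++i)
        for (int b = 0; b < 4; ++b)
            data[i] |= static_cast<uint32_t>(planes[b][i]) << (8 * b);
    return data;
}
```

Splitting by itself saves nothing; any gain comes from the higher planes being mostly constant and therefore easy for a later compression stage to shrink.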

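As you can see, the last vector is all zeros for the length of data currently shown. In fact, it should stay all zeros up to the 2^24th element. That would be a 25% reduction if my total vector length were less than 16 million elements, but since I am dealing with much longer vectors, the impact is much smaller.
More importantly, the third vector also seems to have some compressible features, as its values increase by 1 after every 65,535 steps.
It now seems that I can benefit from Huffman coding or variable-bit encoding, as suggested. Any suggestions that allow me to compress this data as much as possible are greatly appreciated.
I have attached a bigger sample of the data here, if anyone is interested: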
https://drive.google.com/file/d/10wO3-1j3NkQbaKTcr0nl55bOH9P-G1Uu/view?usp=sharing
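Edit 3:
I am really thankful for all the answers given. I have learned a lot from them. For those interested in tinkering with a larger dataset, the following link has 11 million elements of a similar dataset (zipped, 33 MB):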
https://drive.google.com/file/d/1Aohfu6II6OdN-CqnDll7DeHPgEDLMPjP/view
After unzipping the data, you can use the following C++ snippet to read it into a vector:

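    // Requires <algorithm>, <cstdint>, <fstream>, <iterator> and <vector>.
    // A forward slash is used in the path because "\c" is not a valid escape.
    const char* path = "path_to/compression_int32.txt";
    std::vector<int32_t> newVector{};
    std::ifstream ifs(path, std::ios::in | std::ifstream::binary);
    // istream_iterator parses whitespace-separated decimal text values.
    std::istream_iterator<int32_t> iter{ ifs };
    std::istream_iterator<int32_t> end{};
    std::copy(iter, end, std::back_inserter(newVector));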

Best answer

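It is easy to get better than a factor of two compression on your example data by using property 3, where I have taken property 3 to mean that every value must be less than its index, with the indices starting at 1. Simply use ceiling(log2(i)) bits to store the number at index i (where i starts at 1). For your first example with 24,977 values, that compresses it to 43% of the size of the vector using 32-bit integers.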
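
As a rough illustration of this per-index bit width, here is a minimal bit-packing sketch. It is not the answer's code: the names `bits_for_position`, `pack_by_index` and `unpack_by_index` are hypothetical, the input is assumed to satisfy data[i] <= i for zero-based i, and the number of values is assumed to be stored separately.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bits needed for a value in 0..pos-1, i.e. ceiling(log2(pos)), for pos >= 1.
int bits_for_position(uint64_t pos) {
    int b = 0;
    while ((uint64_t{1} << b) < pos) ++b;
    return b;
}

// Pack data[i] (assumed <= i, zero-based) into ceiling(log2(i + 1)) bits
// each, least significant bit first.
std::vector<uint8_t> pack_by_index(const std::vector<uint32_t>& data) {
    std::vector<uint8_t> out;
    uint64_t acc = 0;   // pending bits, oldest bits in the lowest positions
    int have = 0;       // number of pending bits in acc
    for (std::size_t i = 0; i < data.size(); ++i) {
        acc |= static_cast<uint64_t>(data[i]) << have;
        have += bits_for_position(i + 1);
        while (have >= 8) {             // flush complete bytes
            out.push_back(static_cast<uint8_t>(acc));
            acc >>= 8;
            have -= 8;
        }
    }
    if (have > 0) out.push_back(static_cast<uint8_t>(acc));
    return out;
}

// Inverse of pack_by_index; n is the number of values that were packed.
std::vector<uint32_t> unpack_by_index(const std::vector<uint8_t>& in, std::size_t n) {
    std::vector<uint32_t> data;
    data.reserve(n);
    uint64_t acc = 0;
    int have = 0;
    std::size_t off = 0;
    for (std::size_t i = 0; i < n; ++i) {
        int need = bits_for_position(i + 1);
        while (have < need) {           // refill one byte at a time
            acc |= static_cast<uint64_t>(in.at(off++)) << have;
            have += 8;
        }
        data.push_back(static_cast<uint32_t>(acc & ((uint64_t{1} << need) - 1)));
        acc >>= need;
        have -= need;
    }
    return data;
}
```

For the 14 sample values 0, 1, 2, 3, 4, 5, 6, 4, 7, 4, 8, 9, 1, 10 this uses 41 bits, which rounds up to 6 bytes instead of the 56 bytes taken by 32-bit integers.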
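The number of bits required depends only on the length of the vector, n. The number of bits is:
1 - 2^ceiling(log2(n)) + n ceiling(log2(n))
As noted by Falk Hüffner, a simpler approach would be to use a fixed number of bits, ceiling(log2(n)), for all values. The variable number of bits will always be less than that, but not much less for large n.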
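
The two bit counts can be compared with a few lines of code. This is a sketch with invented function names; `variable_bits_sum` exists only to cross-check the closed form against a direct summation of ceiling(log2(i)) over i = 1..n.

```cpp
#include <cstdint>

// ceiling(log2(n)) for n >= 1 (returns 0 for n <= 1).
int ceil_log2(uint64_t n) {
    int b = 0;
    while ((uint64_t{1} << b) < n) ++b;
    return b;
}

// Total bits when position i (1-based) gets ceiling(log2(i)) bits,
// using the closed form 1 - 2^c + n*c with c = ceiling(log2(n)).
uint64_t variable_bits(uint64_t n) {
    int c = ceil_log2(n);
    return 1 + n * c - (uint64_t{1} << c);
}

// The same total by direct summation, as a cross-check of the closed form.
uint64_t variable_bits_sum(uint64_t n) {
    uint64_t total = 0;
    for (uint64_t i = 1; i <= n; ++i) total += ceil_log2(i);
    return total;
}

// Total bits when every value gets the same width, ceiling(log2(n)).
uint64_t fixed_bits(uint64_t n) {
    return n * ceil_log2(n);
}
```

For n = 24,977 this gives 341,888 bits for the variable widths and 374,655 bits for the fixed width, against 799,264 bits for 32-bit integers, i.e. about 43% and 47% of the original size.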
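If it is common to have a run of zeros at the start, then compress those with a count. There are only a handful of runs of two or three numbers in the remainder, so run-length encoding won't help except for that initial run of zeros.
Another 2% or so (for large sets) could be shaved off using an arithmetic coding approach, considering each value at index k (indices starting at zero) to be a base k+1 digit of a very large integer. That would take ceiling(log2(n!)) bits.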
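
To make the "base k+1 digit" idea concrete, here is a small sketch limited to vectors of at most 20 values, so that the whole integer fits in 64 bits (20! < 2^64). The function names are hypothetical, and `arithmetic_bits` relies on floating point, so it should be read as an estimate for large n.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode up to 20 values, where data[k] <= k (zero-based), as one integer
// whose digit k has base k + 1. The weight of digit k is k!.
uint64_t encode_mixed_radix(const std::vector<uint32_t>& data) {
    uint64_t code = 0, weight = 1;
    for (std::size_t k = 0; k < data.size(); ++k) {
        code += weight * data[k];
        weight *= k + 1;
    }
    return code;
}

// Recover n values by peeling off the digits, lowest base first.
std::vector<uint32_t> decode_mixed_radix(uint64_t code, std::size_t n) {
    std::vector<uint32_t> data;
    for (std::size_t k = 0; k < n; ++k) {
        data.push_back(static_cast<uint32_t>(code % (k + 1)));
        code /= k + 1;
    }
    return data;
}

// Bits for the whole-vector code, ceiling(log2(n!)). std::lgamma(n + 1)
// is ln(n!), so this avoids computing the factorial itself.
double arithmetic_bits(double n) {
    return std::ceil(std::lgamma(n + 1.0) / std::log(2.0));
}
```

For example, the five values 0, 1, 2, 3, 4 encode to 119 (which is 5! - 1, the largest possible code for five values) and need 7 bits, compared with 8 bits when each value gets its own ceiling(log2(i)) bits.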
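Here is a plot of the compression ratios of the arithmetic coding, variable bits per sample coding, and fixed bits per sample coding, all ratioed to a representation with 32 bits for every sample (the sequence length is on a log scale):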

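The arithmetic approach requires multiplication and division on integers the length of the compressed data, which is extremely slow for large vectors. The code below limits the size of the integers to 64 bits, at some cost to the compression ratio, in exchange for being very fast. This code gives compression ratios about 0.2% to 0.7% higher than the arithmetic coding in the plot above, well below variable bits. The data vector must have the property that each value is non-negative and that each value is less than its position (positions starting at one). The compression effectiveness depends only on that property, plus a small reduction if there is an initial run of zeros. There appears to be a bit more redundancy in the provided examples that this compression approach does not exploit.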

#include <cstddef>
#include <cstdint>
#include <vector>
#include <cmath>

// Append val, as a variable-length integer, to comp. val must be non-negative.
template <typename T>
void write_varint(T val, std::vector<uint8_t>& comp) {
    while (val > 0x7f) {
        comp.push_back(val & 0x7f);
        val >>= 7;
    }
    comp.push_back(val | 0x80);
}

// Return the variable-length integer at offset off in comp, updating off to
// point after the integer.
template <typename T>
T read_varint(std::vector<uint8_t> const& comp, size_t& off) {
    T val = 0, next;
    int shift = 0;
    for (;;) {
        next = comp.at(off++);
        if (next > 0x7f)
            break;
        val |= next << shift;
        shift += 7;
    }
    val |= (next & 0x7f) << shift;
    return val;
}

// Given the starting index i >= 1, find the optimal number of values to code
// into 64 bits or less, or up through index n-1, whichever comes first.
// Optimal is defined as the least amount of entropy lost by representing the
// group in an integral number of bits, divided by the number of bits. Return
// the optimal number of values in num, and the number of bits needed to hold
// an integer representing that group in len.
static void group_ar64(size_t i, size_t n, size_t& num, int& len) {
    // Analyze all of the permitted groups, starting at index i.
    double min = 1.;
    uint64_t k = 1;                 // integer range is 0..k-1
    auto j = i + 1;
    do {
        k *= j;
        auto e = log2(k);           // entropy of k possible integers
        int b = ceil(e);            // number of bits to hold 0..k-1
        auto loss = (b - e) / b;    // unused entropy per bit
        if (loss < min) {
            num = j - i;            // best number of values so far
            len = b;                // bit length for that number
            if (loss == 0.)
                break;              // not going to get any better
            min = loss;
        }
    } while (j < n && k <= (uint64_t)-1 / ++j);
}

// Compress the data arithmetically coded as an incrementing base integer, but
// with a 64-bit limit on each integer. This puts values into groups that each
// fit in 64 bits, with the least amount of wasted entropy. Also compress the
// initial run of zeros into a count.
template <typename T>
std::vector<uint8_t> compress_ar64(std::vector<T> const& data) {
    // Resulting compressed data vector.
    std::vector<uint8_t> comp;

    // Start with number of values to make the stream self-terminating.
    write_varint(data.size(), comp);
    if (data.size() == 0)
        return comp;

    // Run-length code the initial run of zeros. Write the number of contiguous
    // zeros after the first one.
    size_t i = 1;
    while (i < data.size() && data[i] == 0)
        i++;
    write_varint(i - 1, comp);

    // Compress the data into variable-base integers starting at index i, where
    // each integer fits into 64 bits.
    unsigned buf = 0;       // output bit buffer
    int bits = 0;           // number of bits in buf (0..7)
    while (i < data.size()) {
        // Find the optimal number of values to code, starting at index i.
        size_t num;  int len;
        group_ar64(i, data.size(), num, len);

        // Code num values.
        uint64_t code = 0;
        size_t k = 1;
        do {
            code += k * data[i++];
            k *= i;
        } while (--num);

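        // Write code using len bits, least-significant bits first. Taking at
        // most 8 - bits bits per step also handles a group that is shorter
        // than the space left in the current byte.
        while (len > 0) {
            int take = len < 8 - bits ? len : 8 - bits;
            buf |= (unsigned)(code & ((1u << take) - 1)) << bits;
            code >>= take;
            len -= take;
            bits += take;
            if (bits == 8) {
                comp.push_back(buf);
                buf = 0;
                bits = 0;
            }
        }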
    }
    if (bits)
        comp.push_back(buf);
    return comp;
}

// Decompress the result of compress_ar64(), returning the original values.
// Start decompression at offset off in comp. When done, off is updated to
// point just after the compressed data.
template <typename T>
std::vector<T> expand_ar64(std::vector<uint8_t> const& comp, size_t& off) {
    // Will contain the uncompressed data to return.
    std::vector<T> data;

    // Get the number of values.
    auto vals = read_varint<size_t>(comp, off);
    if (vals == 0)
        return data;

    // Get the number of zeros after the first one, and write all of them.
    auto run = read_varint<size_t>(comp, off) + 1;
    auto i = run;
    do {
        data.push_back(0);
    } while (--run);

    // Extract the values from the compressed data starting at index i.
    unsigned buf = 0;       // input bit buffer
    int bits = 0;           // number of bits in buf (0..7)
    while (i < vals) {
        // Find the optimal number of values to code, starting at index i. This
        // simply repeats the same calculation that was done when compressing.
        size_t num;  int len;
        group_ar64(i, vals, num, len);

        // Read len bits into code, least-significant bits first, fetching a
        // new byte only when the bit buffer is empty. This also handles a
        // group that fits entirely in the bits already buffered.
        uint64_t code = 0;
        int got = 0;                    // number of bits placed in code so far
        while (got < len) {
            if (bits == 0) {
                buf = comp.at(off++);
                bits = 8;
            }
            int take = len - got < bits ? len - got : bits;
            code |= (uint64_t)(buf & ((1u << take) - 1)) << got;
            buf >>= take;
            bits -= take;
            got += take;
        }

        // Extract num values from code.
        do {
            i++;
            data.push_back(code % i);
            code /= i;
        } while (--num);
    }

    // Return the uncompressed data.
    return data;
}

Regarding algorithm - Compressing a vector of positive integers (int32) that have a specific order, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/67943077/
