C:使用查找表评估一组有限的小整数值的函数的最快方法？-6ren

C:使用查找表评估一组有限的小整数值的函数的最快方法？

转载作者：太空狗更新时间：2023-10-29 16:01:07

我目前正在做一个项目，我想通过调用 C 来优化 Python 中的一些数值计算。

简而言之，我需要为一个巨大的数组 x 中的每个元素计算 y[i] = f(x[i]) 的值(通常有10^9 个条目或更多)。这里，x[i] 是一个介于 -10 和 10 之间的整数，f 是接受 x[i] 并返回 double 值的函数。我的问题是 f，但它需要很长时间才能以数值稳定的方式进行评估。

为了加快速度，我想将 f(x[i]) 的所有 2*10 + 1 可能值硬编码到常量数组中，例如:

double table_of_values[] = {f(-10), ...., f(10)};

然后使用“查找表”方法评估 f，如下所示:

for (i = 0; i < N; i++) {
    y[i] = table_of_values[x[i] + 11]; //instead of y[i] = f(x[i])
}

由于我不太精通用 C 编写优化代码，所以我想知道:

特别是 - 因为 x 真的很大 - 我想知道在评估循环时是否值得进行二级优化(例如通过预先排序 x ，或者找到一种聪明的方法来处理负指数(除了只执行 [x[i] + 10 + 1])？
假设 x[i] 不在 -10 和 10 之间，而是在 -20 和 20 之间。在这种情况下，我仍然可以使用相同的方法，但需要努力手动编码查找表。有没有办法在代码中动态生成查找表，以便我使用相同的方法并允许 x[i] 属于一个变量范围？

最佳答案

生成这样一个具有动态范围值的表格相当容易。

这是一个简单的单表方法:

#include <malloc.h>

#define VARIABLE_USED(_sym) \
    do { \
        if (1) \
            break; \
        if (!! _sym) \
            break; \
    } while (0)

double *table_of_values;
int table_bias;

// use the smallest of these that can contain the values the x array may have
#if 0
typedef int xval_t;
#endif
#if 0
typedef short xval_t;
#endif
#if 1
typedef char xval_t;
#endif

#define XLEN        (1 << 9)
xval_t *x;

// fslow -- your original function
double
fslow(int i)
{
    return 1;  // whatever
}

// ftablegen -- generate variable table
void
ftablegen(double (*f)(int),int lo,int hi)
{
    int len;

    table_bias = -lo;

    len = hi - lo;
    len += 1;

    // NOTE: you can do free(table_of_values) when no longer needed
    table_of_values = malloc(sizeof(double) * len);

    for (int i = lo;  i <= hi;  ++i)
        table_of_values[i + table_bias] = f(i);
}

// fcached -- retrieve cached table data
double
fcached(int i)
{
    return table_of_values[i + table_bias];
}

// fripper -- access x and table arrays
void
fripper(xval_t *x)
{
    double *tptr;
    int bias;
    double val;

    // ensure these go into registers to prevent needless extra memory fetches
    tptr = table_of_values;
    bias = table_bias;

    for (int i = 0;  i < XLEN;  ++i) {
        val = tptr[x[i] + bias];

        // do stuff with val
        VARIABLE_USED(val);
    }
}

int
main(void)
{

    ftablegen(fslow,-10,10);
    x = malloc(sizeof(xval_t) * XLEN);
    fripper(x);

    return 0;
}

这里有一个稍微复杂的方法，可以生成许多类似的表:

#include <malloc.h>

#define VARIABLE_USED(_sym) \
    do { \
        if (1) \
            break; \
        if (!! _sym) \
            break; \
    } while (0)

// use the smallest of these that can contain the values the x array may have
#if 0
typedef int xval_t;
#endif
#if 1
typedef short xval_t;
#endif
#if 0
typedef char xval_t;
#endif

#define XLEN        (1 << 9)
xval_t *x;

struct table {
    int tbl_lo;                         // lowest index
    int tbl_hi;                         // highest index
    int tbl_bias;                       // bias for index
    double *tbl_data;                   // cached data
};

struct table ftable1;
struct table ftable2;

double
fslow(int i)
{
    return 1;  // whatever
}

double
f2(int i)
{
    return 2;  // whatever
}

// ftablegen -- generate variable table
void
ftablegen(double (*f)(int),int lo,int hi,struct table *tbl)
{
    int len;

    tbl->tbl_bias = -lo;

    len = hi - lo;
    len += 1;

    // NOTE: you can do free tbl_data when no longer needed
    tbl->tbl_data = malloc(sizeof(double) * len);

    for (int i = lo;  i <= hi;  ++i)
        tbl->tbl_data[i + tbl->tbl_bias] = fslow(i);
}

// fcached -- retrieve cached table data
double
fcached(struct table *tbl,int i)
{
    return tbl->tbl_data[i + tbl->tbl_bias];
}

// fripper -- access x and table arrays
void
fripper(xval_t *x,struct table *tbl)
{
    double *tptr;
    int bias;
    double val;

    // ensure these go into registers to prevent needless extra memory fetches
    tptr = tbl->tbl_data;
    bias = tbl->tbl_bias;

    for (int i = 0;  i < XLEN;  ++i) {
        val = tptr[x[i] + bias];

        // do stuff with val
        VARIABLE_USED(val);
    }
}

int
main(void)
{

    x = malloc(sizeof(xval_t) * XLEN);

    // NOTE: we could use 'char' for xval_t ...
    ftablegen(fslow,-37,62,&ftable1);
    fripper(x,&ftable1);

    // ... but, this forces us to use a 'short' for xval_t
    ftablegen(f2,-99,307,&ftable2);

    return 0;
}

注意事项:

fcached可能/应该是 inline速度函数。注意一旦表被计算一次，fcached(x[i])相当快。您提到的索引偏移问题 [由“偏差”解决] 在计算时间上非常小。

同时 x可能是一个大数组，f() 的缓存数组值相当小(例如 -10 到 10)。即使它是(例如)-100 到 100，这仍然是大约 200 个元素。这个小的缓存数组将 [可能] 保留在硬件内存缓存中，因此访问将保持相当快。

因此，排序 x优化查找表的 H/W 缓存性能将几乎没有 [可衡量的] 影响。

x 的访问模式是独立的。如果您访问 x，您将获得最佳性能以线性方式(例如 for (i = 0; i < 999999999; ++i) x[i] )。如果您以半随机方式访问它，它将对 H/W 缓存逻辑及其保持所需/所需 x 的能力造成压力。值“缓存热”

即使是线性访问，因为 x太大了，当你到达终点时，第一个元素已经从 H/W 缓存中被逐出(例如，大多数 CPU 缓存都在几兆字节的数量级)

但是，如果 x仅具有有限范围内的值，将类型从 int x[...] 更改为至 short x[...]甚至 char x[...]将大小缩小 2 倍 [或 4 倍]。而且，可以对性能有可衡量的改进。

更新:我添加了一个 fripper函数显示最快的方式[我知道]访问表和x循环中的数组。我还添加了 typedef名为 xval_t允许 x阵列消耗更少的空间(即具有更好的 H/W 缓存性能)。

更新 #2:

根据您的意见...

fcached [主要] 被编码以说明简单/单一访问。但是，它没有在最后的例子中使用。

多年来，对内联的确切要求有所不同(例如，外部内联)。现在最好使用:static inline .但是，如果使用 c++ ，它可能又是不同的。有整页专门讨论这个。原因是因为在不同的地方编译.c文件，打开或关闭优化时会发生什么。另外，考虑使用 gcc延期。因此，始终强制内联:

__attribute__((__always_inline__)) static inline

fripper是最快的，因为它避免了重新获取全局变量 table_of_values和 table_bias在每次循环迭代中。在 fripper 中，编译器优化器将确保它们保留在寄存器中。查看我的回答:Is accessing statically or dynamically allocated memory faster?至于为什么。

但是，我编码了一个fripper使用 fcached 的变体并且反汇编代码是相同的[并且最优]。所以，我们可以无视那个……或者，我们可以吗？有时，反汇编代码是一种很好的交叉检查，也是唯一确定的方法。创建完全优化的 C 代码时只是一个额外的项目。关于代码生成，可以为编译器提供许多选项，因此有时它只是反复试验。

因为基准测试很重要，所以我加入了时间戳例程(仅供引用，[AFAIK] 底层 clock_gettime 调用是 python 的 time.clock() 的基础)。

所以，这是更新后的版本:

#include <malloc.h>
#include <time.h>

typedef long long s64;

#define SUPER_INLINE \
    __attribute__((__always_inline__)) static inline

#define VARIABLE_USED(_sym) \
    do { \
        if (1) \
            break; \
        if (!! _sym) \
            break; \
    } while (0)

#define TVSEC           1000000000LL    // nanoseconds in a second
#define TVSECF          1e9             // nanoseconds in a second

// tvget -- get high resolution time of day
// RETURNS: absolute nanoseconds
s64
tvget(void)
{
    struct timespec ts;
    s64 nsec;

    clock_gettime(CLOCK_REALTIME,&ts);

    nsec = ts.tv_sec;
    nsec *= TVSEC;
    nsec += ts.tv_nsec;

    return nsec;
)

// tvgetf -- get high resolution time of day
// RETURNS: fractional seconds
double
tvgetf(void)
{
    struct timespec ts;
    double sec;

    clock_gettime(CLOCK_REALTIME,&ts);

    sec = ts.tv_nsec;
    sec /= TVSECF;
    sec += ts.tv_sec;

    return sec;
)

double *table_of_values;
int table_bias;

double *dummyptr;

// use the smallest of these that can contain the values the x array may have
#if 0
typedef int xval_t;
#endif
#if 0
typedef short xval_t;
#endif
#if 1
typedef char xval_t;
#endif

#define XLEN        (1 << 9)
xval_t *x;

// fslow -- your original function
double
fslow(int i)
{
    return 1;  // whatever
}

// ftablegen -- generate variable table
void
ftablegen(double (*f)(int),int lo,int hi)
{
    int len;

    table_bias = -lo;

    len = hi - lo;
    len += 1;

    // NOTE: you can do free(table_of_values) when no longer needed
    table_of_values = malloc(sizeof(double) * len);

    for (int i = lo;  i <= hi;  ++i)
        table_of_values[i + table_bias] = f(i);
}

// fcached -- retrieve cached table data
SUPER_INLINE double
fcached(int i)
{
    return table_of_values[i + table_bias];
}

// fripper_fcached -- access x and table arrays
void
fripper_fcached(xval_t *x)
{
    double val;
    double *dptr;

    dptr = dummyptr;

    for (int i = 0;  i < XLEN;  ++i) {
        val = fcached(x[i]);

        // do stuff with val
        dptr[i] = val;
    }
}

// fripper -- access x and table arrays
void
fripper(xval_t *x)
{
    double *tptr;
    int bias;
    double val;
    double *dptr;

    // ensure these go into registers to prevent needless extra memory fetches
    tptr = table_of_values;
    bias = table_bias;
    dptr = dummyptr;

    for (int i = 0;  i < XLEN;  ++i) {
        val = tptr[x[i] + bias];

        // do stuff with val
        dptr[i] = val;
    }
}

int
main(void)
{

    ftablegen(fslow,-10,10);
    x = malloc(sizeof(xval_t) * XLEN);
    dummyptr = malloc(sizeof(double) * XLEN);
    fripper(x);
    fripper_fcached(x);

    return 0;
}

关于C:使用查找表评估一组有限的小整数值的函数的最快方法？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/35326332/

文章推荐： c - C 中的字节顺序宏

文章推荐： HTML 属性 : quoted or unquoted?

文章推荐： android:layout_alignBottom 不适用于 API 18 及以下版本？

java - 有限 for 循环迭代的复杂性
我有一个关于复杂性的简单问题。我在 Java 中有这段代码: pairs是 HashMap包含 Integer作为键，它的频率为 Collection作为一个值。所以: pairs = new Has
dictionary - 有限 map 示例
对于我的应用程序，我需要在 Coq 中使用和推理有限映射。谷歌搜索我发现 FMapAVL 似乎非常适合我的需求。问题是文档很少，我还没有弄清楚我应该如何使用它。作为一个简单的例子，考虑以下使用对列表
mysql - 从关联表的单个(有限)行返回多列
我有一个主表tblAssetMaster A和一个移动表tblMovement M。我想提取所有 Assets 及其当前位置，因此需要获取每个 Assets 的最新移动条目。字段 A: Asset
html - 有限 margin 汽车？
我想让我的网站内容居中，但仅限于网页的特定宽度。所以当它超过 500px 时，我希望内容被修复，无法进一步拉伸(stretch)。无论如何都要这样做，还是我最好把所有东西都修好？希望有意义的是添加一些
javascript - 主干模型.destroy() 有限
我正在尝试批量删除 Backbone 模型的集合，如下所示...... collection.each(function(model, i){ model.destroy(); }); 我发现当每
performance - 模拟处理器的(有限)资源，包括时钟速度
我想要一个软件环境，在其中我可以在具有特定资源的硬件上测试我的软件的速度。例如，当我的主机硬件是具有 12GB RAM 的 3GHz 四核 amd64 时，该程序在具有 24 Mb RAM 的 800
java - Eclipse 中的 BigInteger 有限
在 Eclipse 中，我得到了 BigInteger.valueOf(2).pow(31093) 的值，但没有得到 BigInteger.valueOf(2).pow(31094) 的值(它是空的)
algorithm - 当循环中出现异常时，在继续下一次迭代之前尝试该操作 3(有限)次的最佳方法是什么？
关闭。这个问题不符合Stack Overflow guidelines .它目前不接受答案。要求提供代码的问题必须表现出对所解决问题的最低限度理解。包括尝试过的解决方案、为什么它们不起作用，以及预
php - HTML 和 PHP 的帖子值(value)有限
我想将 2 个表从本地 sql server 2000 上传到托管的 mysql。第一个表有 17 列和 680 行，其他 10 列和 8071 行。我首先使用 xampp mysql 尝试离线，它
javascript - Html:是否可以使用 Javascript 实现静态 html 的无限(有限？)滚动？
我在 S3 中自动生成并保存了静态 html 文件。有时文件大小达到 2mb。是否可以使用javascript来获取html文件的一部分，显示它，当用户到达页面底部时，获取下一部分等等？最佳答案 X

太空狗

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

C:使用查找表评估一组有限的小整数值的函数的最快方法？