c - `__uint128_t` 上最有效的 popcount ？-6ren

c - `__uint128_t` 上最有效的 popcount ？

转载作者：太空狗更新时间：2023-10-29 17:21:55

34

4

我需要以最有效(最快)的方式弹出一个 128 位大小的无符号变量。

操作系统:Linux/Debian 9

编译器:GCC 8

CPU:英特尔 i7-5775C

虽然如果解决方案更便携，那就更好了。

首先，GCC中有两种类型，分别是 __uint128_t和 unsigned __int128 .我猜他们最终是一样的，并且认为没有理由写丑陋的 unsigned __int128东西，所以虽然它应该是新类型，但我更喜欢第一种，它更类似于标准 uint64_t .此外，英特尔还有 __uint128_t这是使用它的另一个原因(便携性)。

我编写了以下代码:

#include <nmmintrin.h>
#include <stdint.h>

static inline   uint_fast8_t    popcnt_u128 (__uint128_t n)
{
    const uint64_t      n_hi    = n >> 64;
    const uint64_t      n_lo    = n;
    const uint_fast8_t  cnt_hi  = _mm_popcnt_u64(n_hi);
    const uint_fast8_t  cnt_lo  = _mm_popcnt_u64(n_lo);
    const uint_fast8_t  cnt     = cnt_hi + cnt_lo;

    return  cnt;
}

这是绝对最快的选择吗？

编辑:

我想到了另一个选择，它可能(或不是)更快:

#include <nmmintrin.h>
#include <stdint.h>

union   Uint128 {
    __uint128_t uu128;
    uint64_t    uu64[2];
};

static inline   uint_fast8_t    popcnt_u128 (__uint128_t n)
{
    const union Uint128 n_u     = {.uu128   = n};
    const uint_fast8_t  cnt_a   = _mm_popcnt_u64(n_u.uu64[0]);
    const uint_fast8_t  cnt_b   = _mm_popcnt_u64(n_u.uu64[1]);
    const uint_fast8_t  cnt     = cnt_a + cnt_b;

    return  cnt;
}

这样，虽然我不知道它是否合法(是吗？ (编辑:它是:Type punning between integer and array using `union`?))，我会避免这种转变。

最佳答案

使用 GCC 和 clang，您的两个函数都编译为相同的 asm 如果删除 static inline ，并且大概会等效地内联。

我建议使用 unsigned , 因为 sizeof(uint_fast8_t) = 1 在 x86-64 Linux 上。 _fast类型回避了“为了什么目的而快速”的问题； fast8 适用于阵列中的紧凑存储，fast32是一种 64 位类型，它可能避免重做指针数学的符号或零扩展，但会浪费数组空间。

clang 知道两个 popcnt 结果的总和适合一个 8 位整数而不会溢出，因此即使将结果求和为 unsigned，它也可以优化零扩展。计数器，但 gcc 没有。 (例如，将返回类型更改为 unsigned，您将获得额外的 movzx eax, dil 指令。)硬件 popcnt指令产生的结果正确地零扩展到 64 位，但分配给 uint_fast8_t又名 uint8_t明确要求编译器将结果截断为 8 位。

x86-64 System V ABI 允许 args 和返回值中的高垃圾，因此当返回类型很窄时，函数的独立版本可以允许进位到 EAX 的高位。

I would avoid the shift.

移位仅存在于 C 源代码中 .在 asm 中，高/低半部分将存储在单独的 64 位寄存器或单独的内存源操作数中。

来自 the Godbolt compiler explorer

# gcc8.3 -O3 -march=haswell  for the union and the shift version
popcnt_u128:
    xor     eax, eax    # break popcnt's false dependency on Intel CPUs
    popcnt  rsi, rsi    # _mm_popcnt_u64(n_hi);
    popcnt  rax, rdi    # popcnt(lo)
    add     eax, esi        # clang uses add al,cl  and doesn't avoid false deps except in a loop
    ret                 # return value in AL (low 8 bits of EAX)

GCC 可以通过在适当的位置执行两个 popcnt 并使用 lea eax, [rdi + rsi] 来避免异或归零。 .但是你说了一些关于数组的事情，所以如果数据来自内存，那么 GCC 通常会 mov-load 然后 popcnt 到位以避免错误的依赖。 ( Why does breaking the "output dependency" of LZCNT matter? ) 或者实际上，它将对目标进行异或零处理，然后使用内存源 popcnt，这可能会稍微小一些代码大小。

I don't trust __builtin_popcountll because it uses long long instead of uint64_t. I think it is insane to create a function that deals with bits and uses a type that isn't of fixed width. I don't know what GCC people were thinking about.

它实际上使用 unsigned long long , 未签名 long long ;那太疯狂了。
unsigned long long至少是 64 位，并且 uint64_t要求正好是 64 位。 (实际上，仅存在于类型为 64 位且无填充的 C 实现中；对它的支持是可选的)。我不确定 GNU C 是否支持任何目标，其中 unsigned long long不是 64 位，或者哪里 uint64_t不可用。甚至 int64_t ，这也需要是 2 的补码。 (如果 GCC 支持任何非 2 的补码目标，则为 IDK。)

您可以将输入转换为 uint64_t以确保没有设置更高的位。来自 uint64_t 的隐式转换至 unsigned long long即使在 ULL 的平台上，也不会设置任何额外的位比 64 位宽。

例如 __builtin_popcountll( (uint64_t)n );将始终安全地计算 n 的低 64 位，不考虑unsigned long long的宽度.

I'm using a very big static array. Do I have to care about cache, or does GCC handle that for me? I thought that was only a problem with malloc and that stuff. GCC knows the array at compile time, so it can do that better than me.

GCC 将(几乎？)永远不会重新安排您的循环以更改内存访问模式。静态数组与 malloc 没有本质区别编辑内存;他们不会免费在缓存中保持热度。见 What Every Programmer Should Know About Memory?了解更多。

但是，如果您只是在内存中顺序循环并对整个数组进行 popcount，那么是否使用 __uint128_t 进行操作实际上并不重要。或不。

clang 将自动矢量化 __builtin_popcntll或 _mm_popcnt_u64在带有 AVX2 的阵列上 vpshufb (作为半字节 LUT)，这在包括 Broadwell 在内的 Intel CPU 上都很好。见 Counting 1 bits (population count) on large data using AVX-512 or AVX-2

但不幸的是，将您的包装函数用于 __uint128_t 的数组打败了那个。请参阅 Godbolt 链接中的最后 2 个函数。

关于c - `__uint128_t` 上最有效的 popcount ？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/55008994/

34

4

0

文章推荐： angular - .detectChanges() 在 Angular 测试中不起作用

文章推荐： c - c中将一串32分成4 8个十六进制

文章推荐： Angular `ng update @angular/...` 返回 `401 Unauthorized`

文章推荐： c - 使用不兼容类型调用的没有原型(prototype)的函数

Python:设置位数 (popcount)
我的数据库(oracle 11g)中很少有 blob 被复制，使用 UTL_RAW.BIT_XOR 对 blob 执行 XOR 操作。然后我想计算二进制字符串中设置位的数量，所以写了上面的代码。在一
c++ - 生成具有偶数汉明权重的整数(popcount)c++
我想有效地(通过使用 bit hacks)生成给定数字 k 的所有整数，以便它们具有均匀的汉明权重，而无需显式计算它们的汉明权重。这是按升序还是降序完成对我来说并不重要。如果我可以生成所有具有偶数汉
c++ - 二进制数组的快速 Popcount 指令或汉明距离？
我在 Visual Studio 2010 C++ 上实现我有两个二进制数组。例如， array1[100] = {1,0,1,0,0,1,1, .... } array2[100] = {0,0,
algorithm - 将具有给定 popcount 的位串映射到整数
我有一个返回给定长度 N 和给定 popcount k 的位串的过程，即每个位串恰好有 k 个和N - k 零。我需要计算每个位串返回的频率。目前，我使用关联的 base2 整数来递增 2^N 维数组
c - `__uint128_t` 上最有效的 popcount ？
我需要以最有效(最快)的方式弹出一个 128 位大小的无符号变量。操作系统:Linux/Debian 9 编译器:GCC 8 CPU:英特尔 i7-5775C 虽然如果解决方案更便携，那就更好了。
go - 如何使用汇编优化这个 8 位位置 popcount？
此帖与 Golang assembly implement of _mm_add_epi32 相关，它在两个 [8]int32 中添加成对的元素列表，并返回更新后的第一个。根据 pprof 资料，我
ios - 模拟器和 iOS 设备是否都有 popcount()？
我能找到的最接近的是 Metal Standard Library 中的 T popcount(T x) . 有没有更简单的版本，比如: #include __builtin_popcount();
javascript - 如何在 JavaScript 中对任意长的位序列进行 popcount 或计数
这看起来对于 JavaScript 中最大整数大小内的整数效果很好: function bitCount (n) { var bits = 0 while (n !== 0) { bi
c# - 从 C# 检查 CPU Popcount
有谁知道如何从C#中查看CPU是否支持popcount(人口计数)？我正在尝试将一些国际象棋代码从 C++ 移植到 C#。最佳答案我还没有找到一种简单的方法来检测和使用 C# 中的特殊 CPU
c++ - 在 C++ 中测量 popcount 函数的时间
我感兴趣的是如何将它放入循环中，以便获得 cpu 执行每个不同操作所花费的实时时间 #include #include #include using namespace std; typedef un
java - 计算汉明权重，在 Java 中也称为 popcount？
我不确定如何将其从 C++ 转换为 Java。它是一个计算汉明权重的函数。 /** This is popcount_3() from: * http://en.wikipedia.org/wiki
c++ - C++ NT2 库文档在哪里？它是否具有默认的 popcount 函数？
已关闭。此问题旨在寻求有关书籍、工具、软件库等的建议。不符合Stack Overflow guidelines .它目前不接受答案。我们不允许提问寻求书籍、工具、软件库等的推荐。您可以编辑问题，以
C++ SIMD : accumulate POPCOUNTs of uint64_t elements in an array
我正在尝试使用 SSE 指令为数组中的 uint64_t 整数累积 POPCOUNT。这是我的代码: #include #include #include int main() {
gcc - SSE 4 popcount 为 16 个 8 位值？
我有以下代码，它使用标志 -msse4 使用 GCC 进行编译但问题是弹出计数仅获取转换后的 __m128i 的最后四个 8 位类型。基本上我想要的是计算 __m128i 内的所有 16 个数字类型，
c++ - 在 MSVC 16.6.0 中找不到 std::popcount 标识符
是我做错了什么还是 Microsoft 对 std::popcount 的支持在 Visual Studio 16.6.0 版中被破坏了？我正在使用 Microsoft Visual Studio
c++ - 使用 C++20 的 std::popcount 和 vector 优化是否等同于 popcnt 内在？
C++20 引入了许多新函数，例如 std::popcount ，我使用相同的功能使用 Intel Intrinsic . 我编译了两个选项 - 可以在 Compiler Explorer code
c++ - 使用 C++20 的 std::popcount 和 vector 优化是否等同于 popcnt 内在？
C++20 引入了许多新函数，例如 std::popcount ，我使用相同的功能使用 Intel Intrinsic . 我编译了两个选项 - 可以在 Compiler Explorer code
go - 安装 Golang Oracle 但得到 "error: undefined reference to ' golang.org_x_tools_container_intsets.popcount'”
我尝试使用以下命令在我的 Ubuntu 机器上安装 Oracle: go get golang.org/x/tools/cmd/oracle 但我收到以下错误: #golang.org/x/tools
c++ - 是否可以使用 Wojciech Mula 算法对 __m256i 进行 popcount 并将结果存储在 8 个 32 位字而不是 4 个 64 位字中？
我最近发现 AVX2 没有 __m256i 的 popcount，我发现做类似事情的唯一方法是遵循 Wojciech Mula 算法: __m256i count(__m256i v) { _

首页

博学

6Ren·AI

商城

c - `__uint128_t` 上最有效的 popcount ？