
c++ - Efficiently packing 10-bit data on unaligned byte boundaries

Reposted. Author: 行者123. Last updated: 2023-11-30 03:45:29

I'm trying to do some bit packing on values whose widths are not multiples of a byte. Here is specifically what I'm trying to do.

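I have a 512-bit array (8 64-bit integers) of data. Inside that array is 10-bit data, aligned to 2 bytes. What I need to do is strip those 512 bits down to 320 bits containing only the 10-bit data (5 64-bit integers).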
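To make the layout concrete: each 64-bit word holds four 16-bit slots and only 10 bits of each slot carry data, so 8 words hold 32 samples, and 32 × 10 = 320 bits. Below is a minimal sketch of reading sample k, assuming the payload is the low 10 bits of each slot and slot 0 is the least significant one. The helper name get_sample and the uint64 typedef are illustrative and not part of the question.

```cpp
#include <cstddef>
#include <cstdint>

typedef std::uint64_t uint64; // assumed definition of the question's uint64

// Hypothetical helper: return the k-th 10-bit sample (k in 0..31) from the
// 512-bit input, where each sample sits in the low 10 bits of a 16-bit slot.
inline uint64 get_sample(const uint64 (&in)[8], std::size_t k)
{
    const std::size_t word = k / 4; // four 16-bit slots per 64-bit word
    const std::size_t slot = k % 4; // slot index inside that word
    return (in[word] >> (16 * slot)) & 0x3FF;
}
```

Packing then amounts to writing get_sample(in, 0), get_sample(in, 1), ... back to back, 10 bits at a time, into the output words.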

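I can think of a manual way to do this, where I go over each 2-byte section of the 512-bit array, mask out the 10 bits, OR them together while accounting for the byte boundaries, and build the output 64-bit integers. Something like this: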

void pack512to320bits(uint64 (&array512bits)[8], uint64 (&array320bits)[5])
{
    static const uint64 maskFor10bits = 0x3FF;
    // Each 10-bit field moves from bit 16*k of its input word to bit 10*n of the output.
    array320bits[0] = (array512bits[0] & maskFor10bits) | ((array512bits[0] & (maskFor10bits << 16)) >> 6) |
        ((array512bits[0] & (maskFor10bits << 32)) >> 12) | ((array512bits[0] & (maskFor10bits << 48)) >> 18) |
        ((array512bits[1] & maskFor10bits) << 40) | ((array512bits[1] & (maskFor10bits << 16)) << 34) |
        ((array512bits[1] & (uint64(0xF) << 32)) << 28); // only 4 bits of this field fit
    array320bits[1] = 0;
    array320bits[2] = 0;
    array320bits[3] = 0;
    array320bits[4] = 0;
}

I know this will work, but it seems error prone and doesn't scale easily to larger byte sequences.

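Alternatively, I could go through the input array, strip all the 10-bit values out into a vector, and then concatenate them at the end, again making sure I line up with the byte boundaries. Something like this: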

void pack512to320bits(uint64 (&array512bits)[8], uint64 (&array320bits)[5])
{
    static const uint64 maskFor10bits = 0x3FF;
    // 64-bit elements so that the shifts by 40, 50 and 60 below are well defined
    std::vector<uint64> maskedPixelBytes(8 * 4);

    for (unsigned int qword = 0; qword < 8; ++qword)
    {
        for (unsigned int pixelBytes = 0; pixelBytes < 4; ++pixelBytes)
        {
            // move the 16-bit slot down to bit 0, then keep its low 10 bits
            maskedPixelBytes[qword * 4 + pixelBytes] = (array512bits[qword] >> (16 * pixelBytes)) & maskFor10bits;
        }
    }
    array320bits[0] = maskedPixelBytes[0] | (maskedPixelBytes[1] << 10) | (maskedPixelBytes[2] << 20) | (maskedPixelBytes[3] << 30) |
        (maskedPixelBytes[4] << 40) | (maskedPixelBytes[5] << 50) | (maskedPixelBytes[6] << 60);
    array320bits[1] = (maskedPixelBytes[6] >> 4) | (maskedPixelBytes[7] << 6) ...

    array320bits[2] = 0;
    array320bits[3] = 0;
    array320bits[4] = 0;
}

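This way is easier to debug and read, but it is inefficient and, again, doesn't scale to larger byte sequences. I'm wondering if there is an easier, more algorithmic way to do this kind of bit packing.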

Best answer

It is possible to do what you want, but it depends on certain conditions and on what you consider efficient.

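First, if the two arrays will always be one 512-bit and one 320-bit array, that is, if the arrays passed are always uint64 (&array512bits)[8] and uint64 (&array320bits)[5], then hard-coding the padding is actually orders of magnitude more efficient.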

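If you want to handle larger byte sequences, you could write an algorithm that takes the padding into account and shifts the bits accordingly, then iterate over the uint64 values of the larger bit array. However, this approach introduces branches into the generated assembly (e.g. if (total_shifted < bit_size), etc.) that add computation time. Even with optimization, the generated assembly will still be more complex than doing the shifts by hand. In addition, the code has to account for the size of each array to make sure the two fit each other properly, which adds further computation time (or general code complexity).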
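As a rough illustration of what such a general algorithm can look like, here is a minimal sketch built around a 64-bit accumulator. It is not the loop version shown later in this answer; the function name pack10_generic, the pointer-plus-length interface and the uint64 typedef are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>

typedef std::uint64_t uint64; // assumed definition of uint64

// Sketch: pack the low 10 bits of every 16-bit slot of `in` (n_in words)
// into `out` (n_out words). Returns false if the sizes do not fit.
inline bool pack10_generic(const uint64* in, std::size_t n_in,
                           uint64* out, std::size_t n_out)
{
    if (n_in * 4 * 10 != n_out * 64) { return false; } // 4 samples per input word
    uint64 acc = 0;       // bits collected for the current output word
    unsigned filled = 0;  // number of valid bits in acc, always below 64
    std::size_t o = 0;
    for (std::size_t w = 0; w < n_in; ++w) {
        for (unsigned slot = 0; slot < 4; ++slot) {
            const uint64 sample = (in[w] >> (16 * slot)) & 0x3FF;
            acc |= sample << filled;              // bits that do not fit are dropped here
            if (filled + 10 >= 64) {              // the output word is complete
                out[o++] = acc;
                const unsigned used = 64 - filled; // bits of sample already stored (1..10)
                acc = sample >> used;              // carry the remaining bits over
                filled = 10 - used;
            } else {
                filled += 10;
            }
        }
    }
    return true;
}
```

The single branch per sample and the size check are exactly the kind of overhead described above, which is why the hand-written shifts tend to come out ahead for a fixed 8-to-5 word layout.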

For example, consider this manual shifting code:

static void pack512to320_manual(uint64 (&a512)[8], uint64 (&a320)[5])
{
a320[0] = (
(a512[0] & 0x00000000000003FF) | // 10 -> 10
((a512[0] & 0x0000000003FF0000) >> 6) | // 10 -> 20
((a512[0] & 0x000003FF00000000) >> 12) | // 10 -> 30
((a512[0] & 0x03FF000000000000) >> 18) | // 10 -> 40
((a512[1] & 0x00000000000003FF) << 40) | // 10 -> 50
((a512[1] & 0x0000000003FF0000) << 34) | // 10 -> 60
((a512[1] & 0x0000000F00000000) << 28)); // 4 -> 64

a320[1] = (
((a512[1] & 0x000003F000000000) >> 36) | // 6 -> 6
((a512[1] & 0x03FF000000000000) >> 42) | // 10 -> 16
((a512[2] & 0x00000000000003FF) << 16) | // 10 -> 26
((a512[2] & 0x0000000003FF0000) << 10) | // 10 -> 36
((a512[2] & 0x000003FF00000000) << 4) | // 10 -> 46
((a512[2] & 0x03FF000000000000) >> 2) | // 10 -> 56
((a512[3] & 0x00000000000000FF) << 56)); // 8 -> 64

a320[2] = (
((a512[3] & 0x0000000000000300) >> 8) | // 2 -> 2
((a512[3] & 0x0000000003FF0000) >> 14) | // 10 -> 12
((a512[3] & 0x000003FF00000000) >> 20) | // 10 -> 22
((a512[3] & 0x03FF000000000000) >> 26) | // 10 -> 32
((a512[4] & 0x00000000000003FF) << 32) | // 10 -> 42
((a512[4] & 0x0000000003FF0000) << 26) | // 10 -> 52
((a512[4] & 0x000003FF00000000) << 20) | // 10 -> 62
((a512[4] & 0x0003000000000000) << 14)); // 2 -> 64

a320[3] = (
((a512[4] & 0x03FC000000000000) >> 50) | // 8 -> 8
((a512[5] & 0x00000000000003FF) << 8) | // 10 -> 18
((a512[5] & 0x0000000003FF0000) << 2) | // 10 -> 28
((a512[5] & 0x000003FF00000000) >> 4) | // 10 -> 38
((a512[5] & 0x03FF000000000000) >> 10) | // 10 -> 48
((a512[6] & 0x00000000000003FF) << 48) | // 10 -> 58
((a512[6] & 0x00000000003F0000) << 42)); // 6 -> 64

a320[4] = (
((a512[6] & 0x0000000003C00000) >> 22) | // 4 -> 4
((a512[6] & 0x000003FF00000000) >> 28) | // 10 -> 14
((a512[6] & 0x03FF000000000000) >> 34) | // 10 -> 24
((a512[7] & 0x00000000000003FF) << 24) | // 10 -> 34
((a512[7] & 0x0000000003FF0000) << 18) | // 10 -> 44
((a512[7] & 0x000003FF00000000) << 12) | // 10 -> 54
((a512[7] & 0x03FF000000000000) << 6)); // 10 -> 64
}

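This code only accepts uint64 arrays whose sizes fit one another with the 10-bit boundaries taken into account, and shifts accordingly so that the 512-bit array is packed into the 320-bit array. Doing something like uint64* a512p = a512; pack512to320_manual(a512p, a320); will therefore fail at compile time, since a512p is not a uint64 (&)[8] (i.e. the function is type safe). Note that this code is fully expanded to show the bit-shifting sequence, but you could use #defines or an enum to avoid "magic numbers" and make the code clearer.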
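As one possible way to get rid of the magic numbers, the slot masks can be given names. The sketch below does this for the first output word only; the names SLOT0 to SLOT3 and pack_word0, as well as the uint64 typedef, are made up for this example.

```cpp
#include <cstdint>

typedef std::uint64_t uint64; // assumed definition of uint64

// Named masks for the 10 payload bits in each 16-bit slot of an input word.
static const uint64 SLOT0 = 0x00000000000003FFULL;
static const uint64 SLOT1 = SLOT0 << 16;
static const uint64 SLOT2 = SLOT0 << 32;
static const uint64 SLOT3 = SLOT0 << 48;

// First output word only, using the same shifts as a320[0] in pack512to320_manual.
inline uint64 pack_word0(const uint64 (&a512)[8])
{
    return  (a512[0] & SLOT0)
         | ((a512[0] & SLOT1) >> 6)
         | ((a512[0] & SLOT2) >> 12)
         | ((a512[0] & SLOT3) >> 18)
         | ((a512[1] & SLOT0) << 40)
         | ((a512[1] & SLOT1) << 34)
         | ((a512[1] & (uint64(0xF) << 32)) << 28); // low 4 bits of slot 2
}
```

The partial masks (4, 6, 8 and 2 bits) still need their own constants or expressions, so naming helps readability more than it shortens the code.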

If you want to extend this to handle larger byte sequences, you could do something like the following:

template < std::size_t X, std::size_t Y >
static void pack512to320_loop(const uint64 (&array512bits)[X], uint64 (&array320bits)[Y])
{
const uint64* start = array512bits;
const uint64* end = array512bits + (X-1);
uint64 tmp = *start;
uint64 tmask = 0;
int i = 0, tot = 0, stot = 0, rem = 0, z = 0;
bool excess = false;
while (start <= end) {
while (stot < bit_size) {
array320bits[i] |= ((tmp & 0x00000000000003FF) << tot);
tot += 10; // increase shift left by 10 bits
tmp = tmp >> 16; // shift off 2 bytes
stot += 16; // increase shifted total
if ((excess = ((tot + 10) >= bit_size))) { break; }
}
if (stot == bit_size) {
if (++start <= end) { tmp = *start; } // get next value, without reading past the last input word
stot = 0;
}
if (excess) {
rem = (bit_size - tot); // remainder bits to shift off
tot = 0;
// create the mask
tmask = 0;
for (z = 0; z < rem; ++z) { tmask |= (1 << z); }
// get the last bits
array320bits[i++] |= ((tmp & tmask) << (bit_size - rem));
// shift off and adjust
tmp = tmp >> rem;
rem = (10 - rem);
// new mask
tmask = 0;
for (z = 0; z < rem; ++z) { tmask |= (1 << z); }
if (i < static_cast<int>(Y)) { array320bits[i] = (tmp & tmask); } // no write past the last output word

tot += rem; // increase shift left by remainder bits
tmp = tmp >> (rem + 6); // shift off 2 bytes
stot += 16;
excess = false;
}
}
}

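This code also takes the byte boundaries into account and packs the input into the 320-bit array. However, it does no error checking to ensure the sizes match, so if X % 8 != 0 or Y % 5 != 0 (where X and Y > 0), you could get invalid results! Note that it relies on the bit_size constant (sizeof(uint64) * 8) defined in the test code below and expects array320bits to be zeroed beforehand. It is also much slower than the manual version because of the looping, temporaries and shifting involved, and a first-time reader may need more time to decipher the full intent and context of the loop code than of the bit-shift version.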
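Because the sizes are template parameters, a mismatch can also be rejected at compile time rather than producing invalid results at run time. The following is only a sketch and assumes a C++11 compiler for static_assert (the g++ 4.2.1 used for the timings below predates it). The names sizes_match and pack_checked are hypothetical, and the body uses a plain sample-by-sample copy instead of the loop above.

```cpp
#include <cstddef>
#include <cstdint>

typedef std::uint64_t uint64; // assumed definition of uint64

// True when X input words pack exactly into Y output words (8 in, 5 out per block).
template <std::size_t X, std::size_t Y>
struct sizes_match {
    static const bool value = (X > 0) && (X % 8 == 0) && (Y == (X / 8) * 5);
};

template <std::size_t X, std::size_t Y>
void pack_checked(const uint64 (&in)[X], uint64 (&out)[Y])
{
    static_assert(sizes_match<X, Y>::value, "need 8 input words per 5 output words");
    for (std::size_t i = 0; i < Y; ++i) { out[i] = 0; }
    for (std::size_t k = 0; k < X * 4; ++k) {       // 4 samples per input word
        const uint64 sample = (in[k / 4] >> (16 * (k % 4))) & 0x3FF;
        const std::size_t bit = 10 * k;             // destination bit offset
        out[bit / 64] |= sample << (bit % 64);
        if (bit % 64 > 54) {                        // sample straddles two output words
            out[bit / 64 + 1] |= sample >> (64 - bit % 64);
        }
    }
}
```

With this in place, calling pack_checked with, say, a 16-word input and a 5-word output fails to compile instead of writing out of bounds.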

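If you want something in between, you could use the manual packing function and iterate over the larger byte array in groups of 8 and 5 to ensure the bytes align properly; something like the following: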

template < std::size_t X, std::size_t Y >
static void pack512to320_manual_loop(const uint64 (&array512bits)[X], uint64 (&array320bits)[Y])
{
if ((X == 0) || (X % 8 != 0) || (Y == 0) || (Y % 5 != 0) || ((X / 8) != (Y / 5))) { // every 8 input words need exactly 5 output words
// handle invalid sizes how you need here
std::cerr << "Invalid sizes!" << std::endl;
return;
}
uint64* a320 = array320bits;
const uint64* end = array512bits + (X-1);
for (const uint64* a512 = array512bits; a512 < end; a512 += 8) {
*a320 = (
(a512[0] & 0x00000000000003FF) | // 10 -> 10
((a512[0] & 0x0000000003FF0000) >> 6) | // 10 -> 20
((a512[0] & 0x000003FF00000000) >> 12) | // 10 -> 30
((a512[0] & 0x03FF000000000000) >> 18) | // 10 -> 40
((a512[1] & 0x00000000000003FF) << 40) | // 10 -> 50
((a512[1] & 0x0000000003FF0000) << 34) | // 10 -> 60
((a512[1] & 0x0000000F00000000) << 28)); // 4 -> 64
++a320;

*a320 = (
((a512[1] & 0x000003F000000000) >> 36) | // 6 -> 6
((a512[1] & 0x03FF000000000000) >> 42) | // 10 -> 16
((a512[2] & 0x00000000000003FF) << 16) | // 10 -> 26
((a512[2] & 0x0000000003FF0000) << 10) | // 10 -> 36
((a512[2] & 0x000003FF00000000) << 4) | // 10 -> 46
((a512[2] & 0x03FF000000000000) >> 2) | // 10 -> 56
((a512[3] & 0x00000000000000FF) << 56)); // 8 -> 64
++a320;

*a320 = (
((a512[3] & 0x0000000000000300) >> 8) | // 2 -> 2
((a512[3] & 0x0000000003FF0000) >> 14) | // 10 -> 12
((a512[3] & 0x000003FF00000000) >> 20) | // 10 -> 22
((a512[3] & 0x03FF000000000000) >> 26) | // 10 -> 32
((a512[4] & 0x00000000000003FF) << 32) | // 10 -> 42
((a512[4] & 0x0000000003FF0000) << 26) | // 10 -> 52
((a512[4] & 0x000003FF00000000) << 20) | // 10 -> 62
((a512[4] & 0x0003000000000000) << 14)); // 2 -> 64
++a320;

*a320 = (
((a512[4] & 0x03FC000000000000) >> 50) | // 8 -> 8
((a512[5] & 0x00000000000003FF) << 8) | // 10 -> 18
((a512[5] & 0x0000000003FF0000) << 2) | // 10 -> 28
((a512[5] & 0x000003FF00000000) >> 4) | // 10 -> 38
((a512[5] & 0x03FF000000000000) >> 10) | // 10 -> 48
((a512[6] & 0x00000000000003FF) << 48) | // 10 -> 58
((a512[6] & 0x00000000003F0000) << 42)); // 6 -> 64
++a320;

*a320 = (
((a512[6] & 0x0000000003C00000) >> 22) | // 4 -> 4
((a512[6] & 0x000003FF00000000) >> 28) | // 10 -> 14
((a512[6] & 0x03FF000000000000) >> 34) | // 10 -> 24
((a512[7] & 0x00000000000003FF) << 24) | // 10 -> 34
((a512[7] & 0x0000000003FF0000) << 18) | // 10 -> 44
((a512[7] & 0x000003FF00000000) << 12) | // 10 -> 54
((a512[7] & 0x03FF000000000000) << 6)); // 10 -> 64
++a320;
}
}

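This is similar to the manual packing function and adds only a negligible amount of time for the checks, but it can handle larger arrays that pack cleanly into each other (again, unrolled to show the sequence).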

Timing the examples above with g++ 4.2.1 using -O3 on an i7 @ 2.2GHz yielded these average times:

pack512to320_loop: 0.135 us

pack512to320_manual: 0.0017 us

pack512to320_manual_loop: 0.0020 us

Here is the test code used to check the inputs/outputs and general timing:

#include <cstddef>
#include <cstdint>
#include <ctime>
#include <iostream>
#if defined(_MSC_VER)
#include <windows.h>
#define timesruct LARGE_INTEGER
#define dotick(v) QueryPerformanceCounter(&v)
timesruct freq;
#else
#define timesruct struct timespec
#define dotick(v) clock_gettime(CLOCK_MONOTONIC, &v)
#endif

// assumption: the original never shows how uint64 is defined
typedef std::uint64_t uint64;

static const std::size_t bit_size = sizeof(uint64) * 8;

template < std::size_t X, std::size_t Y >
static void pack512to320_loop(const uint64 (&array512bits)[X], uint64 (&array320bits)[Y])
{
const uint64* start = array512bits;
const uint64* end = array512bits + (X-1);
uint64 tmp = *start;
uint64 tmask = 0;
int i = 0, tot = 0, stot = 0, rem = 0, z = 0;
bool excess = false;
// this line is only here for validity's sake;
// it was commented out during performance testing
for (z = 0; z < Y; ++z) { array320bits[z] = 0; }
while (start <= end) {
while (stot < bit_size) {
array320bits[i] |= ((tmp & 0x00000000000003FF) << tot);
tot += 10; // increase shift left by 10 bits
tmp = tmp >> 16; // shift off 2 bytes
stot += 16; // increase shifted total
if ((excess = ((tot + 10) >= bit_size))) { break; }
}
if (stot == bit_size) {
if (++start <= end) { tmp = *start; } // get next value, without reading past the last input word
stot = 0;
}
if (excess) {
rem = (bit_size - tot); // remainder bits to shift off
tot = 0;
// create the mask
tmask = 0;
for (z = 0; z < rem; ++z) { tmask |= (1 << z); }
// get the last bits
array320bits[i++] |= ((tmp & tmask) << (bit_size - rem));
// shift off and adjust
tmp = tmp >> rem;
rem = (10 - rem);
// new mask
tmask = 0;
for (z = 0; z < rem; ++z) { tmask |= (1 << z); }
if (i < static_cast<int>(Y)) { array320bits[i] = (tmp & tmask); } // no write past the last output word

tot += rem; // increase shift left by remainder bits
tmp = tmp >> (rem + 6); // shift off 2 bytes
stot += 16;
excess = false;
}
}
}

template < std::size_t X, std::size_t Y >
static void pack512to320_manual_loop(const uint64 (&array512bits)[X], uint64 (&array320bits)[Y])
{
if ((X == 0) || (X % 8 != 0) || (Y == 0) || (Y % 5 != 0) || ((X / 8) != (Y / 5))) { // every 8 input words need exactly 5 output words
// handle invalid sizes how you need here
std::cerr << "Invalid sizes!" << std::endl;
return;
}
uint64* a320 = array320bits;
const uint64* end = array512bits + (X-1);
for (const uint64* a512 = array512bits; a512 < end; a512 += 8) {
*a320 = (
(a512[0] & 0x00000000000003FF) | // 10 -> 10
((a512[0] & 0x0000000003FF0000) >> 6) | // 10 -> 20
((a512[0] & 0x000003FF00000000) >> 12) | // 10 -> 30
((a512[0] & 0x03FF000000000000) >> 18) | // 10 -> 40
((a512[1] & 0x00000000000003FF) << 40) | // 10 -> 50
((a512[1] & 0x0000000003FF0000) << 34) | // 10 -> 60
((a512[1] & 0x0000000F00000000) << 28)); // 4 -> 64
++a320;

*a320 = (
((a512[1] & 0x000003F000000000) >> 36) | // 6 -> 6
((a512[1] & 0x03FF000000000000) >> 42) | // 10 -> 16
((a512[2] & 0x00000000000003FF) << 16) | // 10 -> 26
((a512[2] & 0x0000000003FF0000) << 10) | // 10 -> 36
((a512[2] & 0x000003FF00000000) << 4) | // 10 -> 46
((a512[2] & 0x03FF000000000000) >> 2) | // 10 -> 56
((a512[3] & 0x00000000000000FF) << 56)); // 8 -> 64
++a320;

*a320 = (
((a512[3] & 0x0000000000000300) >> 8) | // 2 -> 2
((a512[3] & 0x0000000003FF0000) >> 14) | // 10 -> 12
((a512[3] & 0x000003FF00000000) >> 20) | // 10 -> 22
((a512[3] & 0x03FF000000000000) >> 26) | // 10 -> 32
((a512[4] & 0x00000000000003FF) << 32) | // 10 -> 42
((a512[4] & 0x0000000003FF0000) << 26) | // 10 -> 52
((a512[4] & 0x000003FF00000000) << 20) | // 10 -> 62
((a512[4] & 0x0003000000000000) << 14)); // 2 -> 64
++a320;

*a320 = (
((a512[4] & 0x03FC000000000000) >> 50) | // 8 -> 8
((a512[5] & 0x00000000000003FF) << 8) | // 10 -> 18
((a512[5] & 0x0000000003FF0000) << 2) | // 10 -> 28
((a512[5] & 0x000003FF00000000) >> 4) | // 10 -> 38
((a512[5] & 0x03FF000000000000) >> 10) | // 10 -> 48
((a512[6] & 0x00000000000003FF) << 48) | // 10 -> 58
((a512[6] & 0x00000000003F0000) << 42)); // 6 -> 64
++a320;

*a320 = (
((a512[6] & 0x0000000003C00000) >> 22) | // 4 -> 4
((a512[6] & 0x000003FF00000000) >> 28) | // 10 -> 14
((a512[6] & 0x03FF000000000000) >> 34) | // 10 -> 24
((a512[7] & 0x00000000000003FF) << 24) | // 10 -> 34
((a512[7] & 0x0000000003FF0000) << 18) | // 10 -> 44
((a512[7] & 0x000003FF00000000) << 12) | // 10 -> 54
((a512[7] & 0x03FF000000000000) << 6)); // 10 -> 64
++a320;
}
}

static void pack512to320_manual(uint64 (&a512)[8], uint64 (&a320)[5])
{
a320[0] = (
(a512[0] & 0x00000000000003FF) | // 10 -> 10
((a512[0] & 0x0000000003FF0000) >> 6) | // 10 -> 20
((a512[0] & 0x000003FF00000000) >> 12) | // 10 -> 30
((a512[0] & 0x03FF000000000000) >> 18) | // 10 -> 40
((a512[1] & 0x00000000000003FF) << 40) | // 10 -> 50
((a512[1] & 0x0000000003FF0000) << 34) | // 10 -> 60
((a512[1] & 0x0000000F00000000) << 28)); // 4 -> 64

a320[1] = (
((a512[1] & 0x000003F000000000) >> 36) | // 6 -> 6
((a512[1] & 0x03FF000000000000) >> 42) | // 10 -> 16
((a512[2] & 0x00000000000003FF) << 16) | // 10 -> 26
((a512[2] & 0x0000000003FF0000) << 10) | // 10 -> 36
((a512[2] & 0x000003FF00000000) << 4) | // 10 -> 46
((a512[2] & 0x03FF000000000000) >> 2) | // 10 -> 56
((a512[3] & 0x00000000000000FF) << 56)); // 8 -> 64

a320[2] = (
((a512[3] & 0x0000000000000300) >> 8) | // 2 -> 2
((a512[3] & 0x0000000003FF0000) >> 14) | // 10 -> 12
((a512[3] & 0x000003FF00000000) >> 20) | // 10 -> 22
((a512[3] & 0x03FF000000000000) >> 26) | // 10 -> 32
((a512[4] & 0x00000000000003FF) << 32) | // 10 -> 42
((a512[4] & 0x0000000003FF0000) << 26) | // 10 -> 52
((a512[4] & 0x000003FF00000000) << 20) | // 10 -> 62
((a512[4] & 0x0003000000000000) << 14)); // 2 -> 64

a320[3] = (
((a512[4] & 0x03FC000000000000) >> 50) | // 8 -> 8
((a512[5] & 0x00000000000003FF) << 8) | // 10 -> 18
((a512[5] & 0x0000000003FF0000) << 2) | // 10 -> 28
((a512[5] & 0x000003FF00000000) >> 4) | // 10 -> 38
((a512[5] & 0x03FF000000000000) >> 10) | // 10 -> 48
((a512[6] & 0x00000000000003FF) << 48) | // 10 -> 58
((a512[6] & 0x00000000003F0000) << 42)); // 6 -> 64

a320[4] = (
((a512[6] & 0x0000000003C00000) >> 22) | // 4 -> 4
((a512[6] & 0x000003FF00000000) >> 28) | // 10 -> 14
((a512[6] & 0x03FF000000000000) >> 34) | // 10 -> 24
((a512[7] & 0x00000000000003FF) << 24) | // 10 -> 34
((a512[7] & 0x0000000003FF0000) << 18) | // 10 -> 44
((a512[7] & 0x000003FF00000000) << 12) | // 10 -> 54
((a512[7] & 0x03FF000000000000) << 6)); // 10 -> 64
}

template < std::size_t N >
static void printit(uint64 (&arr)[N])
{
for (std::size_t i = 0; i < N; ++i) {
std::cout << "arr[" << i << "] = " << arr[i] << std::endl;
}
}

static double elapsed_us(timesruct init, timesruct end)
{
#if defined(_MSC_VER)
if (freq.LowPart == 0) { QueryPerformanceFrequency(&freq); }
return (static_cast<double>(((end.QuadPart - init.QuadPart) * 1000000)) / static_cast<double>(freq.QuadPart));
#else
return ((end.tv_sec - init.tv_sec) * 1000000) + (static_cast<double>((end.tv_nsec - init.tv_nsec)) / 1000);
#endif
}

int main(int argc, char* argv[])
{
uint64 val = 0x039F039F039F039F;
uint64 a512[] = { val, val, val, val, val, val, val, val };
uint64 a320[] = { 0, 0, 0, 0, 0 };
int max_cnt = 1000000;
timesruct init, end;
std::cout << std::hex;

dotick(init);
for (int i = 0; i < max_cnt; ++i) {
pack512to320_loop(a512, a320);
}
dotick(end);
printit(a320);
// rough estimate of timing / divide by iterations
std::cout << "avg. us = " << (elapsed_us(init, end) / max_cnt) << " us" << std::endl;

dotick(init);
for (int i = 0; i < max_cnt; ++i) {
pack512to320_manual(a512, a320);
}
dotick(end);
printit(a320);
// rough estimate of timing / divide by iterations
std::cout << "avg. us = " << (elapsed_us(init, end) / max_cnt) << " us" << std::endl;

dotick(init);
for (int i = 0; i < max_cnt; ++i) {
pack512to320_manual_loop(a512, a320);
}
dotick(end);
printit(a320);
// rough estimate of timing / divide by iterations
std::cout << "avg. us = " << (elapsed_us(init, end) / max_cnt) << " us" << std::endl;

return 0;
}

Again, this is just generic test code, and your results may vary.
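The test program only prints the packed words, so checking them by eye is tedious. One way to verify the output is to expand it again and compare it with the input, with the padding bits masked off. The inverse below is a sketch under the same layout assumptions as the packing code; the name unpack320to512 and the uint64 typedef are not part of the original answer.

```cpp
#include <cstddef>
#include <cstdint>

typedef std::uint64_t uint64; // assumed definition of uint64

// Hypothetical inverse of the packing step: expand 32 packed 10-bit samples
// back into 16-bit slots. The 6 padding bits of every slot come back as zero.
inline void unpack320to512(const uint64 (&in)[5], uint64 (&out)[8])
{
    for (std::size_t w = 0; w < 8; ++w) { out[w] = 0; }
    for (std::size_t k = 0; k < 32; ++k) {
        const std::size_t bit = 10 * k;           // offset of sample k in the packed data
        uint64 v = in[bit / 64] >> (bit % 64);
        if (bit % 64 > 54) {                      // sample continues in the next word
            v |= in[bit / 64 + 1] << (64 - bit % 64);
        }
        out[k / 4] |= (v & 0x3FF) << (16 * (k % 4));
    }
}
```

After packing a512 into a320 and unpacking into a scratch array, each scratch word should equal the matching a512 word ANDed with 0x03FF03FF03FF03FF.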

Hope that helps.

Regarding "c++ - Efficiently packing 10-bit data on unaligned byte boundaries", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34775546/
