Comparing buffers as fast as possible

Reposted. Author: 可可西里. Updated: 2023-11-01 13:25:43

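I need to compare two buffers for equality, block by block. I don't need any information about how the two buffers relate to each other, only whether each pair of blocks is equal. My Intel machine supports up to SSE4.2.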

The naive approach is:

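#include <malloc.h>   // _aligned_malloc (MSVC)
#include <string.h>   // memcmp

const size_t CHUNK_SIZE = 16;         // 128 bits, the width of an SSE2 integer register
const size_t ARRAY_SIZE = 200000000;  // size_t, to match the loop index

char* array_1 = (char*)_aligned_malloc(ARRAY_SIZE, 16);
char* array_2 = (char*)_aligned_malloc(ARRAY_SIZE, 16);

for (size_t i = 0; i < ARRAY_SIZE; )
{
    // memcmp returns 0 for equal blocks, so result is true when the blocks differ
    volatile bool result = memcmp(array_1+i, array_2+i, CHUNK_SIZE) != 0;
    i += CHUNK_SIZE;
}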

Compared with my first attempt using SSE:

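#include <emmintrin.h>  // SSE2 intrinsics

union U
{
    __m128i m;
    volatile int i[4];
} res;

for (size_t i = 0; i < ARRAY_SIZE; )
{
    __m128i* pa1 = (__m128i*)(array_1+i);
    __m128i* pa2 = (__m128i*)(array_2+i);
    // each 32-bit lane becomes all ones if equal, all zeros if not
    res.m = _mm_cmpeq_epi32(*pa1, *pa2);
    // true when at least one lane differs, i.e. the blocks are not equal
    volatile bool result = ( (res.i[0]==0) || (res.i[1]==0) || (res.i[2]==0) || (res.i[3]==0) );
    i += CHUNK_SIZE;
}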

This is about 33% faster. Can I do better?

Best answer

You really shouldn't be using scalar code and a union to test all the individual vector elements. Do something like this instead:

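for (size_t i = 0; i < ARRAY_SIZE; i += CHUNK_SIZE)
{
    // _mm_load_si128 takes a const __m128i*, so the char pointers need a cast
    const __m128i a1 = _mm_load_si128((const __m128i*)(array_1 + i));
    const __m128i a2 = _mm_load_si128((const __m128i*)(array_2 + i));
    const __m128i vcmp = _mm_cmpeq_epi32(a1, a2);
    // one mask bit per byte: 0xffff means all 16 bytes matched
    const int vmask = _mm_movemask_epi8(vcmp);
    const bool result = (vmask == 0xffff);
    // you probably want to break here if you get a mismatch ???
}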
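
As a rough illustration of the early-exit idea hinted at in the last comment of the loop above, here is a minimal sketch that wraps the same compare-and-movemask test in two functions. The names block_equal and first_mismatching_block are hypothetical and do not appear in the original answer, and returning the index of the first differing block is only one possible design; if a result is needed for every block, the loop would have to store each one instead of stopping. The memcmp fallback is an assumption added so the sketch still compiles on targets without SSE2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

const std::size_t CHUNK_SIZE = 16;  // one 128-bit SSE register

// Hypothetical helper: true when the 16-byte blocks at a and b are identical.
// Both pointers must be 16-byte aligned, as in the question.
inline bool block_equal(const char* a, const char* b)
{
#if HAVE_SSE2
    const __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(b));
    const __m128i vcmp = _mm_cmpeq_epi32(va, vb);  // all ones in each equal 32-bit lane
    return _mm_movemask_epi8(vcmp) == 0xffff;      // all 16 byte-level mask bits set
#else
    return std::memcmp(a, b, CHUNK_SIZE) == 0;     // assumed portable fallback
#endif
}

// Hypothetical helper: index of the first block that differs,
// or the total number of blocks if every block matches.
inline std::size_t first_mismatching_block(const char* a, const char* b, std::size_t size)
{
    assert(size % CHUNK_SIZE == 0);  // whole blocks only
    assert(reinterpret_cast<std::uintptr_t>(a) % CHUNK_SIZE == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % CHUNK_SIZE == 0);
    for (std::size_t i = 0; i < size; i += CHUNK_SIZE)
    {
        if (!block_equal(a + i, b + i))
            return i / CHUNK_SIZE;   // stop at the first mismatch
    }
    return size / CHUNK_SIZE;
}
```

With the buffers from the question, first_mismatching_block(array_1, array_2, ARRAY_SIZE) would return ARRAY_SIZE / CHUNK_SIZE when the two buffers are identical. Whether stopping early actually helps depends on how often mismatches occur in the data.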

On comparing buffers as fast as possible, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/6136670/
