
c++ - How do I manage the cleanup code of a SIMD loop that packs an accumulated (single) value into two values?

Reposted. Author: 行者123. Updated: 2023-11-30 03:19:04

Suppose I manage a __m128d variable named v_phase, computed as

index 0 : load prev phase
index 1 : phase += newValue
index 2 : phase += newValue
index 3 : phase += newValue
index 4 : phase += newValue
...
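For reference, the update order listed above can be written as a scalar loop. This is only a sketch: the function name and the `increments` parameter (standing in for the per-sample "newValue") are hypothetical and do not appear in the original code.

```cpp
#include <cassert>
#include <cstddef>

// Scalar reference for the recurrence: returns the phase after consuming
// `blockSize` increments. `increments` is a hypothetical name for the
// per-sample "newValue" sequence.
double accumulatePhaseScalar(double prevPhase, const double* increments,
                             std::size_t blockSize) {
    assert(increments != nullptr || blockSize == 0);
    double phase = prevPhase;             // index 0: load prev phase
    for (std::size_t i = 0; i < blockSize; ++i) {
        // ... sample i would use `phase` here ...
        phase += increments[i];           // index i + 1: phase += newValue
    }
    return phase;
}
```

A scalar version like this is handy as a ground truth when checking the SIMD variants for even and odd block sizes.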

Here is the basic code:

__m128d v_phase;

// load prev cumulated mPhase to v_phase (as mPhase, mPhase + nextValue)

for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2, pValue += 2) {
    // function with phase

    // update pValue increment (it's not linear)

    // phase increment: v_phase += newValue
}

// cleanup code
if (blockSize % 2 == 0) {
    mPhase = v_phase.m128d_f64[0];
}

The thing is: if blockSize is even, this works fine. The last loop iteration adds two more phase values, and I take v_phase.m128d_f64[0] (the first of the two newly accumulated values).
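A side note on reading the lanes: the m128d_f64 member is specific to MSVC's definition of __m128d and is not available with GCC or Clang. A minimal sketch of portable lane reads using SSE2 intrinsics follows; the helper names lowLane and highLane are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2

// Lane 0 (low element) of a __m128d.
inline double lowLane(__m128d v) {
    return _mm_cvtsd_f64(v);
}

// Lane 1 (high element): duplicate the high element into both lanes,
// then read lane 0 of the result.
inline double highLane(__m128d v) {
    return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));
}
```

With these helpers, v_phase.m128d_f64[0] corresponds to lowLane(v_phase) and v_phase.m128d_f64[1] to highLane(v_phase).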

But what if blockSize is odd? I would only need v_phase.m128d_f64[1] from the last iteration, without summing two more phase values.

I could use sampleIndex < blockSize - 1 as the loop condition, but that would move the // function with phase logic into the // cleanup code as well (which I don't like very much).
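To make that trade-off concrete, here is a sketch of the blockSize - 1 variant on a simplified kernel. Everything in it is an assumption for illustration: the function name, the single `inc` array (in place of the pB/pC terms and clamping), and the "function with phase" being a plain store of the phase into `out`. It also assumes `inc` has at least blockSize + 2 readable elements, because the vector loads read past the current pair.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2

// Hypothetical simplified kernel: out[i] receives the phase seen by sample i.
// Assumes `inc` has at least blockSize + 2 readable elements (padding).
// Returns the phase to carry over to the next block.
double processPairsThenTail(double mPhase, const double* inc, double* out,
                            int blockSize) {
    assert(blockSize >= 0);
    // v_phase = (phase[0], phase[1])
    __m128d v_phase = _mm_setr_pd(mPhase, mPhase + inc[0]);
    int i = 0;
    for (; i < blockSize - 1; i += 2) {
        _mm_storeu_pd(out + i, v_phase);        // "function with phase"
        __m128d a = _mm_loadu_pd(inc + i);      // (inc[i],   inc[i+1])
        __m128d b = _mm_loadu_pd(inc + i + 1);  // (inc[i+1], inc[i+2])
        v_phase = _mm_add_pd(v_phase, _mm_add_pd(a, b));
    }
    if (i < blockSize) {
        // Odd blockSize: the per-sample logic is duplicated for the tail.
        out[i] = _mm_cvtsd_f64(v_phase);
        return _mm_cvtsd_f64(_mm_unpackhi_pd(v_phase, v_phase));
    }
    return _mm_cvtsd_f64(v_phase);
}
```

The loop body stays branch-free, but the tail has to repeat whatever the loop does per sample, which is the duplication the question wants to avoid.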

Placing an if inside the loop is something I would rather avoid (branch prediction; since I'm using SIMD to optimize the code, this would slow it down).

Any suggestions?

Here is a more "complete" example:

double phase = mPhase;

__m128d v_pB = _mm_setr_pd(0.0, pB[0]);
v_pB = _mm_mul_pd(v_pB, v_radiansPerSampleBp0);
__m128d v_pC = _mm_setr_pd(0.0, pC[0]);
v_pC = _mm_mul_pd(v_pC, v_radiansPerSample);

__m128d v_pB_prev = _mm_setr_pd(0.0, 0.0);
v_pB_prev = _mm_mul_pd(v_pB_prev, v_radiansPerSampleBp0);
__m128d v_pC_prev = _mm_setr_pd(0.0, 0.0);
v_pC_prev = _mm_mul_pd(v_pC_prev, v_radiansPerSample);

__m128d v_phaseAcc1;
__m128d v_phaseAcc2;
__m128d v_phase = _mm_set1_pd(phase);

// phase
v_phaseAcc1 = _mm_add_pd(v_pB, v_pC);
v_phaseAcc1 = _mm_max_pd(v_phaseAcc1, v_boundLower);
v_phaseAcc1 = _mm_min_pd(v_phaseAcc1, v_boundUpper);
v_phaseAcc2 = _mm_add_pd(v_pB_prev, v_pC_prev);
v_phaseAcc2 = _mm_max_pd(v_phaseAcc2, v_boundLower);
v_phaseAcc2 = _mm_min_pd(v_phaseAcc2, v_boundUpper);
v_phase = _mm_add_pd(v_phase, v_phaseAcc1);
v_phase = _mm_add_pd(v_phase, v_phaseAcc2);

for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2, pB += 2, pC += 2) {
    // code that will use v_phase

    // phase increment
    v_pB = _mm_loadu_pd(pB + 1);
    v_pB = _mm_mul_pd(v_pB, v_radiansPerSampleBp0);
    v_pC = _mm_loadu_pd(pC + 1);
    v_pC = _mm_mul_pd(v_pC, v_radiansPerSample);

    v_pB_prev = _mm_load_pd(pB);
    v_pB_prev = _mm_mul_pd(v_pB_prev, v_radiansPerSampleBp0);
    v_pC_prev = _mm_load_pd(pC);
    v_pC_prev = _mm_mul_pd(v_pC_prev, v_radiansPerSample);

    v_phaseAcc1 = _mm_add_pd(v_pB, v_pC);
    v_phaseAcc1 = _mm_max_pd(v_phaseAcc1, v_boundLower);
    v_phaseAcc1 = _mm_min_pd(v_phaseAcc1, v_boundUpper);
    v_phaseAcc2 = _mm_add_pd(v_pB_prev, v_pC_prev);
    v_phaseAcc2 = _mm_max_pd(v_phaseAcc2, v_boundLower);
    v_phaseAcc2 = _mm_min_pd(v_phaseAcc2, v_boundUpper);
    v_phase = _mm_add_pd(v_phase, v_phaseAcc1);
    v_phase = _mm_add_pd(v_phase, v_phaseAcc2);
}

// cleanup code
if (blockSize % 2 == 0) {
    mPhase = v_phase.m128d_f64[0];
}
else {
    // ??? what to do if blockSize is odd?
}

Best answer

In addition to the final v_phase, you can also keep the previous v_phase from the loop. That is, store the previous value before updating v_phase:

__m128d prev_v_phase;
for (...) {
    ...
    prev_v_phase = v_phase;
    v_phase = _mm_add_pd(v_phase, v_phaseAcc1);
    v_phase = _mm_add_pd(v_phase, v_phaseAcc2);
}

// cleanup code
if (blockSize % 2 == 0) {
    mPhase = v_phase.m128d_f64[0];
}
else {
    mPhase = prev_v_phase.m128d_f64[1];
}

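This fails if the loop executes no iterations at all (prev_v_phase would then be uninitialized), but that is a case where performance does not matter, so it is easy to handle.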
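The approach can be sketched end to end on a simplified kernel. The names and the kernel itself are assumptions for illustration: a single `inc` array stands in for the pB/pC terms and clamping, and the per-sample work is just a store of the phase into `out`. The sketch assumes `inc` has at least blockSize + 2 readable elements and `out` has room for blockSize rounded up to an even count. prev_v_phase is initialized from v_phase before the loop, which is one simple way to keep the zero-iteration case well defined.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2

// Hypothetical simplified kernel: out[i] receives the phase seen by sample i.
// Assumes `inc` has at least blockSize + 2 readable elements and `out` has
// room for blockSize rounded up to an even number of elements.
// Returns the phase to carry over to the next block.
double processWithPrevPhase(double mPhase, const double* inc, double* out,
                            int blockSize) {
    assert(blockSize >= 0);
    // v_phase = (phase[0], phase[1])
    __m128d v_phase = _mm_setr_pd(mPhase, mPhase + inc[0]);
    __m128d prev_v_phase = v_phase;  // defined even if the loop never runs
    for (int i = 0; i < blockSize; i += 2) {
        _mm_storeu_pd(out + i, v_phase);        // code that uses v_phase
        __m128d a = _mm_loadu_pd(inc + i);      // (inc[i],   inc[i+1])
        __m128d b = _mm_loadu_pd(inc + i + 1);  // (inc[i+1], inc[i+2])
        prev_v_phase = v_phase;                 // keep the pre-update value
        v_phase = _mm_add_pd(v_phase, _mm_add_pd(a, b));
    }
    if (blockSize % 2 == 0) {
        return _mm_cvtsd_f64(v_phase);          // lane 0 of the final value
    }
    // Odd blockSize: lane 1 of the value before the last update.
    return _mm_cvtsd_f64(_mm_unpackhi_pd(prev_v_phase, prev_v_phase));
}
```

The extra register copy per iteration is the only cost inside the loop; the even/odd decision stays outside it.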

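Regarding "c++ - How do I manage the cleanup code of a SIMD loop that packs an accumulated (single) value into two values?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54126042/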
