
multithreading - Compiler optimizations break multithreaded code

Reposted. Author: 行者123. Updated: 2023-12-04 04:29:47

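After learning the hard way that shared variables are currently not guarded by memory barriers, I have run into another issue. Either I am doing something wrong, or the existing compiler optimization in dmd can break multithreaded code by reordering reads of shared variables.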

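For example, when I compile an executable with dmd -O (full optimization), the compiler happily optimizes away the local variable o in this code (where cas is the compare-and-swap function from core.atomic):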

shared uint cnt;
void atomicInc ( ) { uint o; do { o = cnt; } while ( !cas( &cnt, o, o + 1 ) );}

into something like this (see the disassembly below):
shared uint cnt;
void atomicInc ( ) { while ( !cas( &cnt, cnt, cnt + 1 ) ) { } }

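In the "optimized" code, cnt is read from memory twice, which runs the risk that another thread modifies cnt between the two reads. The optimization basically destroys the compare-and-swap algorithm.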
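To make the failure concrete, the following single-threaded sketch replays one unlucky interleaving by hand. It is only an illustration: casSim is a hypothetical, non-atomic stand-in for cas, the names replayBrokenInc and replayCorrectInc are made up, and the starting value 5 is arbitrary.

```d
// Hypothetical single-threaded model of compare-and-swap (not core.atomic.cas).
bool casSim(ref uint target, uint expected, uint desired)
{
    if (target != expected) return false;
    target = desired;
    return true;
}

// Replays the optimized instruction order with one interfering increment.
uint replayBrokenInc()
{
    uint cnt = 5;
    uint desired = cnt + 1;   // first read:  mov eax,[cnt]; inc eax
    cnt += 1;                 // "another thread" increments cnt here (5 -> 6)
    uint expected = cnt;      // second read: push [cnt]
    // expected equals the current value, so the swap succeeds and stores the
    // stale result 6: the other thread's increment is lost.
    bool swapped = casSim(cnt, expected, desired);
    assert(swapped);
    return cnt;
}

// Replays the intended loop: one read per attempt, retry on failure.
uint replayCorrectInc()
{
    uint cnt = 5;
    bool interfered = false;
    uint o;
    do
    {
        o = cnt;              // the only read in this attempt
        if (!interfered)      // "another thread" increments once
        {
            cnt += 1;
            interfered = true;
        }
    } while (!casSim(cnt, o, o + 1));
    return cnt;
}
```

After two increments the counter should be 7. The broken replay ends at 6, while the single-read loop detects the interference, retries, and ends at 7.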

Is this a bug, or is there a correct way to achieve the desired result? The only workaround I have found so far is to implement the code in assembler.

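Full test code and additional details
For the sake of completeness, here is a full test program that shows both issues (no memory barriers, and the optimization problem). It produces the following output on three different Windows machines for both dmd 2.049 and dmd 2.050 (assuming that Dekker's algorithm does not deadlock, which might happen):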
dmd -O -run optbug.d
CAS : failed
Dekker: failed

And the loop inside atomicInc gets compiled with full optimization into this:
; cnt is stored at 447C10h
; while ( !cas( &cnt, o, o + 1 ) ) o = cnt;
; 1) prepare call cas( &cnt, o, o + 1 ): &cnt and o go to stack, o+1 to eax
402027: mov ecx,447C10h ; ecx = &cnt
40202C: mov eax,[447C10h] ; eax = o1 = cnt
402031: inc eax ; eax = o1 + 1 (third parameter)
402032: push ecx ; push &cnt (first parameter)
; next instruction pushes current value of cnt onto stack
; as second parameter o instead of re-using o1
402033: push [447C10h]
402039: call 4020BC ; 2) call cas
40203E: xor al,1 ; 3) test success
402040: jne 402027 ; no success try again
; end of main loop

Here is the test code:
import core.atomic;
import core.thread;
import std.stdio;

enum loops = 0xFFFF;
shared uint cnt;

/* *****************************************************************************
Implement atomicOp!("+=")(cnt, 1U); with CAS. The code below doesn't work with
the "-O" compiler flag because cnt is read twice while calling cas and another
thread can modify cnt in between.
*/
enum threads = 8;

void atomicInc ( ) { uint o; do { o = cnt; } while ( !cas( &cnt, o, o + 1 ) );}
void threadFunc ( ) { foreach (i; 0..loops) atomicInc; }

void testCas ( ) {
    cnt = 0;
    auto tgCas = new ThreadGroup;
    foreach (i; 0..threads) tgCas.create(&threadFunc);
    tgCas.joinAll;
    writeln( "CAS : ", cnt == loops * threads ? "passed" : "failed" );
}

/* *****************************************************************************
Dekker's algorithm. Fails on ia32 (other than atom) because ia32 can re-order
read before write. Most likely fails on many other architectures.
*/
shared bool flag1 = false;
shared bool flag2 = false;
shared bool turn2 = false; // avoids starvation by executing 1 and 2 in turns

void dekkerInc ( ) {
    flag1 = true;
    while ( flag2 ) if ( turn2 ) {
        flag1 = false; while ( turn2 ) { /* wait until my turn */ }
        flag1 = true;
    }
    cnt++; // shouldn't work without a cast
    turn2 = true; flag1 = false;
}

void dekkerDec ( ) {
    flag2 = true;
    while ( flag1 ) if ( !turn2 ) {
        flag2 = false; while ( !turn2 ) { /* wait until my turn */ }
        flag2 = true;
    }
    cnt--; // shouldn't work without a cast
    turn2 = false; flag2 = false;
}

void threadDekkerInc ( ) { foreach (i; 0..loops) dekkerInc; }
void threadDekkerDec ( ) { foreach (i; 0..loops) dekkerDec; }

void testDekker ( ) {
    cnt = 0;
    auto tgDekker = new ThreadGroup;
    tgDekker.create( &threadDekkerInc );
    tgDekker.create( &threadDekkerDec );
    tgDekker.joinAll;
    writeln( "Dekker: ", cnt == 0 ? "passed" : "failed" );
}

/* ************************************************************************** */
void main() {
    testCas;
    testDekker;
}

Best answer

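While the issues still seem to exist, core.atomic now exposes atomicLoad, which enables a relatively straightforward workaround. To make the cas example work, it suffices to load cnt atomically: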

void atomicInc ( ) {
    uint o;
    do {
        o = atomicLoad(cnt);
    } while ( !cas( &cnt, o, o + 1 ) );
}
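As a rough, self-contained sketch of this workaround, the module below runs the atomicLoad-based loop from several threads and, for comparison, the atomicOp!"+=" call that the question's test code says it is re-implementing. The names total, workers, perWorker, incWithCas, incWithOp and runWorkers are hypothetical, and the thread and iteration counts are arbitrary.

```d
import core.atomic;
import core.thread;

shared uint total;          // hypothetical counter, stands in for cnt

enum workers = 4;           // arbitrary thread count
enum perWorker = 10_000;    // arbitrary increments per thread

// CAS loop with an explicit atomic load: the expected value and the new
// value are both derived from the same single read of the counter.
void incWithCas()
{
    uint seen;
    do
    {
        seen = atomicLoad(total);
    } while (!cas(&total, seen, seen + 1));
}

// Same effect through the read-modify-write helper in core.atomic.
void incWithOp()
{
    atomicOp!"+="(total, 1u);
}

// Runs `inc` perWorker times on each of `workers` threads and returns
// the final counter value.
uint runWorkers(void function() inc)
{
    atomicStore(total, 0u);
    auto tg = new ThreadGroup;
    foreach (w; 0 .. workers)
        tg.create({ foreach (i; 0 .. perWorker) inc(); });
    tg.joinAll();
    return atomicLoad(total);
}
```

Calling runWorkers(&incWithCas) or runWorkers(&incWithOp) should return workers * perWorker if no increment is lost. Where a plain counter is all that is needed, atomicOp is probably the simpler choice; the explicit loop is mainly useful when the new value depends on the old one in a more complicated way.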

Similarly, to make Dekker's algorithm work:
// ...
while ( atomicLoad(flag2) ) if ( turn2 ) {
// ...
while ( atomicLoad(flag1) ) if ( !turn2 ) {
// ...
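For reference, here is one way the complete algorithm might look with every flag access routed through core.atomic. This goes further than the two changed lines above and is an assumption on my part: it also uses atomicStore for the flag writes, updates the counter with atomicLoad/atomicStore inside the critical section, and yields in the inner wait loops. All names (wantInc, wantDec, decTurn, guarded, rounds, lockedInc, lockedDec, runDekker) are hypothetical.

```d
import core.atomic;
import core.thread;

shared bool wantInc = false;   // corresponds to flag1
shared bool wantDec = false;   // corresponds to flag2
shared bool decTurn = false;   // corresponds to turn2
shared int guarded;            // counter protected by the algorithm

enum rounds = 10_000;          // arbitrary iteration count

void lockedInc()
{
    atomicStore(wantInc, true);
    while (atomicLoad(wantDec))
    {
        if (atomicLoad(decTurn))
        {
            atomicStore(wantInc, false);
            while (atomicLoad(decTurn)) Thread.yield(); // wait until my turn
            atomicStore(wantInc, true);
        }
    }
    // Critical section: only one thread is here at a time.
    atomicStore(guarded, atomicLoad(guarded) + 1);
    atomicStore(decTurn, true);
    atomicStore(wantInc, false);
}

void lockedDec()
{
    atomicStore(wantDec, true);
    while (atomicLoad(wantInc))
    {
        if (!atomicLoad(decTurn))
        {
            atomicStore(wantDec, false);
            while (!atomicLoad(decTurn)) Thread.yield(); // wait until my turn
            atomicStore(wantDec, true);
        }
    }
    // Critical section.
    atomicStore(guarded, atomicLoad(guarded) - 1);
    atomicStore(decTurn, false);
    atomicStore(wantDec, false);
}

void incLoop() { foreach (i; 0 .. rounds) lockedInc(); }
void decLoop() { foreach (i; 0 .. rounds) lockedDec(); }

// Runs one incrementing and one decrementing thread; returns the final count.
int runDekker()
{
    atomicStore(guarded, 0);
    auto tg = new ThreadGroup;
    tg.create(&incLoop);
    tg.create(&decLoop);
    tg.joinAll();
    return atomicLoad(guarded);
}
```

If mutual exclusion holds, runDekker() returns 0 because every increment is matched by a decrement. The sketch relies on the default sequentially consistent ordering of atomicLoad and atomicStore; weaker orderings would not be enough for Dekker's algorithm.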

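On architectures other than ia32 (ignoring string operations and SSE), which may additionally reorder
  • reads relative to reads,
  • writes relative to writes, or
  • writes and reads to the same memory location,
additional memory barriers would be required.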
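As an illustration of what such a barrier can look like in D, the sketch below hands a value from one thread to another using atomicFence from core.atomic. It is a minimal example under assumptions, not part of the original answer: the names payload, ready, producer, consumer and runHandoff are hypothetical, and the individual accesses deliberately use MemoryOrder.raw so that the ordering comes from the fences alone.

```d
import core.atomic;
import core.thread;

shared int payload;    // data being handed over
shared bool ready;     // set once payload is valid

void producer()
{
    atomicStore!(MemoryOrder.raw)(payload, 42);
    atomicFence();     // keep the payload write ahead of the flag write
    atomicStore!(MemoryOrder.raw)(ready, true);
}

int consumer()
{
    while (!atomicLoad!(MemoryOrder.raw)(ready)) Thread.yield();
    atomicFence();     // keep the payload read behind the flag read
    return atomicLoad!(MemoryOrder.raw)(payload);
}

// Starts the producer on its own thread and returns what the consumer saw.
int runHandoff()
{
    atomicStore(ready, false);
    atomicStore(payload, 0);
    auto t = new Thread(&producer);
    t.start();
    int seen = consumer();
    t.join();
    return seen;
}
```

In everyday code the same effect is usually obtained more simply by leaving atomicLoad and atomicStore at their default ordering; the explicit fences are shown only to make the barrier visible.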

Regarding multithreading - Compiler optimizations break multithreaded code, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/4165149/
