
performance - Why does sync.Mutex performance drop sharply when goroutine contention exceeds 3400?

Reposted. Author: IT王子. Updated: 2023-10-29 01:28:18

I am comparing the performance of sync.Mutex and Go channels. Here is my benchmark:

// go playground: https://play.golang.org/p/f_u9jHBq_Jc
package bench // the package name is illustrative; save the file as *_test.go

import (
	"fmt"
	"runtime"
	"sync"
	"testing"
)

const (
	start = 300 // actual = start * goprocs
	end   = 600 // actual = end * goprocs
	step  = 10
)

var goprocs = runtime.GOMAXPROCS(0) // 8

// https://perf.golang.org/search?q=upload:20190819.3
// BenchmarkChanWrite guards v with a one-slot channel used as a token:
// a goroutine takes the token, increments v, then puts the token back.
func BenchmarkChanWrite(b *testing.B) {
	var v int64
	ch := make(chan int, 1)
	ch <- 1
	for i := start; i < end; i += step {
		b.Run(fmt.Sprintf("goroutines-%d", i*goprocs), func(b *testing.B) {
			b.SetParallelism(i) // RunParallel starts i*GOMAXPROCS goroutines
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					<-ch
					v++
					ch <- 1
				}
			})
		})
	}
}

// https://perf.golang.org/search?q=upload:20190819.2
// BenchmarkMutexWrite guards v with a single sync.Mutex.
func BenchmarkMutexWrite(b *testing.B) {
	var v int64
	var mu sync.Mutex
	for i := start; i < end; i += step {
		b.Run(fmt.Sprintf("goroutines-%d", i*goprocs), func(b *testing.B) {
			b.SetParallelism(i) // RunParallel starts i*GOMAXPROCS goroutines
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					mu.Lock()
					v++
					mu.Unlock()
				}
			})
		})
	}
}
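To make the mapping from the loop variable to the goroutine count concrete, here is a small sketch. It assumes GOMAXPROCS is 8, as the comment in the benchmark states; goroutineCounts is a hypothetical helper and not part of the benchmark.

```go
package main

import "fmt"

// goroutineCounts lists the goroutine counts the benchmark loop produces.
// b.SetParallelism(p) makes b.RunParallel start p*GOMAXPROCS goroutines.
func goroutineCounts(start, end, step, goprocs int) []int {
	var counts []int
	for p := start; p < end; p += step {
		counts = append(counts, p*goprocs)
	}
	return counts
}

func main() {
	// goprocs = 8 is taken from the comment in the benchmark (an assumption
	// about the machine, not something this sketch detects).
	counts := goroutineCounts(300, 600, 10, 8)
	fmt.Println(len(counts), counts[0], counts[len(counts)-1]) // prints: 30 2400 4720
}
```

This matches the range of sub-benchmark names in the raw data, which run from goroutines-2400 to goroutines-4720.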

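The performance comparison is visualized below:

[image: benchmark chart comparing sync.Mutex and channel]

What are the reasons that

  1. sync.Mutex suffers a sharp performance drop once the number of goroutines exceeds roughly 3400?
  2. Go channels are quite stable, but slower than sync.Mutex before that point?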

Raw benchmark data from benchstat (go test -bench=. -count=5), go version go1.12.4 linux/amd64:

MutexWrite/goroutines-2400-8  48.6ns ± 1%
MutexWrite/goroutines-2480-8 49.1ns ± 0%
MutexWrite/goroutines-2560-8 49.7ns ± 1%
MutexWrite/goroutines-2640-8 50.5ns ± 3%
MutexWrite/goroutines-2720-8 50.9ns ± 2%
MutexWrite/goroutines-2800-8 51.8ns ± 3%
MutexWrite/goroutines-2880-8 52.5ns ± 2%
MutexWrite/goroutines-2960-8 54.1ns ± 4%
MutexWrite/goroutines-3040-8 54.5ns ± 2%
MutexWrite/goroutines-3120-8 56.1ns ± 3%
MutexWrite/goroutines-3200-8 63.2ns ± 5%
MutexWrite/goroutines-3280-8 77.5ns ± 6%
MutexWrite/goroutines-3360-8 141ns ± 6%
MutexWrite/goroutines-3440-8 239ns ± 8%
MutexWrite/goroutines-3520-8 248ns ± 3%
MutexWrite/goroutines-3600-8 254ns ± 2%
MutexWrite/goroutines-3680-8 256ns ± 1%
MutexWrite/goroutines-3760-8 261ns ± 2%
MutexWrite/goroutines-3840-8 266ns ± 3%
MutexWrite/goroutines-3920-8 276ns ± 3%
MutexWrite/goroutines-4000-8 278ns ± 3%
MutexWrite/goroutines-4080-8 286ns ± 5%
MutexWrite/goroutines-4160-8 293ns ± 4%
MutexWrite/goroutines-4240-8 295ns ± 2%
MutexWrite/goroutines-4320-8 280ns ± 8%
MutexWrite/goroutines-4400-8 294ns ± 9%
MutexWrite/goroutines-4480-8 285ns ±10%
MutexWrite/goroutines-4560-8 290ns ± 8%
MutexWrite/goroutines-4640-8 271ns ± 3%
MutexWrite/goroutines-4720-8 271ns ± 4%

ChanWrite/goroutines-2400-8 158ns ± 3%
ChanWrite/goroutines-2480-8 159ns ± 2%
ChanWrite/goroutines-2560-8 161ns ± 2%
ChanWrite/goroutines-2640-8 161ns ± 1%
ChanWrite/goroutines-2720-8 163ns ± 1%
ChanWrite/goroutines-2800-8 166ns ± 3%
ChanWrite/goroutines-2880-8 168ns ± 1%
ChanWrite/goroutines-2960-8 176ns ± 4%
ChanWrite/goroutines-3040-8 176ns ± 2%
ChanWrite/goroutines-3120-8 180ns ± 1%
ChanWrite/goroutines-3200-8 180ns ± 1%
ChanWrite/goroutines-3280-8 181ns ± 2%
ChanWrite/goroutines-3360-8 183ns ± 2%
ChanWrite/goroutines-3440-8 188ns ± 3%
ChanWrite/goroutines-3520-8 190ns ± 2%
ChanWrite/goroutines-3600-8 193ns ± 2%
ChanWrite/goroutines-3680-8 196ns ± 3%
ChanWrite/goroutines-3760-8 199ns ± 2%
ChanWrite/goroutines-3840-8 206ns ± 2%
ChanWrite/goroutines-3920-8 209ns ± 2%
ChanWrite/goroutines-4000-8 206ns ± 2%
ChanWrite/goroutines-4080-8 209ns ± 2%
ChanWrite/goroutines-4160-8 208ns ± 2%
ChanWrite/goroutines-4240-8 209ns ± 3%
ChanWrite/goroutines-4320-8 213ns ± 2%
ChanWrite/goroutines-4400-8 209ns ± 2%
ChanWrite/goroutines-4480-8 211ns ± 1%
ChanWrite/goroutines-4560-8 213ns ± 2%
ChanWrite/goroutines-4640-8 215ns ± 1%
ChanWrite/goroutines-4720-8 218ns ± 3%
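One way to locate the knee in the MutexWrite numbers is to look for the largest relative increase between consecutive rows. The sketch below does that on a subset of the rows above; the sample type and largestJump are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// sample is one benchstat row: goroutine count and mean ns/op.
type sample struct {
	goroutines int
	nsPerOp    float64
}

// largestJump returns the goroutine count at which ns/op grew the most
// relative to the previous row, together with that growth ratio.
func largestJump(rows []sample) (int, float64) {
	at, best := 0, 0.0
	for i := 1; i < len(rows); i++ {
		ratio := rows[i].nsPerOp / rows[i-1].nsPerOp
		if ratio > best {
			at, best = rows[i].goroutines, ratio
		}
	}
	return at, best
}

func main() {
	// A subset of the MutexWrite rows listed above.
	mutex := []sample{
		{3120, 56.1}, {3200, 63.2}, {3280, 77.5},
		{3360, 141}, {3440, 239}, {3520, 248},
	}
	at, ratio := largestJump(mutex)
	fmt.Printf("largest jump at %d goroutines: x%.2f\n", at, ratio)
}
```

On these rows the steepest step is from 3280 to 3360 goroutines, and the cost keeps climbing until about 3440, which is where the "roughly 3400" in the question comes from.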

Go 1.12.4. Hardware:

CPU:       Quad core Intel Core i7-7700 (-MT-MCP-) cache: 8192 KB
clock speeds: max: 4200 MHz 1: 1109 MHz 2: 3641 MHz 3: 3472 MHz 4: 3514 MHz 5: 3873 MHz 6: 3537 MHz
7: 3410 MHz 8: 3016 MHz
CPU Flags: 3dnowprefetch abm acpi adx aes aperfmperf apic arat arch_perfmon art avx avx2 bmi1 bmi2
bts clflush clflushopt cmov constant_tsc cpuid cpuid_fault cx16 cx8 de ds_cpl dtes64 dtherm dts epb
ept erms est f16c flexpriority flush_l1d fma fpu fsgsbase fxsr hle ht hwp hwp_act_window hwp_epp
hwp_notify ibpb ibrs ida intel_pt invpcid invpcid_single lahf_lm lm mca mce md_clear mmx monitor
movbe mpx msr mtrr nonstop_tsc nopl nx pae pat pbe pcid pclmulqdq pdcm pdpe1gb pebs pge pln pni
popcnt pse pse36 pti pts rdrand rdseed rdtscp rep_good rtm sdbg sep smap smep smx ss ssbd sse sse2
sse4_1 sse4_2 ssse3 stibp syscall tm tm2 tpr_shadow tsc tsc_adjust tsc_deadline_timer tsc_known_freq
vme vmx vnmi vpid x2apic xgetbv1 xsave xsavec xsaveopt xsaves xtopology xtpr

Update: I tested on different hardware. The problem still appears to exist:

[image: benchmark chart from the second machine]

Bench: https://play.golang.org/p/HnQ44--E4UQ


Update:

My full benchmark runs from 8 goroutines to 15000 goroutines and compares chan, sync.Mutex, and atomic:

[image: benchmark chart for chan/sync.Mutex/atomic]
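The atomic variant is not shown in the post. A minimal sketch of what such a counter could look like, assuming it uses atomic.AddInt64 from sync/atomic; atomicWrite is a hypothetical name, and plain goroutines stand in for b.RunParallel here:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// atomicWrite starts the given number of goroutines, each of which
// increments a shared counter iters times without taking any lock.
func atomicWrite(goroutines, iters int) int64 {
	var v int64
	var wg sync.WaitGroup
	wg.Add(goroutines)
	for g := 0; g < goroutines; g++ {
		go func() {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				atomic.AddInt64(&v, 1)
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&v)
}

func main() {
	fmt.Println(atomicWrite(2400, 1000)) // prints: 2400000
}
```

An atomic add never blocks the calling goroutine, so this variant has no slow path in which a goroutine is put to sleep.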

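Best answer

The implementation of sync.Mutex is based on the runtime semaphore. The sharp performance drop comes from the implementation of runtime.semacquire1.

Now let's sample two representative points with go tool pprof: one where the number of goroutines is 2400 and one where it is 4800: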

goos: linux
goarch: amd64
BenchmarkMutexWrite/goroutines-2400-8 50000000 46.5 ns/op
PASS
ok 2.508s

BenchmarkMutexWrite/goroutines-4800-8 50000000 317 ns/op
PASS
ok 16.020s

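2400:

[image: pprof graph at 2400 goroutines]

4800:

[image: pprof graph at 4800 goroutines]

As we can see, when the number of goroutines increases to 4800, the overhead of runtime.gopark becomes dominant. Let's dig into the runtime source code and see who exactly calls runtime.gopark. In runtime.semacquire1: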

func semacquire1(addr *uint32, lifo bool, profile semaProfileFlags, skipframes int) {
	// fast path
	if cansemacquire(addr) {
		return
	}

	s := acquireSudog()
	root := semroot(addr)
	...
	for {
		lock(&root.lock)
		atomic.Xadd(&root.nwait, 1)
		if cansemacquire(addr) {
			atomic.Xadd(&root.nwait, -1)
			unlock(&root.lock)
			break
		}

		// slow path
		root.queue(addr, s, lifo)
		goparkunlock(&root.lock, waitReasonSemacquire, traceEvGoBlockSync, 4+skipframes)
		if s.ticket != 0 || cansemacquire(addr) {
			break
		}
	}
	...
}
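The fast-path check can be modeled in user code. The sketch below is a user-space approximation built on sync/atomic, not the runtime's own code; tryAcquire is a hypothetical name standing in for cansemacquire.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// tryAcquire models the fast-path check: if the semaphore count at addr is
// non-zero, decrement it and report success. It never blocks.
func tryAcquire(addr *uint32) bool {
	for {
		v := atomic.LoadUint32(addr)
		if v == 0 {
			return false // nothing available: the caller would take the slow path
		}
		if atomic.CompareAndSwapUint32(addr, v, v-1) {
			return true
		}
		// Another goroutine changed the count in between; retry.
	}
}

func main() {
	sema := uint32(1)
	fmt.Println(tryAcquire(&sema)) // true: the count goes from 1 to 0
	fmt.Println(tryAcquire(&sema)) // false: the count is already 0
}
```

Everything after a failed check (queueing the goroutine and parking it) is what makes the slow path expensive, because it hands control to the scheduler.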

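From the pprof graphs above, we can conclude:

  1. Observation: with 2400 goroutines, runtime.gopark is rarely called, while runtime.mutex is called heavily. We infer that most acquisitions complete before reaching the slow path.

  2. Observation: with 4800 goroutines, runtime.gopark is called heavily. We infer that most acquisitions enter the slow path, and once runtime.gopark is involved, the cost of the runtime scheduler's context switches has to be taken into account.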
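A rough way to see how often an acquisition cannot be satisfied immediately is to try sync.Mutex.TryLock first and count the failures. This is only a proxy: a failed TryLock does not necessarily mean the goroutine would be parked, since Lock may spin first. The sketch assumes Go 1.18 or later (where TryLock exists), and contendedLocks is a hypothetical name.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// contendedLocks runs the mutex workload and counts how many acquisitions
// failed TryLock and had to fall back to a blocking Lock.
func contendedLocks(goroutines, iters int) (total, contended int64) {
	var (
		mu sync.Mutex
		v  int64
		wg sync.WaitGroup
	)
	wg.Add(goroutines)
	for g := 0; g < goroutines; g++ {
		go func() {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				if !mu.TryLock() {
					atomic.AddInt64(&contended, 1)
					mu.Lock()
				}
				v++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return v, atomic.LoadInt64(&contended)
}

func main() {
	for _, n := range []int{2400, 4800} {
		total, contended := contendedLocks(n, 200)
		fmt.Printf("goroutines=%d total=%d contended=%d\n", n, total, contended)
	}
}
```

The contended count varies from run to run and from machine to machine, so treat it as a qualitative indicator rather than a measurement of the semaphore slow path itself.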

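By contrast, a Go channel guards its internal state with a runtime-internal lock (futex-based on Linux) rather than with the semaphore path used by sync.Mutex. A goroutine blocked on a channel is still parked by the scheduler, but in this benchmark the channel cost grows roughly linearly with the number of goroutines instead of jumping at a threshold.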

The above explains why we see a sharp performance drop in sync.Mutex.

Regarding "performance - Why does sync.Mutex performance drop sharply when goroutine contention exceeds 3400?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/57562606/
