parallel-processing - Julia: why doesn't shared-memory multithreading give me a speedup?


I want to use shared-memory multithreading in Julia. As the Threads.@threads macro does, I can use ccall(:jl_threading_run ...) to do this. And while my code now runs in parallel, I don't get the speedup I expected.

The following code is intended as a minimal example of the approach I am taking and the performance problem I am having: [Edit: see later for an even more minimal example.]

nthreads = Threads.nthreads()
test_size = 1000000
println("STARTED with ", nthreads, " thread(s) and test size of ", test_size, ".")
# Something to be processed:
objects = rand(test_size)
# Somewhere for our results
results = zeros(nthreads)
counts = zeros(nthreads)
# A function to do some work.
function worker_fn()
    work_idx = 1
    my_result = results[Threads.threadid()]
    while work_idx > 0
        my_result += objects[work_idx]
        work_idx += nthreads
        if work_idx > test_size
            break
        end
        counts[Threads.threadid()] += 1
    end
end

# Call our worker function using jl_threading_run
@time ccall(:jl_threading_run, Ref{Cvoid}, (Any,), worker_fn)

# Verify that we made as many calls as we think we did.
println("\nCOUNTS:")
println("\tPer thread:\t", counts)
println("\tSum:\t\t", sum(counts))

On an i7-7700, a typical single-threaded result is:

STARTED with 1 thread(s) and test size of 1000000.
0.134606 seconds (5.00 M allocations: 76.563 MiB, 1.79% gc time)

COUNTS:
Per thread: [999999.0]
Sum: 999999.0

And with 4 threads:

STARTED with 4 thread(s) and test size of 1000000.
0.140378 seconds (1.81 M allocations: 25.661 MiB)

COUNTS:
Per thread: [249999.0, 249999.0, 249999.0, 249999.0]
Sum: 999996.0

Multithreading slows things down! Why?

Edit: A better minimal example can be created using the @threads macro itself.

a = zeros(Threads.nthreads())
b = rand(test_size)
calls = zeros(Threads.nthreads())
@time Threads.@threads for i = 1 : test_size
    a[Threads.threadid()] += b[i]
    calls[Threads.threadid()] += 1
end

I had (incorrectly, it seems) assumed that the inclusion of the @threads macro in Julia meant that there would be a benefit to be had.

Best Answer

The problem you have is most probably false sharing: both a and calls hold one 8-byte Float64 slot per thread, so slots belonging to different threads end up on the same CPU cache line, and every write by one thread forces that line to be invalidated in the other cores' caches.

You can solve it by spacing out the regions you write to far enough apart, like this (here is a "quick and dirty" implementation to show the essence of the change):

julia> function f(spacing)
           test_size = 1000000
           a = zeros(Threads.nthreads()*spacing)
           b = rand(test_size)
           calls = zeros(Threads.nthreads()*spacing)
           Threads.@threads for i = 1 : test_size
               @inbounds begin
                   a[Threads.threadid()*spacing] += b[i]
                   calls[Threads.threadid()*spacing] += 1
               end
           end
           a, calls
       end
f (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime f(1);
41.525 ms (35 allocations: 7.63 MiB)

julia> @btime f(8);
2.189 ms (35 allocations: 7.63 MiB)
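
The spacing of 8 is not arbitrary: each slot of a and calls is an 8-byte Float64, so a stride of 8 elements puts every thread's slot on its own cache line. A minimal sketch of that arithmetic, assuming a 64-byte cache line (typical for x86-64, but not queried from the hardware here):

# Assumed, not measured: 64-byte cache lines (typical on x86-64 CPUs).
cache_line_bytes = 64
element_bytes = sizeof(Float64)            # 8 bytes per slot of `a` and `calls`
needed_spacing = cld(cache_line_bytes, element_bytes)
println("spacing needed to avoid false sharing: ", needed_spacing)   # prints 8

With spacing = 1, all four threads' slots share a single cache line, which is exactly the false-sharing pattern measured above.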

Or, like this, by doing the per-thread accumulation in a local variable (this is the preferred approach, as it should be uniformly faster):

function getrange(n)
    tid = Threads.threadid()
    nt = Threads.nthreads()
    d, r = divrem(n, nt)
    from = (tid - 1) * d + min(r, tid - 1) + 1
    to = from + d - 1 + (tid ≤ r ? 1 : 0)
    from:to
end

function f()
    test_size = 10^8
    a = zeros(Threads.nthreads())
    b = rand(test_size)
    calls = zeros(Threads.nthreads())
    Threads.@threads for k = 1 : Threads.nthreads()
        local_a = 0.0
        local_c = 0.0
        for i in getrange(test_size)
            for j in 1:10
                local_a += b[i]
                local_c += 1
            end
        end
        a[Threads.threadid()] = local_a
        calls[Threads.threadid()] = local_c
    end
    a, calls
end
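
To see how getrange partitions the work, here is a small sketch using a hypothetical chunkrange(n, tid, nt) variant that takes the thread id and thread count as explicit arguments (the same arithmetic as getrange above, just inspectable without starting any threads):

# Hypothetical stand-alone variant of getrange, for illustration only.
function chunkrange(n, tid, nt)
    d, r = divrem(n, nt)
    from = (tid - 1) * d + min(r, tid - 1) + 1
    to = from + d - 1 + (tid ≤ r ? 1 : 0)
    from:to
end

# For n = 10 and nt = 4 this prints 1:3, 4:6, 7:8, 9:10 -- contiguous,
# near-equal chunks with no index touched by more than one thread.
for tid in 1:4
    println("thread ", tid, " => ", chunkrange(10, tid, 4))
end

Because each thread then writes to a[Threads.threadid()] only once, after all of its local accumulation, the shared arrays are never contended inside the hot loop.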

Also note that you are probably running 4 threads on a machine with 2 physical cores (and only 4 virtual cores), so the gains from threading will not be linear.
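
As a quick sanity check, a minimal sketch using only Base (note that Sys.CPU_THREADS reports logical CPUs, i.e. hardware threads, not physical cores):

# The thread count is fixed at startup, e.g. via the JULIA_NUM_THREADS
# environment variable set before launching julia; it cannot be changed mid-session.
println("Julia threads: ", Threads.nthreads())
println("Logical CPUs:  ", Sys.CPU_THREADS)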

On parallel-processing - Julia: why doesn't shared-memory multithreading give me a speedup?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52593588/
