
python - Why is communication via shared memory so much slower than via queues?


I'm using Python 2.7.5 on a recent vintage Apple MacBook Pro that has four hardware and eight logical CPUs; i.e., the sysctl utility gives:

$ sysctl hw.physicalcpu
hw.physicalcpu: 4
$ sysctl hw.logicalcpu
hw.logicalcpu: 8

I need to do some fairly involved processing on a large one-dimensional list or array, save the result as an intermediate output, and then use it again later in subsequent calculations within my application. The structure of my problem lends itself quite naturally to parallelization, so I thought I would try using Python's multiprocessing module to split the 1D array into several pieces (either 4 or 8 pieces, I'm not sure which yet), perform the calculations in parallel, and then reassemble the resulting output into its final format afterwards. I'm trying to decide whether to use multiprocessing.Queue() (message queues) or multiprocessing.Array() (shared memory) as my preferred mechanism for communicating the calculated results from the child processes back to the main parent process, and I've been experimenting with a couple of "toy" models to make sure I understand how the multiprocessing module actually works.

However, I came across a rather unexpected result: in creating two essentially equivalent solutions to the same problem, the version that uses shared memory for interprocess communication seems to require far more execution time (like 30X more!) than the version using message queues. Below I provide two different versions of sample source code for a "toy" problem that uses parallel processes to generate a long sequence of random numbers and communicates the aggregated result back to the parent process in two different ways: first using message queues, and second using shared memory.

Here is the version that uses message queues:

import random
import multiprocessing
import datetime

def genRandom(count, id, q):
    print("Now starting process {0}".format(id))
    output = []
    # Generate a list of random numbers, of length "count"
    for i in xrange(count):
        output.append(random.random())
    # Write the output to a queue, to be read by the calling process
    q.put(output)

if __name__ == "__main__":
    # Number of random numbers to be generated by each process
    size = 1000000
    # Number of processes to create -- the total size of all of the random
    # numbers generated will ultimately be (procs * size)
    procs = 4

    # Create a list of jobs and queues
    jobs = []
    outqs = []
    for i in xrange(0, procs):
        q = multiprocessing.Queue()
        p = multiprocessing.Process(target=genRandom, args=(size, i, q))
        jobs.append(p)
        outqs.append(q)

    # Start time of the parallel processing and communications section
    tstart = datetime.datetime.now()
    # Start the processes (i.e. calculate the random number lists)
    for j in jobs:
        j.start()

    # Read out the data from the queues
    data = []
    for q in outqs:
        data.extend(q.get())

    # Ensure all of the processes have finished
    for j in jobs:
        j.join()
    # End time of the parallel processing and communications section
    tstop = datetime.datetime.now()
    tdelta = datetime.timedelta.total_seconds(tstop - tstart)

    msg = "{0} random numbers generated in {1} seconds"
    print(msg.format(len(data), tdelta))

When I run it, the result typically looks like this:

$ python multiproc_queue.py
Now starting process 0
Now starting process 1
Now starting process 2
Now starting process 3
4000000 random numbers generated in 0.514805 seconds

Now, here is the equivalent code segment, refactored slightly so that it uses shared memory instead of queues:

import random
import multiprocessing
import datetime

def genRandom(count, id, d):
    print("Now starting process {0}".format(id))
    # Generate a list of random numbers, of length "count", and write them
    # directly to a segment of an array in shared memory
    for i in xrange(count*id, count*(id+1)):
        d[i] = random.random()

if __name__ == "__main__":
    # Number of random numbers to be generated by each process
    size = 1000000
    # Number of processes to create -- the total size of all of the random
    # numbers generated will ultimately be (procs * size)
    procs = 4

    # Create a list of jobs and a block of shared memory
    jobs = []
    data = multiprocessing.Array('d', size*procs)
    for i in xrange(0, procs):
        p = multiprocessing.Process(target=genRandom, args=(size, i, data))
        jobs.append(p)

    # Start time of the parallel processing and communications section
    tstart = datetime.datetime.now()
    # Start the processes (i.e. calculate the random number lists)
    for j in jobs:
        j.start()

    # Ensure all of the processes have finished
    for j in jobs:
        j.join()
    # End time of the parallel processing and communications section
    tstop = datetime.datetime.now()
    tdelta = datetime.timedelta.total_seconds(tstop - tstart)

    msg = "{0} random numbers generated in {1} seconds"
    print(msg.format(len(data), tdelta))

However, when I run the shared memory version, a typical result looks more like this:

$ python multiproc_shmem.py 
Now starting process 0
Now starting process 1
Now starting process 2
Now starting process 3
4000000 random numbers generated in 15.839607 seconds

My question: why is there such a dramatic difference in execution speed between the two versions of my code (roughly 0.5 seconds vs. 15 seconds, a factor of 30X!)? And in particular, how can I modify the shared memory version to make it run faster?

Best answer

This is because multiprocessing.Array uses a lock by default to prevent multiple processes from accessing it at the same time:

multiprocessing.Array(typecode_or_type, size_or_initializer, *, lock=True)

...

If lock is True (the default) then a new lock object is created to synchronize access to the value. If lock is a Lock or RLock object then that will be used to synchronize access to the value. If lock is False then access to the returned object will not be automatically protected by a lock, so it will not necessarily be "process-safe".

This means you're not really writing to the array concurrently; only one process can access it at a time. Since your example workers are doing almost nothing but array writes, constantly waiting on this lock badly hurts performance. If you use lock=False when you create the array, performance is much better:

lock=True:

Now starting process 0
Now starting process 1
Now starting process 2
Now starting process 3
4000000 random numbers generated in 4.811205 seconds

lock=False:

Now starting process 0
Now starting process 3
Now starting process 1
Now starting process 2
4000000 random numbers generated in 0.192473 seconds
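
Concretely, the only line that needs to change in the shared-memory version above is the one that creates the array (a minimal sketch; the rest of the script stays the same):

    # Create the shared array without its default synchronizing lock.
    # This is safe here because each worker writes only to its own
    # disjoint slice of the array.
    data = multiprocessing.Array('d', size*procs, lock=False)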

Note that using lock=False means you need to manually protect access to the Array whenever you do something that isn't process-safe. Your example has each process writing to a unique section, so that's fine. But if you tried to read from the array while that was happening, or had different processes write to overlapping sections, you would need to manually acquire a lock.
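For the overlapping case, here is a minimal sketch of what manual locking could look like (the explicit lock argument and the revised genRandom signature are illustrative assumptions, not code from the original post):

import random
import multiprocessing

def genRandom(count, id, d, lock):
    # Hold the lock while touching a region that other processes might
    # also read or write; this serializes the protected section.
    with lock:
        for i in xrange(count*id, count*(id+1)):
            d[i] = random.random()

if __name__ == "__main__":
    size = 1000000
    procs = 4
    # Unsynchronized shared array plus an explicit lock for manual control
    lock = multiprocessing.Lock()
    data = multiprocessing.Array('d', size*procs, lock=False)
    jobs = [multiprocessing.Process(target=genRandom, args=(size, i, data, lock))
            for i in xrange(procs)]
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()

How coarsely you hold the lock is a tradeoff: acquiring it around the whole loop, as above, is cheap but serializes the workers, while acquiring it per element allows interleaving at the cost of heavy lock contention.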

Regarding python - Why is communication via shared memory so much slower than via queues?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/25271723/
