Python multiprocessing queue makes code hang with large data


I'm using Python's multiprocessing to analyse some large texts. After several days trying to figure out why my code was hanging (i.e. the processes never terminated), I was able to reproduce the problem with the following simple code:

import multiprocessing as mp

for y in range(65500, 65600):
    print(y)

    def func(output):
        output.put("a" * y)

    if __name__ == "__main__":
        output = mp.Queue()
        process = mp.Process(target=func, args=(output,))
        process.start()
        process.join()

As you can see, if the item put into the queue gets too large, the process hangs. It doesn't freeze: if I write more code after output.put() it runs, but still, the process never stops.

This starts to happen when the string reaches 65500 characters; the exact threshold may vary depending on your interpreter.

I know mp.Queue has a maxsize argument, but from some searching I found that it refers to the number of items in the queue, not the size of the items themselves.
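That is easy to demonstrate. In the minimal sketch below (my own illustration, not from the question), a queue with maxsize=1 accepts one very large string but refuses a second, tiny item, because the limit counts items rather than bytes:

import multiprocessing as mp
import queue

if __name__ == "__main__":
    q = mp.Queue(maxsize=1)
    q.put("a" * 1_000_000)     # a single huge item is accepted
    try:
        q.put("b", timeout=1)  # a second, tiny item is not: one item is the limit
    except queue.Full:
        print("maxsize limits the number of items, not their size")
    q.get()                    # drain the queue so the feeder thread can finish
    q.close()
    q.join_thread()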

Is there a way around this? The data I need to put into the queue in my original code is very, very large...

Best Answer

Your queue fills up and there is no consumer to empty it.

From the definition of Queue.put:

If the optional argument block is True (the default) and timeout is None (the default), block if necessary until a free slot is available.

Assuming there is no deadlock between producer and consumer (and assuming your original code does have a consumer, since your sample doesn't), the producer should eventually be unblocked and terminate. Check the code of your consumer (or add it to the question so we can have a look).


Update

This is not the problem, because queue has not been given a maxsize so put should succeed until you run out of memory.

That is not how Queue behaves. As explained in this ticket, it is not the queue itself that blocks here but the underlying pipe. From the linked resource (insertions between "[]" are mine):

A queue works like this:
- when you call queue.put(data), the data is added to a deque, which can grow and shrink forever
- then a thread pops elements from the deque, and sends them so that the other process can receive them through a pipe or a Unix socket (created via socketpair). But, and that's the important point, both pipes and unix sockets have a limited capacity (used to be 4k - pagesize - on older Linux kernels for pipes, now it's 64k, and between 64k-120k for unix sockets, depending on tunable sysctls).
- when you do queue.get(), you just do a read on the pipe/socket

[..] when size [becomes too big] the writing thread blocks on the write syscall. And since a join is performed before dequeuing the item [note: that's your process.join], you just deadlock, since the join waits for the sending thread to complete, and the write can't complete since the pipe/socket is full! If you dequeue the item before waiting for the submitter process, everything works fine.
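Applied to the repro above, that means calling get() before join(): the read drains the pipe, the feeder thread in the child can finish its write, and the child exits. A minimal sketch of the fix (I pass y as an explicit argument rather than relying on the module-level loop variable, which the child would not see under the spawn start method):

import multiprocessing as mp

def func(output, y):
    output.put("a" * y)

if __name__ == "__main__":
    output = mp.Queue()
    process = mp.Process(target=func, args=(output, 70000))
    process.start()
    data = output.get()  # drain the pipe first, unblocking the child's feeder thread
    process.join()       # now the child can actually terminate
    print(len(data))     # 70000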


Update 2

I understand. But I don't actually have a consumer (if it is what I'm thinking it is), I will only get the results from the queue when process has finished putting it into the queue.

Yes, and that is the problem: multiprocessing.Queue is not a storage container. You should use it exclusively to pass data between a "producer" (the process that generates the data entering the queue) and a "consumer" (the process that "uses" that data). As you now know, leaving the data sitting there is a bad idea.

How can I get an item from the queue if I cannot even put it there first?

put and get hide away the problem of the data filling the pipe, so you just need to set up a loop in your "main" process to get the items out of the queue, appending them to a list, for example. The list is kept in the main process's memory space and does not clog the pipe.
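A minimal sketch of that pattern (the sentinel value None and the item sizes are my own choices, not part of the answer):

import multiprocessing as mp

def producer(output):
    for _ in range(5):
        output.put("a" * 100_000)  # several items, each well above the pipe capacity
    output.put(None)               # sentinel: nothing more to send

if __name__ == "__main__":
    output = mp.Queue()
    process = mp.Process(target=producer, args=(output,))
    process.start()

    results = []             # lives in the main process's memory, not in the pipe
    while True:
        item = output.get()  # reading keeps the pipe drained
        if item is None:
            break
        results.append(item)

    process.join()           # safe now: the queue is already empty
    print(sum(len(r) for r in results))  # 500000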

Regarding "Python multiprocessing queue makes code hang with large data", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59951832/
