
python - multiprocessing: optimize CPU usage for concurrent HTTP async requests


I need to download a list of sites/URLs (which may change over time), and I currently use multiprocessing.Manager().Queue() to submit and update that list.
I have to check each URL/task every second, so each task essentially never ends (until a certain condition is met, such as a user interrupt). I thought multiprocessing.Process() combined with asyncio and a good async HTTP client would do the trick. Unfortunately, my CPU usage is still very high after submitting 50 or more URLs. You can see the difference yourself between when the tasks perform no requests (running mock_request()) and when they do (running do_request()).

Here is an example that reproduces each case (feel free to press CTRL+C at any time to end it gracefully).

import asyncio, os, sys, time, httpx
import multiprocessing
import queue as Queue

class ExitHandler(object):
    def __init__(self, manager, queue, processes):
        self.manager = manager
        self.queue = queue
        self.processes = processes

    def set_exit_handler(self):
        if os.name == "nt":
            try:
                import win32api
                win32api.SetConsoleCtrlHandler(self.on_exit, True)
            except ImportError:
                version = ".".join(map(str, sys.version_info[:2]))
                raise Exception("pywin32 not installed for Python " + version)
        else:
            import signal
            signal.signal(signal.SIGINT, self.on_exit)
            #signal.signal(signal.CTRL_C_EVENT, func)
            signal.signal(signal.SIGTERM, self.on_exit)

    def on_exit(self, sig, func=None):
        print('[Main process]: exit triggered, terminating all workers')
        STOP_WAIT_SECS = 5
        for _ in range(N_WORKERS):
            self.queue.put('END')

        try:
            end_time = time.time() + STOP_WAIT_SECS
            # wait up to STOP_WAIT_SECS for all processes to complete
            for proc in self.processes:
                join_secs = max(0.0, min(end_time - time.time(), STOP_WAIT_SECS))
                proc.join(join_secs)

            # clear the procs list and _terminate_ any procs that have not yet exited
            while self.processes and len(self.processes) > 0:
                proc = self.processes.pop()
                if proc.is_alive():
                    proc.terminate()

            self.manager.shutdown()

            # finally, kill this thread and any running ones
            os._exit(0)
        except Exception:
            pass

async def mock_request(url):

    # we won't do any request here, it's just an example of how much less CPU
    # each process consumes when not doing requests

    x = 0
    while True:
        try:
            x += 1
            print('Finished downloading {}'.format(url))
            await asyncio.sleep(1)
        except asyncio.CancelledError:
            return

async def do_request(url):

    while True:
        try:
            # I use httpx (https://github.com/encode/httpx/) as async client for its simplicity
            # feel free to use your preferred library (e.g. aiohttp)
            async with httpx.AsyncClient() as s:
                await s.get(url)
                print('Finished downloading {}'.format(url))
                await asyncio.sleep(1)
        except asyncio.CancelledError:
            return

def worker(queue):

    try:
        event_loop = asyncio.get_event_loop()
        event_loop.run_until_complete(request_worker(queue))
    except KeyboardInterrupt:
        pass

async def request_worker(queue):

    p = multiprocessing.current_process()
    loop = asyncio.get_event_loop()

    while True:
        try:
            # queue.get() blocks, so run it in the default executor
            # to avoid stalling the event loop
            task = await loop.run_in_executor(None, queue.get)

            if task == 'END':
                break

            elif task['action'] == 'DOWNLOAD':
                print('Worker {}: Received new task'.format(p.name))
                f = loop.create_task(do_request(task['url']))  # high CPU usage
                # f = loop.create_task(mock_request(task['url']))  # low (almost none) CPU usage

        except KeyboardInterrupt:
            pass
        except Queue.Empty:
            pass

    print('Task Worker {}: ending'.format(p.name))

def run_workers(queue, processes):

    print('Starting workers')

    for _ in range(N_WORKERS):
        processes.append(multiprocessing.Process(target=worker, args=(queue,)))

    task = {
        'action': 'DOWNLOAD',
        'url': 'https://google.com'
    }

    # this is just an example forcing the same URL * 100 times, while in reality
    # it will be 1 different URL per task
    for _ in range(100):
        queue.put(task)

    for p in processes:
        p.start()

    for p in processes:
        p.join()

    return True

if __name__ == "__main__":
    processes = []
    N_WORKERS = 8  # processes to spawn
    manager = multiprocessing.Manager()
    q = manager.Queue()  # main queue to send URLs to

    # just a useful clean exit handler (press CTRL+C to terminate)
    exit_handler = ExitHandler(manager, q, processes)
    exit_handler.set_exit_handler()

    # start the workers
    run_workers(q, processes)

This is just an example of how much CPU each process consumes while requests are being executed concurrently:

[screenshot: per-process CPU usage while requests are running]

Any solution that significantly reduces CPU usage (while keeping the same number of requests per second) is acceptable, whether or not it uses multiprocessing. The only hard requirement for me is the asynchronous pattern.

Best Answer

This stands out:

while True:
    try:
        async with httpx.AsyncClient() as s:

This initializes a new client for every request, and, looking at the implementation, each one imports and initializes an SSL context. In my view these are expensive operations, which would explain why running them inside the loop consumes so much CPU.
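To get a feel for that setup cost, you could time client construction on its own (a hypothetical micro-benchmark, not part of the original answer; it uses the synchronous httpx.Client, which performs the same kind of SSL context setup):

import time, httpx

start = time.perf_counter()
for _ in range(100):
    client = httpx.Client()  # each construction sets up a fresh SSL context
    client.close()
print('100 client setups: {:.2f}s'.format(time.perf_counter() - start))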

Instead, consider reordering the code to:

async with httpx.AsyncClient() as s:
    while True:
        try:
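Applied to do_request() from the question, the full reordered coroutine would look something like this (a minimal sketch based on the snippet above: the client, and with it its SSL context and connection pool, is created once per task and reused for every request):

async def do_request(url):
    # create the client (and its SSL context) once, outside the loop,
    # so every iteration reuses the same client and pooled connection
    async with httpx.AsyncClient() as s:
        while True:
            try:
                await s.get(url)
                print('Finished downloading {}'.format(url))
                await asyncio.sleep(1)
            except asyncio.CancelledError:
                return

Going a step further than the answer shows, a single AsyncClient could also be created once per worker process and shared by all of that worker's tasks, so the setup cost is paid once per process rather than once per URL.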

Regarding "python - multiprocessing: optimize CPU usage for concurrent HTTP async requests", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66420899/
