
python - How to prevent asyncio.TimeoutError from being raised and continue the loop


I am using aiohttp with the limited_as_completed method to speed up scraping (about 100 million static website pages). However, the code stops after several minutes and returns a TimeoutError. I have tried several approaches, but still cannot prevent asyncio.TimeoutError from being raised. How can I ignore the error and continue?

The code I am running is:

N = 123
import html
from lxml import etree
import requests
import asyncio
import aiohttp
from aiohttp import ClientSession, TCPConnector
import pandas as pd
import re
import csv
import time
from itertools import islice
import sys
from contextlib import suppress

start = time.time()
data = {}
data['name'] = []
filename = "C:\\Users\\xxxx" + str(N) + ".csv"

def limited_as_completed(coros, limit):
    futures = [
        asyncio.ensure_future(c)
        for c in islice(coros, 0, limit)
    ]
    async def first_to_finish():
        while True:
            await asyncio.sleep(0)
            for f in futures:
                if f.done():
                    futures.remove(f)
                    try:
                        newf = next(coros)
                        futures.append(
                            asyncio.ensure_future(newf))
                    except StopIteration as e:
                        pass
                    return f.result()
    while len(futures) > 0:
        yield first_to_finish()

async def get_info_byid(i, url, session):
    async with session.get(url, timeout=20) as resp:
        print(url)
        with suppress(asyncio.TimeoutError):
            r = await resp.text()
            name = etree.HTML(r).xpath('//h2[starts-with(text(),"Customer Name")]/text()')
            data['name'].append(name)
            dataframe = pd.DataFrame(data)
            dataframe.to_csv(filename, index=False, sep='|')

limit = 1000
async def print_when_done(tasks):
    for res in limited_as_completed(tasks, limit):
        await res

url = "http://xxx.{}.html"
loop = asyncio.get_event_loop()

async def main():
    connector = TCPConnector(limit=10)
    async with ClientSession(connector=connector, headers=headers, raise_for_status=False) as session:
        coros = (get_info_byid(i, url.format(i), session) for i in range(N, N + 1000000))
        await print_when_done(coros)

loop.run_until_complete(main())
loop.close()
print("took", time.time() - start, "seconds.")

The error log is:

Traceback (most recent call last):
  File "C:\Users\xxx.py", line 111, in <module>
    loop.run_until_complete(main())
  File "C:\Users\xx\AppData\Local\Programs\Python\Python37-32\lib\asyncio\base_events.py", line 573, in run_until_complete
    return future.result()
  File "C:\Users\xxx.py", line 109, in main
    await print_when_done(coros)
  File "C:\Users\xxx.py", line 98, in print_when_done
    await res
  File "C:\Users\xxx.py", line 60, in first_to_finish
    return f.result()
  File "C:\Users\xxx.py", line 65, in get_info_byid
    async with session.get(url,timeout=20) as resp:
  File "C:\Users\xx\AppData\Local\Programs\Python\Python37-32\lib\site-packages\aiohttp\client.py", line 855, in __aenter__
    self._resp = await self._coro
  File "C:\Users\xx\AppData\Local\Programs\Python\Python37-32\lib\site-packages\aiohttp\client.py", line 391, in _request
    await resp.start(conn)
  File "C:\Users\xx\AppData\Local\Programs\Python\Python37-32\lib\site-packages\aiohttp\client_reqrep.py", line 770, in start
    self._continue = None
  File "C:\Users\xx\AppData\Local\Programs\Python\Python37-32\lib\site-packages\aiohttp\helpers.py", line 673, in __exit__
    raise asyncio.TimeoutError from None
concurrent.futures._base.TimeoutError

I have already tried: 1) adding "except asyncio.TimeoutError: pass". It doesn't work:

async def get_info_byid(i, url, session):
    async with session.get(url, timeout=20) as resp:
        print(url)
        try:
            r = await resp.text()
            name = etree.HTML(r).xpath('//h2[starts-with(text(),"Customer Name")]/text()')
            data['name'].append(name)
            dataframe = pd.DataFrame(data)
            dataframe.to_csv(filename, index=False, sep='|')
        except asyncio.TimeoutError:
            pass

2) suppress(asyncio.TimeoutError), as shown above. It doesn't work either.

I only learned aiohttp yesterday, so maybe there are other problems in my code that cause the timeout error to appear only after a few minutes of running? Many thanks to anyone who knows how to handle this!

Best Answer

What @Yurii Kramarenko did will raise an unclosed client session exception sooner or later, since the session is never closed properly. What I recommend is something like this:

import asyncio
import aiohttp

async def main(urls):
    # "async with" guarantees the session is closed properly even if a
    # task raises. The original snippet's self.timeout and
    # self.do_something were copied out of a class; plain names are
    # used here, with the question's 20-second timeout.
    timeout = aiohttp.ClientTimeout(total=20)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = [do_something(session, url) for url in urls]
        await asyncio.gather(*tasks)
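
For completeness, the traceback above also shows why both of the question's attempts failed: the TimeoutError is raised while "async with session.get(...)" is being entered, before the try/suppress body is ever reached, so the handler has to wrap the request itself. Below is a minimal sketch of the do_something placeholder, with the XPath and 20-second timeout carried over from the question; the shape of the error handling is an assumption, not code from the original answer:

import asyncio
import aiohttp
from lxml import etree

async def do_something(session, url):
    # Wrap the whole request: session.get() itself raises
    # asyncio.TimeoutError when the server is too slow, so a
    # try/except placed inside the "async with" block comes too late.
    try:
        async with session.get(url, timeout=20) as resp:
            text = await resp.text()
            return etree.HTML(text).xpath(
                '//h2[starts-with(text(),"Customer Name")]/text()')
    except asyncio.TimeoutError:
        print("timed out, skipping:", url)  # log it and keep going
        return None

Because each coroutine catches its own TimeoutError and returns None, asyncio.gather(*tasks) (or the limited_as_completed loop from the question) keeps running when individual requests time out.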

Regarding python - How to prevent asyncio.TimeoutError from being raised and continue the loop, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53049523/
