python - Adding tasks to Python asyncio

Reposted. Author: 太空宇宙. Updated: 2023-11-04 08:55:01

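I'm trying to write a simple web crawler to test how the new asyncio module works, but I'm getting something wrong. I start the crawler with a single URL. The script should download that page, find any <a> tags on it, and schedule those links to be downloaded as well. The output I expect is a line indicating that the first page was downloaded, followed by the subsequent pages in random order (that is, in whatever order they finish) until all are done. Instead, the pages appear to be downloaded strictly one after another. I'm completely new to async in general and to this module in particular, so I'm sure I'm just missing some basic concept.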

Here is my code so far:

import asyncio
import re
import requests
import time
from bs4 import BeautifulSoup
from functools import partial

@asyncio.coroutine
def get_page(url, depth=0):
    print('%s: Getting %s' % (time.time(), url))
    page = requests.get(url)
    print('%s: Got %s' % (time.time(), url))
    soup = BeautifulSoup(page.text)
    if depth < 2:
        for a in soup.find_all('a', href=re.compile(r'\w+\.html'))[:3]:
            u = 'https://docs.python.org/3/' + a['href']
            print('%s: Scheduling %s' % (time.time(), u))
            yield from get_page(u, depth+1)
    if depth == 0:
        loop.stop()
    return soup

root = 'https://docs.python.org/3/'
loop = asyncio.get_event_loop()
loop.create_task(get_page(root))
loop.run_forever()

Here is the output:

1434971882.3458219: Getting https://docs.python.org/3/
1434971893.0054126: Got https://docs.python.org/3/
1434971893.015218: Scheduling https://docs.python.org/3/genindex.html
1434971893.0153584: Getting https://docs.python.org/3/genindex.html
1434971894.464993: Got https://docs.python.org/3/genindex.html
1434971894.4752269: Scheduling https://docs.python.org/3/py-modindex.html
1434971894.4753256: Getting https://docs.python.org/3/py-modindex.html
1434971896.9845033: Got https://docs.python.org/3/py-modindex.html
1434971897.0756354: Scheduling https://docs.python.org/3/index.html
1434971897.0757186: Getting https://docs.python.org/3/index.html
1434971907.451529: Got https://docs.python.org/3/index.html
1434971907.4600112: Scheduling https://docs.python.org/3/genindex-Symbols.html
1434971907.4600625: Getting https://docs.python.org/3/genindex-Symbols.html
1434971917.6517148: Got https://docs.python.org/3/genindex-Symbols.html
1434971917.6789174: Scheduling https://docs.python.org/3/py-modindex.html
1434971917.6789672: Getting https://docs.python.org/3/py-modindex.html
1434971919.454042: Got https://docs.python.org/3/py-modindex.html
1434971919.574361: Scheduling https://docs.python.org/3/genindex.html
1434971919.574434: Getting https://docs.python.org/3/genindex.html
1434971920.5942516: Got https://docs.python.org/3/genindex.html
1434971920.6020699: Scheduling https://docs.python.org/3/index.html
1434971920.6021295: Getting https://docs.python.org/3/index.html
1434971922.1504402: Got https://docs.python.org/3/index.html
1434971922.1589775: Scheduling https://docs.python.org/3/library/__future__.html#module-__future__
1434971922.1590302: Getting https://docs.python.org/3/library/__future__.html#module-__future__
1434971923.30988: Got https://docs.python.org/3/library/__future__.html#module-__future__
1434971923.3215268: Scheduling https://docs.python.org/3/whatsnew/3.4.html
1434971923.321574: Getting https://docs.python.org/3/whatsnew/3.4.html
1434971926.6502898: Got https://docs.python.org/3/whatsnew/3.4.html
1434971926.89331: Scheduling https://docs.python.org/3/../genindex.html
1434971926.8934016: Getting https://docs.python.org/3/../genindex.html
1434971929.0996494: Got https://docs.python.org/3/../genindex.html
1434971929.1068246: Scheduling https://docs.python.org/3/../py-modindex.html
1434971929.1068716: Getting https://docs.python.org/3/../py-modindex.html
1434971932.5949798: Got https://docs.python.org/3/../py-modindex.html
1434971932.717457: Scheduling https://docs.python.org/3/3.3.html
1434971932.7175465: Getting https://docs.python.org/3/3.3.html
1434971934.009238: Got https://docs.python.org/3/3.3.html

Best answer

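Using asyncio doesn't magically make all of your code asynchronous. In this case, requests is blocking: a call to requests.get() does not return control to the event loop until the response has arrived, so all of your coroutines end up waiting for it.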
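
The effect can be seen without any network access. The following is a minimal sketch, not the asker's crawler: `time.sleep` stands in for the blocking `requests.get`, and `asyncio.sleep` stands in for an awaitable HTTP call. The function names (`blocking_fetch`, `cooperative_fetch`, `timed`, `compare`) are made up for illustration, and the code assumes Python 3.7+ with `async`/`await` rather than the `@asyncio.coroutine` style used in the question.

```python
import asyncio
import time


async def blocking_fetch(name):
    # Stand-in for requests.get: time.sleep never yields to the event loop,
    # so no other coroutine can make progress while it runs.
    time.sleep(0.1)
    return name


async def cooperative_fetch(name):
    # Stand-in for an awaitable HTTP call: asyncio.sleep hands control back
    # to the event loop while it waits.
    await asyncio.sleep(0.1)
    return name


async def timed(fetch, names):
    """Run fetch for every name via gather; return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = await asyncio.gather(*(fetch(n) for n in names))
    return results, time.perf_counter() - start


def compare():
    """Return (elapsed with blocking calls, elapsed with awaitable calls)."""
    names = ['a', 'b', 'c']
    _, blocking_elapsed = asyncio.run(timed(blocking_fetch, names))
    _, cooperative_elapsed = asyncio.run(timed(cooperative_fetch, names))
    return blocking_elapsed, cooperative_elapsed


if __name__ == '__main__':
    blocking_elapsed, cooperative_elapsed = compare()
    print('blocking:    %.2fs' % blocking_elapsed)
    print('cooperative: %.2fs' % cooperative_elapsed)
```

Even though both variants are scheduled with `asyncio.gather`, the blocking one should take roughly the sum of the three waits, while the cooperative one should take roughly the length of a single wait.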

There is an asynchronous library called aiohttp that lets you make HTTP requests asynchronously, although it isn't as user-friendly as requests.
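
As a rough sketch of what the crawler might look like with aiohttp, under several assumptions: the names `crawl`, `main`, `LINK_RE`, `max_depth`, and `fanout` are hypothetical, a simple regular expression replaces BeautifulSoup to keep the example self-contained, and the code targets Python 3.7+. The aiohttp calls used (`ClientSession`, `session.get`, `response.text()`) come from its client API; aiohttp itself is a third-party package and is imported inside `main` so that `crawl` can be tried with any other awaitable fetch function. Note also that the loop in the question waits for each child page before starting the next one, so the sketch hands the children to `asyncio.gather` to let them run together.

```python
import asyncio
import re
from urllib.parse import urljoin

# Simplified link extraction (an assumption; the question uses BeautifulSoup).
LINK_RE = re.compile(r'href="([^"#]+\.html)')


async def crawl(url, fetch, depth=0, max_depth=2, fanout=3):
    """Fetch url, then fetch up to `fanout` linked pages concurrently.

    `fetch` is any coroutine function that takes a URL and returns HTML text.
    Returns the list of URLs visited, parents before their children.
    """
    html = await fetch(url)
    pages = [url]
    if depth < max_depth:
        links = [urljoin(url, href) for href in LINK_RE.findall(html)[:fanout]]
        # gather schedules all child crawls at once instead of one by one.
        children = await asyncio.gather(
            *(crawl(link, fetch, depth + 1, max_depth, fanout) for link in links))
        for child_pages in children:
            pages.extend(child_pages)
    return pages


async def main(root):
    # Third-party dependency (pip install aiohttp), imported lazily on purpose.
    import aiohttp

    async with aiohttp.ClientSession() as session:
        async def fetch(url):
            async with session.get(url) as response:
                return await response.text()

        return await crawl(root, fetch)
```

With aiohttp installed, `asyncio.run(main('https://docs.python.org/3/'))` would start the crawl from the same root URL as in the question. Passing `fetch` as a parameter is a design choice made here so the crawling logic can be exercised with a fake fetcher, without touching the network.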

Regarding "python - Adding tasks to Python asyncio", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30977988/
