
python - How to download multiple files using asyncio and wget in python?


I want to download a lot of files from Dukascopy. A typical URL looks like this:

url = 'http://datafeed.dukascopy.com/datafeed/AUDUSD/2014/01/02/00h_ticks.bi5'

I tried the answer here, but most of the resulting files had a size of 0.

However, when I simply loop over wget (see below), I get the complete files.

import wget
from urllib.error import HTTPError

pair = 'AUDUSD'
for year in range(2014, 2015):
    for month in range(1, 13):
        for day in range(1, 32):
            for hour in range(24):
                try:
                    # month - 1 because the Dukascopy URL uses zero-based months (00-11)
                    url = ('http://datafeed.dukascopy.com/datafeed/' + pair + '/'
                           + str(year) + '/' + str(month - 1).zfill(2) + '/'
                           + str(day).zfill(2) + '/' + str(hour).zfill(2) + 'h_ticks.bi5')
                    filename = (pair + '-' + str(year) + '-' + str(month - 1).zfill(2) + '-'
                                + str(day).zfill(2) + '-' + str(hour).zfill(2) + 'h_ticks.bi5')
                    x = wget.download(url, filename)
                    # print(url)
                except HTTPError as err:
                    if err.code == 404:
                        print((year, month, day, hour))
                    else:
                        raise

I had used the following code earlier for scraping websites, but not for downloading files.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from aiohttp import ClientSession, client_exceptions
from asyncio import Semaphore, ensure_future, gather, run
from json import dumps, loads

limit = 10
http_ok = [200]


async def scrape(url_list):
    tasks = list()
    sem = Semaphore(limit)

    async with ClientSession() as session:
        for url in url_list:
            task = ensure_future(scrape_bounded(url, sem, session))
            tasks.append(task)

        result = await gather(*tasks)

    return result


async def scrape_bounded(url, sem, session):
    async with sem:
        return await scrape_one(url, session)


async def scrape_one(url, session):
    try:
        async with session.get(url) as response:
            content = await response.read()
    except client_exceptions.ClientConnectorError:
        print('Scraping %s failed due to a connection problem' % url)
        return False

    if response.status not in http_ok:
        print('Scraping %s failed due to the return code %s' % (url, response.status))
        return False

    content = loads(content.decode('UTF-8'))

    return content


if __name__ == '__main__':
    urls = ['http://demin.co/echo1/', 'http://demin.co/echo2/']
    res = run(scrape(urls))

    print(dumps(res, indent=4))

There is an answer here that downloads multiple files using multiprocessing, but I think asyncio might be faster.

The 0-sized files may be the server throttling the number of requests, but I would still like to explore whether it is possible to download multiple files using wget and asyncio.
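For reference, since wget.download is a blocking call, one pattern I have seen for combining it with asyncio is to run each download in a thread pool via loop.run_in_executor. Below is a rough, untested sketch of that idea (fetch_one and download_all are placeholder names, not code from any of the linked answers):

import asyncio
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError

import wget


def fetch_one(url, filename):
    # Blocking wget download; returns the saved filename, or None on a 404.
    try:
        return wget.download(url, filename)
    except HTTPError as err:
        if err.code == 404:
            return None
        raise


async def download_all(jobs, max_workers=10):
    # Run the blocking downloads concurrently in a thread pool.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [loop.run_in_executor(pool, fetch_one, url, filename)
                   for url, filename in jobs]
        return await asyncio.gather(*futures)


if __name__ == '__main__':
    jobs = [('http://datafeed.dukascopy.com/datafeed/AUDUSD/2014/01/02/00h_ticks.bi5',
             'AUDUSD-2014-01-02-00h_ticks.bi5')]
    print(asyncio.run(download_all(jobs)))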

Best Answer

Here is an example. The decoding/encoding and the write operation should be adjusted to the target data type.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from aiofile import AIOFile
from aiohttp import ClientSession
from asyncio import ensure_future, gather, run, Semaphore
from calendar import monthlen
from lzma import open as lzma_open
from struct import calcsize, unpack
from io import BytesIO
from json import dumps

http_ok = [200]
limit = 5
base_url = 'http://datafeed.dukascopy.com/datafeed/{}/{}/{}/{}/{}h_ticks.bi5'
fmt = '>3i2f'                  # one tick record: 3 int32 fields followed by 2 float32 fields
chunk_size = calcsize(fmt)


async def download():
    tasks = list()
    sem = Semaphore(limit)

    async with ClientSession() as session:
        for pair in ['AUDUSD']:
            for year in [2014, 2015]:
                for month in range(1, 13):                            # all 12 months
                    for day in range(1, monthlen(year, month) + 1):   # every day of the month
                        for hour in range(24):                        # hours 00-23
                            tasks.append(ensure_future(download_one(pair=pair,
                                                                    year=str(year).zfill(2),
                                                                    month=str(month).zfill(2),
                                                                    day=str(day).zfill(2),
                                                                    hour=str(hour).zfill(2),
                                                                    session=session,
                                                                    sem=sem)))
        return await gather(*tasks)


async def download_one(pair, year, month, day, hour, session, sem):
    # Note: the question's wget loop suggests Dukascopy months are zero-based (00-11),
    # so the month inserted into the URL may need to be shifted by one.
    url = base_url.format(pair, year, month, day, hour)
    data = list()

    async with sem:
        async with session.get(url) as response:
            content = await response.read()

    if response.status not in http_ok:
        print(f'Scraping {url} failed due to the return code {response.status}')
        return

    if content == b'':
        print(f'Scraping {url} failed due to the empty content')
        return

    # The payload is LZMA-compressed binary ticks; unpack them record by record.
    with lzma_open(BytesIO(content)) as f:
        while True:
            chunk = f.read(chunk_size)
            if chunk:
                data.append(unpack(fmt, chunk))
            else:
                break

    # Write the decoded ticks as JSON (adjust this step to the target data type).
    async with AIOFile(f'{pair}-{year}-{month}-{day}-{hour}.bi5', 'w') as fl:
        await fl.write(dumps(data, indent=4))

    return


if __name__ == '__main__':
    run(download())

The source code is available here.
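If the goal is simply to mirror the original compressed .bi5 files (as the question's wget loop does) rather than decode the ticks, the LZMA/struct step can be dropped and the raw response bytes written in binary mode. A sketch of such a variant, assuming the same module-level base_url, http_ok, session and semaphore as above (save_raw_one is a made-up name, not part of the original answer):

from aiofile import AIOFile


async def save_raw_one(pair, year, month, day, hour, session, sem):
    # Hypothetical variant of download_one: store the payload exactly as served.
    url = base_url.format(pair, year, month, day, hour)

    async with sem:
        async with session.get(url) as response:
            content = await response.read()

    if response.status not in http_ok or content == b'':
        print(f'Skipping {url} (status {response.status}, {len(content)} bytes)')
        return

    # 'wb' keeps the raw LZMA-compressed .bi5 bytes without decoding them.
    async with AIOFile(f'{pair}-{year}-{month}-{day}-{hour}h_ticks.bi5', 'wb') as fl:
        await fl.write(content)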

Regarding "python - How to download multiple files using asyncio and wget in python?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61105464/
