
python - Downloading multiple files with gevent

Reposted. Author: 太空宇宙. Updated: 2023-11-03 18:22:39

I am trying to use gevent to download a list of files in parallel.

My code is a slightly modified version of the code suggested here

from gevent import monkey
monkey.patch_all()

import os
import urllib2

import gevent
from gevent.fileobject import FileObject

# utls (used below) is the asker's own helper module; it is not shown in the post.


def download_xbrl_files(download_folder, yq, list_of_xbrl_urls):
    def download_and_save_file(url, yr, qtr):
        if url is not None:
            full_url = "http://edgar.sec.gov" + url
            # Note: this tests the URL string, not the local file path.
            if not os.path.exists(full_url):
                try:
                    content = urllib2.urlopen(full_url).read()
                    # y and q come from the enclosing scope (unpacked from yq below).
                    filename = download_folder + "/" + str(y) + "/" + q + "/" + url.split('/')[-1]
                    print "Saving: ", filename
                    f_raw = open(filename, "w")
                    f = FileObject(f_raw, "w")
                    try:
                        f.write(content)
                    finally:
                        f.close()
                    return 'Done'
                except:
                    print "Warning: can't save or access for item:", url
                    return None
            else:
                return 'Exists'
        else:
            return None

    (y, q) = yq
    if utls.has_elements(list_of_xbrl_urls):
        # Cleanup: drop None entries and duplicates before spawning greenlets.
        filter_for_none = filter(lambda x: x is not None, list_of_xbrl_urls)
        no_duplicates = list(set(filter_for_none))
        download_files = [gevent.spawn(lambda x: download_and_save_file(x, y, q), x) for x in no_duplicates]
        gevent.joinall(download_files)
        return 'completed'
    else:
        return 'empty'

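What the code does:

  1. After some cleanup,
  2. gevent.spawn spawns download_and_save_file, which:
  3. checks whether the file has already been downloaded
  4. if not, downloads the content with urllib2.urlopen(full_url).read()
  5. saves the file with the help of gevent's FileObject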
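The cleanup and "already downloaded" steps above can be sketched on their own, separate from any networking. This is a minimal Python 3 sketch, not the asker's code: `plan_downloads` and `BASE_URL` are hypothetical names, the host is the one from the question, and the check is made against the local path rather than the URL.

```python
import os

BASE_URL = "http://edgar.sec.gov"  # host taken from the question's code


def plan_downloads(download_folder, yq, urls):
    """Return sorted (full_url, local_path) pairs for files not yet on disk."""
    year, quarter = yq
    # Cleanup: drop None/empty entries and duplicates.
    unique_urls = sorted({u for u in urls if u})
    plan = []
    for url in unique_urls:
        local_path = os.path.join(
            download_folder, str(year), quarter, url.split("/")[-1]
        )
        # Test the local path (not the URL) to decide whether to download.
        if not os.path.exists(local_path):
            plan.append((BASE_URL + url, local_path))
    return plan
```

Each returned pair can then be handed to whatever does the actual download, whether that is a greenlet or a thread.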

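My impression is that download_and_save_file only runs sequentially. In addition, my application seems to hang. I could add a timeout, but I would like to handle failures gracefully in the code.

I am wondering whether I am doing something wrong. This is my first time writing code in Python.

Edit

Here is a version of the code that uses threads: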

import os
import threading
import urllib2


def download_xbrl_files(download_folder, yq_and_url):
    (yq, url) = yq_and_url
    (yr, qtr) = yq
    # Compare by value: "is not ''" tests object identity, which is unreliable for strings.
    if url is not None and url != '':
        full_url = "http://edgar.sec.gov" + url
        filename = download_folder + "/" + str(yr) + "/" + qtr + "/" + url.split('/')[-1]
        if not os.path.exists(filename):
            try:
                content = urllib2.urlopen(full_url).read()
                print "Saving: ", filename
                f = open(filename, "wb")
                try:
                    f.write(content)
                    print "Writing done: ", filename
                finally:
                    f.close()
                return 'Done'
            except:
                print "Warning: can't save or access for item:", url
                return None
        else:
            print "Exists: ", filename
            return 'Exists'
    else:
        return None


def download_filings(download_folder, yq_and_filings):
    # One thread per (yq, url) pair; start them all, then wait for all of them.
    threads = [threading.Thread(target=download_xbrl_files, args=(download_folder, x,)) for x in yq_and_filings]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

Best answer

I looked into this more deeply, and the basic problem is that gevent.spawn() creates greenlets, not processes (all greenlets run in a single OS thread).

Try something simple:

import gevent
from time import sleep
g = [gevent.spawn(sleep, 1) for x in range(100)]
gevent.joinall(g)

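You will see that this takes about 100 seconds: the sleep imported from time is a blocking call, so each greenlet holds the single OS thread for its full second before the next one can run. That proves the point above.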
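The same measurement can be made with the standard library alone. The sketch below (Python 3; `run_sequentially` and `run_in_threads` are hypothetical helper names) times a handful of blocking sleeps first one after another, which is what the greenlet experiment above amounts to, and then on separate OS threads.

```python
import threading
import time


def run_sequentially(n, delay):
    """Call the blocking sleep n times in a row; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        time.sleep(delay)
    return time.perf_counter() - start


def run_in_threads(n, delay):
    """Run n blocking sleeps on n OS threads; return elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=time.sleep, args=(delay,)) for _ in range(n)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start
```

With n sleeps of the same delay, the sequential run should take roughly n times the delay, while the threaded run should take roughly one delay, because the sleeps overlap.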

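What you are really looking for is multithreading, which is available in the threading module. Take a look at this question, How to use threading in Python?, for some pointers on how to do it.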
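For the download case, one way to combine threads with graceful failure handling is a bounded pool of worker threads. The sketch below is an assumption-laden illustration rather than the code from the question: it targets Python 3, swaps raw threading.Thread objects for the standard library's concurrent.futures.ThreadPoolExecutor, and uses hypothetical names (`fetch`, `download_all`, `fetch_one`). Failures are recorded per URL instead of being swallowed by a bare except.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return its bytes; raises on network errors."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch every URL on a worker thread; return (results, failures) dicts."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it is fetching.
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Record the failure so the caller can decide what to do.
                failures[url] = exc
    return results, failures
```

The `fetch_one` parameter exists so that the download step can be replaced, for example with a stub in tests. The `timeout` passed to urlopen keeps a single stalled connection from blocking its worker indefinitely.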

---Update---

Here is a simple example of how to do this:

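import threading
from time import sleep

# Ten one-second sleeps on ten threads: start them all, then wait for all.
threads = [threading.Thread(target=sleep, args=(1,)) for x in range(10)]
[thread.start() for thread in threads]
[thread.join() for thread in threads]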
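One thing to keep in mind with this pattern is that threading.Thread discards the return value of its target, so status values such as 'Done' or 'Exists' returned by a download function are lost. A minimal sketch of one way to keep them, assuming Python 3 and a hypothetical helper named `run_and_collect`:

```python
import threading


def run_and_collect(func, items):
    """Run func(item) on one thread per item; return {item: result}."""
    results = {}
    lock = threading.Lock()

    def worker(item):
        value = func(item)
        with lock:  # guard the shared dict while storing the result
            results[item] = value

    threads = [threading.Thread(target=worker, args=(item,)) for item in items]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

If `func` raises inside a thread, that item is simply missing from the returned dict, so a caller can compare the keys against the input to find failures.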

Regarding python - downloading multiple files with gevent, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23823257/
