
python - Multithreading for downloading NCBI files in Python


So recently I have taken on the task of downloading a large number of files from the NCBI database. However, I have run into cases where I have to create multiple databases. The code below works to download all of the viruses from the NCBI website. My question is: is there any way to speed up the process of downloading these files?

Currently the program takes more than 5 hours to run. I have looked into multithreading but could never get it to work, because some of these files take more than 10 seconds to download and I do not know how to handle the stalls (I am new to programming). Also, is there a way of handling urllib2.HTTPError: HTTP Error 502: Bad Gateway? I sometimes get this with certain combinations of retstart and retmax. It crashes the program and I have to restart the download from a different spot by changing the 0 in the for statement (a retry sketch addressing this follows the code below).

import urllib2
from BeautifulSoup import BeautifulSoup

#This is the SearchQuery into NCBI. Spaces are replaced with +'s.
SearchQuery = 'viruses[orgn]+NOT+Retroviridae[orgn]'
#This is the Database that you are searching.
database = 'protein'
#This is the output file for the data
output = 'sample.fasta'


#This is the base url for NCBI eutils.
base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
#Create the search string from the information above
esearch = 'esearch.fcgi?db='+database+'&term='+SearchQuery+'&usehistory=y'
#Create your esearch url
url = base + esearch
#Fetch your esearch using urllib2
print url
content = urllib2.urlopen(url)
#Open url in BeautifulSoup
doc = BeautifulSoup(content)
#Grab the amount of hits in the search
Count = int(doc.find('count').string)
#Grab the WebEnv or the history of this search from usehistory.
WebEnv = doc.find('webenv').string
#Grab the QueryKey
QueryKey = doc.find('querykey').string
#Set the max amount of files to fetch at a time. Default is 500 files.
retmax = 10000
#Create the fetch string
efetch = 'efetch.fcgi?db='+database+'&WebEnv='+WebEnv+'&query_key='+QueryKey
#Select the output format and file format of the files.
#For table visit: http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1
format = 'fasta'
type = 'text'
#Create the options string for efetch
options = '&rettype='+format+'&retmode='+type


#For statement 0 to Count counting by retmax. Use xrange over range
for i in xrange(0,Count,retmax):
    #Create the position string
    poision = '&retstart='+str(i)+'&retmax='+str(retmax)
    #Create the efetch URL
    url = base + efetch + poision + options
    print url
    #Grab the results
    response = urllib2.urlopen(url)
    #Write output to file
    with open(output, 'a') as file:
        for line in response.readlines():
            file.write(line)
    #Gives a sense of where you are
    print Count - i - retmax
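
One way to keep the occasional 502 from killing a run like this is to retry the failing chunk a few times before giving up, instead of restarting the whole loop by hand. A minimal sketch against the same urllib2 setup as above (the helper name, attempt count, and delay are illustrative choices, not part of the original code):

import time
import urllib2

def fetch_with_retry(url, attempts=3, delay=5):
    """Open url, retrying a few times on transient server errors such as 502."""
    for attempt in range(attempts):
        try:
            return urllib2.urlopen(url)
        except urllib2.HTTPError as e:
            # 5xx responses like 502 Bad Gateway are usually transient,
            # so wait a moment and request the same chunk again.
            if e.code >= 500 and attempt < attempts - 1:
                time.sleep(delay)
            else:
                raise

Calling response = fetch_with_retry(url) in place of urllib2.urlopen(url) inside the loop would let a bad retstart/retmax chunk be retried rather than crashing the program.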

Best answer

To download the files using multiple threads:

#!/usr/bin/env python
import shutil
from contextlib import closing
from multiprocessing.dummy import Pool # use threads
from urllib2 import urlopen

def generate_urls(some, params): #XXX pass whatever parameters you need
    for restart in range(*params):
        # ... generate url, filename
        yield url, filename

def download((url, filename)):
    try:
        with closing(urlopen(url)) as response, open(filename, 'wb') as file:
            shutil.copyfileobj(response, file)
    except Exception as e:
        return (url, filename), repr(e)
    else: # success
        return (url, filename), None

def main():
    pool = Pool(20) # at most 20 concurrent downloads
    urls = generate_urls(some, params)
    for (url, filename), error in pool.imap_unordered(download, urls):
        if error is not None:
            print("Can't download {url} to {filename}, "
                  "reason: {error}".format(**locals()))

if __name__ == "__main__":
    main()
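
Applied to the question's efetch loop, generate_urls can yield one URL and output filename per retstart/retmax chunk. A rough sketch reusing the base, efetch, options, Count, and retmax variables defined in the question (writing each chunk to its own file is an assumption here; the original appended everything to one file):

def generate_urls(count, retmax):
    for retstart in xrange(0, count, retmax):
        position = '&retstart=' + str(retstart) + '&retmax=' + str(retmax)
        url = base + efetch + position + options
        # one file per chunk; the chunks can be concatenated afterwards
        filename = 'chunk_%09d.fasta' % retstart
        yield url, filename

With pool.imap_unordered(download, generate_urls(Count, retmax)), any chunk that fails (including an intermittent 502) is reported with its error instead of aborting the run, so only the failed chunks need to be fetched again.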

Regarding python - Multithreading for downloading NCBI files in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/22585819/
