python - Urllib2 & BeautifulSoup : Nice couple but too slow - urllib3 & threads?

Reposted. Author: 太空狗. Updated: 2023-10-29 22:16:38

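I was looking for a way to optimize my code when I heard some good things about threads and urllib3. Apparently, people disagree about which solution is best.

The problem with my script below is the execution time: it is too slow!

Step 1: I fetch this page: http://www.cambridgeesol.org/institutions/results.php?region=Afghanistan&type=&BULATS=on

Step 2: I parse the page with BeautifulSoup.

Step 3: I put the data into an Excel document.

Step 4: I do this again and again for every country in my list (a big list). I just change "Afghanistan" in the URL to another country.

Here is my code: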

ws = wb.add_sheet("BULATS_IA") #We add a new tab in the excel doc
x = 0 # We need x and y for pulling the data into the excel doc
y = 0
Countries_List = ['Afghanistan','Albania','Andorra','Argentina','Armenia','Australia','Austria','Azerbaijan','Bahrain','Bangladesh','Belgium','Belize','Bolivia','Bosnia and Herzegovina','Brazil','Brunei Darussalam','Bulgaria','Cameroon','Canada','Central African Republic','Chile','China','Colombia','Costa Rica','Croatia','Cuba','Cyprus','Czech Republic','Denmark','Dominican Republic','Ecuador','Egypt','Eritrea','Estonia','Ethiopia','Faroe Islands','Fiji','Finland','France','French Polynesia','Georgia','Germany','Gibraltar','Greece','Grenada','Hong Kong','Hungary','Iceland','India','Indonesia','Iran','Iraq','Ireland','Israel','Italy','Jamaica','Japan','Jordan','Kazakhstan','Kenya','Kuwait','Latvia','Lebanon','Libya','Liechtenstein','Lithuania','Luxembourg','Macau','Macedonia','Malaysia','Maldives','Malta','Mexico','Monaco','Montenegro','Morocco','Mozambique','Myanmar (Burma)','Nepal','Netherlands','New Caledonia','New Zealand','Nigeria','Norway','Oman','Pakistan','Palestine','Papua New Guinea','Paraguay','Peru','Philippines','Poland','Portugal','Qatar','Romania','Russia','Saudi Arabia','Serbia','Singapore','Slovakia','Slovenia','South Africa','South Korea','Spain','Sri Lanka','Sweden','Switzerland','Syria','Taiwan','Thailand','Trinadad and Tobago','Tunisia','Turkey','Ukraine','United Arab Emirates','United Kingdom','United States','Uruguay','Uzbekistan','Venezuela','Vietnam']
Longueur = len(Countries_List)



for Countries in Countries_List:
    y = 0

    htmlSource = urllib.urlopen("http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on" % (Countries)).read() # I am opening the page with the name of the corresponding country in the url
    s = soup(htmlSource)
    tableGood = s.findAll('table')
    try:
        rows = tableGood[3].findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            y = 0
            x = x + 1
            for td in cols:
                hum = td.text
                ws.write(x,y,hum)
                y = y + 1
                wb.save("%s.xls" % name_excel)

    except (IndexError):
        pass

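So I know that not everything is perfect, but I am looking forward to learning new things in Python! The script is very slow because urllib2 and BeautifulSoup are not that fast. For the soup part, I guess I can't really make it better, but for urllib2 I am not so sure.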
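A side note on the URL in the loop above: several entries in Countries_List contain spaces (for example 'Bosnia and Herzegovina'), and the % formatting puts them into the query string unescaped. Below is a minimal sketch of building the same URL with percent-encoding, assuming Python 3's urllib.parse (in Python 2, which the question uses, the equivalent function is urllib.urlencode). The names BASE_URL and build_results_url are made up for this illustration, and the question does not confirm how the server expects spaces to be encoded.

```python
from urllib.parse import urlencode

# Hypothetical constant: the part of the question's URL before the query string.
BASE_URL = "http://www.cambridgeesol.org/institutions/results.php"


def build_results_url(country):
    """Return the results URL for one country, with the name encoded."""
    # Same three query parameters as the hand-written URL in the question.
    # urlencode escapes spaces and other reserved characters for us.
    query = urlencode({"region": country, "type": "", "BULATS": "on"})
    return BASE_URL + "?" + query


print(build_results_url("Bosnia and Herzegovina"))
```

For names without special characters, such as 'Afghanistan', this produces exactly the URL from Step 1.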

EDIT 1: "Multiprocessing useless with urllib2?" seems interesting to me. What do you all think of this potential solution?!

# Make sure that the queue is thread-safe!!

def producer(self):
    # Only need one producer, although you could have multiple
    with open('urllist.txt', 'r') as fh:
        for line in fh:
            self.queue.enqueue(line.strip())

def consumer(self):
    # Fire up N of these babies for some speed
    while True:
        url = self.queue.dequeue()
        dh = urllib2.urlopen(url)
        with open('/dev/null', 'w') as fh:  # gotta put it somewhere
            fh.write(dh.read())
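The snippet above is closer to pseudocode: it never defines self.queue, and the consumers have no way to stop. Here is a runnable sketch of the same producer/consumer idea, assuming Python 3's queue.Queue and threading. The names run_pipeline, fetch, and STOP are hypothetical, and fetch stands in for whatever actually downloads a page.

```python
import queue
import threading


def run_pipeline(urls, fetch, num_consumers=4):
    """Fetch every URL with one producer thread and several consumer threads.

    `fetch` is a hypothetical callable: it takes a URL and returns the body.
    Returns a dict mapping each URL to whatever `fetch` returned for it.
    """
    tasks = queue.Queue()            # queue.Queue is thread-safe
    results = {}
    results_lock = threading.Lock()  # guards writes to the shared dict
    STOP = object()                  # sentinel telling a consumer to exit

    def producer():
        for url in urls:
            tasks.put(url)
        # One sentinel per consumer so that every consumer terminates.
        for _ in range(num_consumers):
            tasks.put(STOP)

    def consumer():
        while True:
            url = tasks.get()
            if url is STOP:
                break
            body = fetch(url)        # errors are not handled in this sketch
            with results_lock:
                results[url] = body

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(num_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With the standard library, fetch could be something like `lambda url: urllib.request.urlopen(url).read()`. Because the threads spend most of their time waiting on the network, several consumers can overlap their requests.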

EDIT 2: urllib3. Can anyone tell me more about it?

Re-use the same socket connection for multiple requests (HTTPConnectionPool and HTTPSConnectionPool) (with optional client-side certificate verification). https://github.com/shazow/urllib3

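Since I am requesting the same website 122 times for different pages, I guess reusing the same socket connection could be interesting. Am I wrong? Couldn't it be faster? ...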

http = urllib3.PoolManager()
r = http.request('GET', 'http://www.bulats.org')
for Pages in Pages_List:
    r = http.request('GET', 'http://www.bulats.org/agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=%s' % (Pages))
    s = soup(r.data)

Best answer

Consider using something like workerpool. Based on its Mass Downloader example, combined with urllib3 it would look something like this:

import workerpool
import urllib3

URL_LIST = [] # Fill this from somewhere

NUM_SOCKETS = 3
NUM_WORKERS = 5

# We want a few more workers than sockets so that they have extra
# time to parse things and such.

http = urllib3.PoolManager(maxsize=NUM_SOCKETS)
workers = workerpool.WorkerPool(size=NUM_WORKERS)

class MyJob(workerpool.Job):
    def __init__(self, url):
        self.url = url

    def run(self):
        r = http.request('GET', self.url)
        # ... do parsing stuff here


for url in URL_LIST:
    workers.put(MyJob(url))

# Send shutdown jobs to all threads, and wait until all the jobs have been completed
# (If you don't do this, the script might hang due to a rogue undead thread.)
workers.shutdown()
workers.wait()

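You may notice from the Mass Downloader example that there are multiple ways of doing this. I chose this particular example just because it is less magical, but any of the other strategies are valid too.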
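As one of those other strategies, here is a minimal sketch that swaps workerpool for the standard library's concurrent.futures (Python 3). This is not the answer's own method. The names scrape_all, flatten_cells, and fetch_rows are hypothetical. fetch_rows is assumed to download and parse one country's page (for example via http.request on a shared PoolManager) and return its table rows as lists of cell strings.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(countries, fetch_rows, max_workers=5):
    """Run fetch_rows(country) for every country on a thread pool.

    `countries` is a list; `fetch_rows` is a hypothetical callable that
    returns a list of rows, each row being a list of cell strings.
    Returns (country, rows) pairs in the same order as `countries`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even though the calls overlap in time.
        return list(zip(countries, pool.map(fetch_rows, countries)))


def flatten_cells(scraped):
    """Turn (country, rows) pairs into (x, y, text) triples for ws.write."""
    cells = []
    x = 0
    for _country, rows in scraped:
        for row in rows:
            x += 1  # same row numbering as the loop in the question
            for y, text in enumerate(row):
                cells.append((x, y, text))
    return cells
```

The main thread can then call ws.write(x, y, text) for each triple and wb.save once at the end. Keeping all workbook access in one thread avoids having to reason about whether the workbook object is thread-safe.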

Disclaimer: I am the author of both urllib3 and workerpool.

About python - Urllib2 & BeautifulSoup : Nice couple but too slow - urllib3 & threads?: a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/10265115/
