
Python multithreaded crawler


Hi! I'm trying to write a web crawler in Python, and I want to use Python multithreading. Even after reading the papers and tutorials suggested earlier, I still have a problem. My code is here (the whole source code is here):

class Crawler(threading.Thread):

    global g_URLsDict
    varLock = threading.Lock()
    count = 0

    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.url = self.queue.get()

    def run(self):
        while 1:
            print self.getName()+" started"
            self.page = getPage(self.url)
            self.parsedPage = getParsedPage(self.page, fix=True)
            self.urls = getLinksFromParsedPage(self.parsedPage)

            for url in self.urls:

                self.fp = hashlib.sha1(url).hexdigest()

                #url-seen check
                Crawler.varLock.acquire() #lock for global variable g_URLs
                if self.fp in g_URLsDict:
                    Crawler.varLock.release() #releasing lock
                else:
                    #print url+" does not exist"
                    Crawler.count +=1
                    print "total links: %d"%len(g_URLsDict)
                    print self.fp
                    g_URLsDict[self.fp] = url
                    Crawler.varLock.release() #releasing lock
                    self.queue.put(url)

                    print self.getName()+ " %d"%self.queue.qsize()
                    self.queue.task_done()
                #self.queue.task_done()
            #self.queue.task_done()


print g_URLsDict
queue = Queue.Queue()
queue.put("http://www.ertir.com")

for i in range(5):
    t = Crawler(queue)
    t.setDaemon(True)
    t.start()

queue.join()

It doesn't work as intended: it gives no results after Thread-1, it runs differently each time, and sometimes it throws this error:

Exception in thread Thread-2 (most likely raised during interpreter shutdown):

How can I fix this? Also, I don't think this is any more efficient than a plain for loop.

I have tried fixing run() like this:

def run(self):
    while 1:
        print self.getName()+" started"
        self.page = getPage(self.url)
        self.parsedPage = getParsedPage(self.page, fix=True)
        self.urls = getLinksFromParsedPage(self.parsedPage)

        for url in self.urls:

            self.fp = hashlib.sha1(url).hexdigest()

            #url-seen check
            Crawler.varLock.acquire() #lock for global variable g_URLs
            if self.fp in g_URLsDict:
                Crawler.varLock.release() #releasing lock
            else:
                #print url+" does not exist"
                print self.fp
                g_URLsDict[self.fp] = url
                Crawler.varLock.release() #releasing lock
                self.queue.put(url)

                print self.getName()+ " %d"%self.queue.qsize()
                #self.queue.task_done()
            #self.queue.task_done()
        self.queue.task_done()

I have experimented with placing the task_done() call in different spots; can anyone explain the difference?

Best Answer

You only call self.url = self.queue.get() once, when the thread is initialized. If you want to pick up new urls for further processing, you need to fetch them from the queue again inside the while loop.

Try replacing self.page = getPage(self.url) with self.page = getPage(self.queue.get()). Note that the get call blocks indefinitely. You will probably want to time out after a while and add some way for your background threads to exit gracefully on request (which would get rid of the exception you saw).
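A worker that times out on get() and checks a stop flag might look roughly like this (a minimal Python 2 sketch to match the question's code; the stop_requested flag and the process() helper are illustrative, not part of the original code):

import Queue
import threading

class Worker(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.stop_requested = False  # illustrative flag; set it from the main thread to shut the worker down

    def run(self):
        while not self.stop_requested:
            try:
                url = self.queue.get(timeout=5)  # wake up periodically instead of blocking forever
            except Queue.Empty:
                continue  # nothing queued yet; loop back and check the stop flag again
            try:
                process(url)  # placeholder for the actual fetching/parsing work
            finally:
                self.queue.task_done()  # balance the successful get() even if processing fails

Because the worker wakes up every few seconds, the main thread can set stop_requested and then join the threads instead of killing daemon threads at interpreter shutdown.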

There are some good examples on effbot.org that use get() in the way I described above.

Edit - a reply to your initial comment:

Take a look at the docs for task_done(): for every call to get() (that does not time out) you should make a matching call to task_done(), which tells any blocked calls to join() that everything on that queue has now been processed. Each call to get() will block (sleep) while it waits for a new url to be posted to the queue.
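As a tiny illustration of that contract (not from the original answer; plain Python 2 to match the rest of the post), join() only returns once every put() item has had a matching task_done():

import Queue

q = Queue.Queue()
for i in range(3):
    q.put(i)

while not q.empty():
    item = q.get()    # each successful get() ...
    print "processing", item
    q.task_done()     # ... needs exactly one matching task_done()

q.join()              # returns immediately here, since every item has been marked done

If a worker gets an item but never calls task_done(), the join() in your main thread blocks forever; if it calls task_done() more times than it called get(), Queue raises ValueError.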

Edit2 - try this alternative run function:

def run(self):
    while 1:
        print self.getName()+" started"
        url = self.queue.get() # <-- note that we're blocking here to wait for a url from the queue
        self.page = getPage(url)
        self.parsedPage = getParsedPage(self.page, fix=True)
        self.urls = getLinksFromParsedPage(self.parsedPage)

        for url in self.urls:

            self.fp = hashlib.sha1(url).hexdigest()

            #url-seen check
            Crawler.varLock.acquire() #lock for global variable g_URLs
            if self.fp in g_URLsDict:
                Crawler.varLock.release() #releasing lock
            else:
                #print url+" does not exist"
                Crawler.count +=1
                print "total links: %d"%len(g_URLsDict)
                print self.fp
                g_URLsDict[self.fp] = url
                Crawler.varLock.release() #releasing lock
                self.queue.put(url)

                print self.getName()+ " %d"%self.queue.qsize()

        self.queue.task_done() # <-- We've processed the url this thread pulled off the queue so indicate we're done with it.

Regarding this Python multithreaded crawler question, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/10800593/
