
python - How can I use a set and a queue in concurrent code to track jobs that are already done, so they are not queued again?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:18:12

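This is a web crawler for finding broken links. It uses a queue to hold the links it finds and a set so that links it has already seen are not visited again. It works fine single-threaded, but not when I try to use a thread pool. Can you help me fix this?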

The intent is to add a new tuple (link, link_parent) to the queue unless the link is already in the set. Every link the crawler parses is added to that set.
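The rule described here ("queue a link unless it is already in the set") is a check followed by an add, and with several threads those two steps can interleave. A minimal sketch of one way to make the step atomic is shown below. The `SeenSet` class and its `add_if_new` method are hypothetical names introduced for illustration, not part of the original code.

```python
import threading


class SeenSet:
    """A set whose check-and-add step is atomic across threads.

    Hypothetical helper for illustration; not part of the original crawler.
    """

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def add_if_new(self, item):
        """Add item and return True only for the first caller that offers it."""
        with self._lock:
            if item in self._seen:
                return False
            self._seen.add(item)
            return True

    def __len__(self):
        with self._lock:
            return len(self._seen)
```

With a helper like this, a worker would call `q.put((link, parent))` only when `seen.add_if_new(link)` returns True, so a link is marked as seen at the moment it is queued rather than when it is later fetched.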

import requests
from lxml import html
from bs4 import BeautifulSoup
import queue
import concurrent.futures
import time

def iter_q(q):
    # Yield items until the queue is empty at the moment of the check.
    while not q.empty():
        yield q.get()

def do_stuff(curr_website_tuple, already_checked, q):
    curr_website_father = curr_website_tuple[1]
    curr_website = curr_website_tuple[0]
    already_checked.add(curr_website)
    try:
        r = requests.get(curr_website, timeout=10)
        ret_status_code = r.status_code
        # Compare with ==; "is" tests object identity, not numeric equality.
        if r.status_code == 200:
            soup = BeautifulSoup(r.content, "html.parser")
            for link in soup.find_all('a', href=True):
                if (link['href'].startswith("http") and
                        "yahoo." in link['href'] and
                        ".blogs.yahoo." not in link['href'] and
                        "doubleclick." not in link['href'] and
                        "adw.yahoo.com" not in link['href'] and
                        "google." not in link['href'] and
                        link['href'] not in already_checked):
                    q.put((link['href'], curr_website))
            return curr_website + ' ' + curr_website_father + ' ' + str(r.status_code) + ' ' + "|Number checked:" + str(len(already_checked)) + ' ' + "|Queue size:" + str(q.qsize())
        else:
            return "Request_Error: " + ',' + curr_website + ',' + curr_website_father + ',' + str(r.status_code) + '\n'
    except Exception as e:
        return "Error: " + ',' + curr_website + ',' + curr_website_father + ',' + str(e) + '\n'

def with_threads():
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        q = queue.LifoQueue()
        already_checked = set()
        q.put(("http://www.yahoo.com", ''))
        q.put(("http://news.yahoo.com", ''))
        futures_dict = {executor.submit(do_stuff, qe, already_checked, q): qe for qe in iter_q(q)}
        for future in concurrent.futures.as_completed(futures_dict):
            print(future.result())


with_threads()

Best answer

I think the problem may be that you declared already_checked inside the with executor block. Try declaring it outside and see how that works.
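Moving the set may not be sufficient on its own: in the question's code, iter_q is consumed once while the jobs are being submitted, so links that workers put on the queue afterwards are typically never picked up. The sketch below is one possible restructuring, not the answerer's method. It declares already_checked outside the executor block, lets only the main thread touch it, and keeps submitting newly found links as futures complete. The `crawl` function and its `fetch_links` parameter are hypothetical; `fetch_links` stands in for the requests/BeautifulSoup logic in do_stuff and is assumed to return the links found on a page.

```python
import concurrent.futures


def crawl(start_urls, fetch_links, max_workers=2):
    """Visit every reachable URL once and return a {url: parent} mapping.

    fetch_links is a hypothetical callable (url -> iterable of URLs) that
    replaces the network and parsing code of the original do_stuff.
    """
    already_checked = set()  # declared outside the executor; main thread only
    parents = {}             # url -> page on which it was first found
    pending = {}             # future -> url being fetched

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:

        def submit(url, parent):
            # Mark the URL as seen when it is submitted, so it is never queued twice.
            if url not in already_checked:
                already_checked.add(url)
                parents[url] = parent
                pending[executor.submit(fetch_links, url)] = url

        for url in start_urls:
            submit(url, '')

        # Keep going until no job is in flight; finished jobs may add new ones.
        while pending:
            done, _ = concurrent.futures.wait(
                list(pending),
                return_when=concurrent.futures.FIRST_COMPLETED)
            for future in done:
                parent = pending.pop(future)
                try:
                    links = future.result()
                except Exception:
                    continue  # a failed fetch contributes no new links
                for link in links:
                    submit(link, parent)

    return parents
```

Because workers only fetch and return links, the set needs no lock in this layout. The yahoo-specific filtering from the question could be applied inside `fetch_links` or just before the call to `submit`.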

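Regarding "python - How can I use a set and a queue in concurrent code to track jobs that are already done, so they are not queued again?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48118194/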
