
python - Unable to improve performance while parsing links from landing pages


I'm trying to use concurrent.futures in the following script to process things concurrently. The problem is that even when I use concurrent.futures, the performance stays the same. It seems to have no effect on the execution at all, meaning it fails to improve the performance.

I know I could make concurrent.futures work if I created another function and passed the links collected by get_titles() to it in order to scrape the titles from their inner pages. However, I wish to get the titles from the landing pages using the function I've created below.

I used an iterative approach instead of recursion only because, had I chosen the latter, the function would throw a recursion error once it was called more than 1000 times.
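For illustration only, a recursive version would look roughly like the sketch below (hypothetical; it reuses the same imports and globals as the script further down). Each pagination step adds a stack frame, so a long pager chain blows past Python's default recursion limit of about 1000 frames:

def get_titles_recursive(link):
    # Hypothetical recursive variant: one stack frame per result page,
    # so it raises RecursionError once the chain exceeds the recursion limit.
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    for item in soup.select(".summary > h3"):
        post_title = item.select_one("a.question-hyperlink").get("href")
        print(urljoin(base, post_title))

    next_page = soup.select_one(".pager > a[rel='next']")
    if next_page:
        get_titles_recursive(urljoin(base, next_page.get("href")))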

This is how I've attempted it so far (the site link used in the script is a placeholder):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures

base = 'https://stackoverflow.com'
link = 'https://stackoverflow.com/questions/tagged/web-scraping'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    while True:
        res = requests.get(link, headers=headers)
        soup = BeautifulSoup(res.text, "html.parser")
        for item in soup.select(".summary > h3"):
            post_title = item.select_one("a.question-hyperlink").get("href")
            print(urljoin(base, post_title))

        next_page = soup.select_one(".pager > a[rel='next']")

        if not next_page: return
        link = urljoin(base, next_page.get("href"))

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_titles, url): url for url in [link]}
        futures.as_completed(future_to_url)

Question:

How can I improve the performance while parsing links from landing pages?

Edit: I know I could achieve the same thing along the lines below, but that is not my original attempt:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures

base = 'https://stackoverflow.com'
links = ['https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'.format(i) for i in range(1,5)]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    for item in soup.select(".summary > h3"):
        post_title = item.select_one("a.question-hyperlink").get("href")
        print(urljoin(base, post_title))

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_titles, url): url for url in links}
        futures.as_completed(future_to_url)

Best answer

Since your scraper is already using threads, why not "spawn" more workers to handle the follow-up URLs collected from the landing pages?

For example:

import concurrent.futures as futures
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base = "https://stackoverflow.com"
links = [
    f"{base}/questions/tagged/web-scraping?tab=newest&page={i}&pagesize=30"
    for i in range(1, 5)
]

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36",
}


def threader(function, target, workers=5):
    # Wrapper around ThreadPoolExecutor: submits one job per item in target.
    # Exiting the with-block waits for every submitted job to finish.
    with futures.ThreadPoolExecutor(max_workers=workers) as executor:
        jobs = {executor.submit(function, item): item for item in target}
        futures.as_completed(jobs)


def make_soup(page_url: str) -> BeautifulSoup:
    # Pass the headers defined above so every request carries the custom User-Agent.
    return BeautifulSoup(requests.get(page_url, headers=headers).text, "html.parser")


def process_page(page: str):
    # Pull the "Viewed N times" block from the question page, if present.
    s = make_soup(page).find("div", class_="grid--cell ws-nowrap mb8")
    views = s.getText() if s is not None else "Missing data"
    print(f"{page}\n{' '.join(views.split())}")


def make_pages(soup_of_pages: BeautifulSoup) -> list:
    # Collect the absolute question URLs listed on one search-results page.
    return [
        urljoin(base, item.select_one("a.question-hyperlink").get("href"))
        for item in soup_of_pages.select(".summary > h3")
    ]


def crawler(link):
    while True:
        soup = make_soup(link)
        # Hand the question URLs found on this page to a second pool of workers.
        threader(process_page, make_pages(soup), workers=10)
        next_page = soup.select_one(".pager > a[rel='next']")
        if not next_page:
            return
        link = urljoin(base, next_page.get("href"))


if __name__ == '__main__':
    threader(crawler, links)

Sample run output:

https://stackoverflow.com/questions/66463025/exporting-several-scraped-tables-into-a-single-csv-file
Viewed 19 times
https://stackoverflow.com/questions/66464511/can-you-find-the-parent-of-the-soup-in-beautifulsoup
Viewed 32 times
https://stackoverflow.com/questions/66464583/r-subscript-out-of-bounds-for-reading-an-html-link
Viewed 22 times

and more ...

Rationale:

Essentially, what you're doing in your initial approach is spawning workers to fetch the question URLs from the search pages. You never process those follow-up URLs.

My suggestion is to spawn additional workers to process what the crawling workers collect.

In your question you mentioned:

I wish to get the titles from landing pages

That is what this adjusted version of your initial approach tries to accomplish by leveraging the threader() function, which is essentially a wrapper around ThreadPoolExecutor().
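If you would rather collect what the workers produce instead of printing from inside them, the same wrapper idea extends naturally. A minimal sketch (assuming the worker function is changed to return its value rather than print it):

def threader_collect(function, target, workers=5):
    # Variant of threader() that gathers whatever each worker returns.
    results = []
    with futures.ThreadPoolExecutor(max_workers=workers) as executor:
        jobs = {executor.submit(function, item): item for item in target}
        for job in futures.as_completed(jobs):
            results.append(job.result())  # .result() also re-raises worker exceptions
    return results

Calling job.result() has the side benefit of surfacing any exception raised inside a worker, which would otherwise stay hidden in the unconsumed futures.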

The original question on Stack Overflow: https://stackoverflow.com/questions/66488593/
