
python - How to find all the next links with BeautifulSoup

Reposted. Author: 太空宇宙. Updated: 2023-11-04 00:38:00

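I am currently scraping all pages of a particular website by presetting a variable called number_of_pages. Presetting this variable works until a new page is added that I don't know about. For example, the code below is set for 3 pages, but the website now has 4 pages.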

base_url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
number_of_pages = 3
for i in range(1, number_of_pages, 1):
    url_to_scrape = (base_url + str(i))

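I want to use BeautifulSoup to find all the next links on the website to scrape. The code below finds the second URL, but not the third or fourth. How can I build a list of all the pages before scraping them?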

base_url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
CrawlRequest = requests.get(base_url)
raw_html = CrawlRequest.text
linkSoupParser = BeautifulSoup(raw_html, 'html.parser')
page = linkSoupParser.find('div', {'class': 'pagination'})
for list_of_links in page.find('a', href=True, text='next'):
    nextURL = 'https://securityadvisories.paloaltonetworks.com' + list_of_links.parent['href']
    print(nextURL)

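Best answer

There are several different ways to handle pagination. Here is one of them.

The idea is to start an endless loop and break out of it once there is no "next" link: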

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests


with requests.Session() as session:
    page_number = 1
    url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
    while True:
        print("Processing page: #{page_number}; url: {url}".format(page_number=page_number, url=url))
        response = session.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')

        # check if there is next page, break if not
        next_link = soup.find("a", text="next")
        if next_link is None:
            break

        url = urljoin(url, next_link["href"])
        page_number += 1

print("Done.")

If you run it, you will see the following messages printed:

Processing page: #1; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=
Processing page: #2; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=2
Processing page: #3; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=3
Processing page: #4; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=4
Done.

Note that we maintain a web-scraping session with requests.Session to improve performance and persist cookies across requests.
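
The question also asks how to build a list of pages before scraping them. One possible way to do that, sketched here under a few assumptions, is to wrap the same "follow the next link until it disappears" loop in a function that returns the URLs it visited. The names collect_page_urls, fetch_html and max_pages are hypothetical and not part of the original answer. The sketch uses string="next", the name newer versions of Beautiful Soup use for the text= argument, and it assumes the link text is exactly "next", as in the answer above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_page_urls(start_url, fetch_html, max_pages=100):
    """Follow "next" links from start_url and return the list of page URLs.

    fetch_html is any callable that takes a URL and returns that page's HTML
    (hypothetical parameter, so the function can be tried without a network).
    max_pages is a safety cap assumed here, not something the site defines.
    """
    urls = []
    seen = set()
    url = start_url
    # Stop on a repeated URL so a "next" link pointing backwards cannot loop forever.
    while url not in seen and len(urls) < max_pages:
        urls.append(url)
        seen.add(url)
        soup = BeautifulSoup(fetch_html(url), 'html.parser')

        # Same check as in the answer: no "next" link means this is the last page.
        next_link = soup.find("a", string="next", href=True)
        if next_link is None:
            break

        # Resolve a relative href against the current page URL.
        url = urljoin(url, next_link["href"])
    return urls


# Possible usage with requests (not run here, needs network access):
#
#     import requests
#     with requests.Session() as session:
#         pages = collect_page_urls(
#             'https://securityadvisories.paloaltonetworks.com/Home/Index/?page=',
#             lambda u: session.get(u).text,
#         )
```

With the list in hand, each URL can be scraped in a second pass. The trade-off is that every page is downloaded twice unless the fetched HTML is cached, so the single loop from the answer may still be the simpler choice.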

Regarding "python - How to find all the next links with BeautifulSoup", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43075872/
