
python - Navigating with Selenium and scraping with BeautifulSoup in Python


OK, here's what I'm trying to achieve:

  1. Call a URL with a dynamically filtered list of search results
  2. Click the first search result (5 per page)
  3. Scrape the headline, paragraphs and images and store them as a JSON object in a separate file, e.g. (a sketch of this step follows the list)

    {
    "Title": "Headline element of the individual entry",
    "Content": "Paragraphs and images in DOM order of the individual entry"
    }

  4. Navigate back to the search results overview page and repeat steps 2 - 3

  5. After scraping 5/5 results, go to the next page (click the pagination link)
  6. Repeat steps 2 - 5 until there are no entries left
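
For step 3 specifically, the extraction could look roughly like the sketch below; the h1/article selectors are placeholders for whatever the real pages use, and driver is the Selenium session sitting on a detail page (see my actual attempt further down):

#Sketch of step 3 (placeholder selectors): pull the headline plus paragraphs
#and images in DOM order, then dump them to one JSON file per entry
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'lxml') #detail page source from Selenium
title = soup.select_one('h1').get_text(strip=True)

content = []
for node in soup.select('article p, article img'): #select() keeps document order
    if node.name == 'img':
        content.append(node.get('src'))
    else:
        content.append(node.get_text(strip=True))

with open(f'{title}.json', 'w') as f:
    json.dump({'Title': title, 'Content': content}, f)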

To visualize the intent again: (screenshot omitted)

What I currently have is:

#import libraries
from selenium import webdriver
from bs4 import BeautifulSoup

#URL
url = "https://URL.com"

#Create a browser session
driver = webdriver.Chrome("PATH TO chromedriver.exe")
driver.implicitly_wait(30)
driver.get(url)

#click consent btn on destination URL ( overlays rest of the content )
python_consentButton = driver.find_element_by_id('acceptAllCookies')
python_consentButton.click() #click cookie consent btn

#Selenium hands the page source to Beautiful Soup
soup_results_overview = BeautifulSoup(driver.page_source, 'lxml')

for link in soup_results_overview.findAll("a", class_="searchResults__detail"):
    #Selenium visits each Search Result Page
    searchResult = driver.find_element_by_class_name('searchResults__detail')
    searchResult.click() #click Search Result

    #Ask Selenium to go back to the search results overview page
    driver.back()

#Tell Selenium to click paginate "next" link
#probably needs to be in a surrounding for loop?
paginate = driver.find_element_by_class_name('pagination-link-next')
paginate.click() #click paginate next

driver.quit()

Problem
Every time Selenium navigates back to the search results overview page, the list count resets, so it clicks the first entry 5 times, then moves on to the next 5 items and stops.

I'm not sure whether this is a case that calls for a recursive approach.
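
Alternatively, re-locating the result links by index on every pass might sidestep the reset; a rough, untested sketch of that idea, reusing the class names from my snippet above:

#Sketch (untested): re-find the result links by index on each pass instead of
#keeping references that go stale after driver.back()
while True:
    results = driver.find_elements_by_class_name('searchResults__detail')
    for i in range(len(results)):
        #re-query before every click; back() reloads the overview page
        driver.find_elements_by_class_name('searchResults__detail')[i].click()
        #...scrape title, paragraphs and images here...
        driver.back()
    #advance to the next results page, or stop when there is none
    next_links = driver.find_elements_by_class_name('pagination-link-next')
    if not next_links:
        break
    next_links[0].click()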

Any suggestions on how to solve this would be greatly appreciated.

Best Answer

You can do the scraping with just requests and BeautifulSoup, no Selenium needed. It will be faster and consume far fewer resources:

import json
import requests
from bs4 import BeautifulSoup

# Get 1000 results
params = {"$filter": "TemplateName eq 'Application Article'", "$orderby": "ArticleDate desc", "$top": "1000",
          "$inlinecount": "allpages", }
response = requests.get("https://www.cst.com/odata/Articles", params=params).json()

# iterate 1000 results
articles = response["value"]
for article in articles:
    article_json = {}
    article_content = []

    # title of article
    article_title = article["Title"]
    # article url
    article_url = str(article["Url"]).split("|")[1]
    print(article_title)

    # request article page and parse it
    article_page = requests.get(article_url).text
    page = BeautifulSoup(article_page, "html.parser")

    # get header
    header = page.select_one("h1.head--bordered").text
    article_json["Title"] = str(header).strip()
    # get body content with images links and descriptions
    content = page.select("section.content p, section.content img, section.content span.imageDescription, "
                          "section.content em")
    # collect content to json format
    for x in content:
        if x.name == "img":
            article_content.append("https://cst.com/solutions/article/" + x.attrs["src"])
        else:
            article_content.append(x.text)

    article_json["Content"] = article_content

    # write to json file
    with open(f"{article_json['Title']}.json", 'w') as to_json_file:
        to_json_file.write(json.dumps(article_json))

print("the end")
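
If the collection ever holds more than 1000 entries, the same endpoint can presumably be paged with the standard OData $skip/$top query options instead of a single large $top request; a minimal sketch, untested against this server:

import requests

#Sketch (untested): page through the OData endpoint in batches
page_size = 100
skip = 0
all_articles = []
while True:
    params = {
        "$filter": "TemplateName eq 'Application Article'",
        "$orderby": "ArticleDate desc",
        "$top": str(page_size),
        "$skip": str(skip),
    }
    batch = requests.get("https://www.cst.com/odata/Articles", params=params).json()["value"]
    if not batch:
        break
    all_articles.extend(batch)
    skip += page_size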

Regarding python - Navigating with Selenium and scraping with BeautifulSoup in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55197425/
