
python - Scraping a web page that contains ::before

Reposted. Author: 行者123. Updated: 2023-11-28 02:33:37

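My problem is that when I scrape HTML with bs4, I cannot capture content such as ::before.

I want to find out which Sustainable Development Goals the company contributes to on this page: https://www.unglobalcompact.org/participation/report/cop/create-and-submit/active/395091 However, the check marks are not visible in the page source.

What should I do, or what can I use, to scrape this from the site?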

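Best answer

The ::before part is not needed here at all. Selected and unselected elements have different classes: selected ones have selected_question, and unselected ones have advanced_question.

You could parse it like this: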

from bs4 import BeautifulSoup
import requests


url = "https://www.unglobalcompact.org/participation/report/cop/create-and-submit/active/395091"
response = requests.get(url)

# The lxml parser is used here; see the note about html.parser below.
soup = BeautifulSoup(response.content, "lxml")

# Each question is an <li class="question_group"> directly under the questionnaire list.
questions = soup.select("ul.questionnaire > li.question_group")
for question in questions:
    question_text = question.get_text(strip=True)
    print(question_text)

    # The answers are the <li> siblings that follow the question.
    answers = question.find_next_siblings("li")
    for answer in answers:
        answer_text = answer.get_text(strip=True)
        # A checked answer carries the selected_question class.
        is_selected = "selected_question" in answer.get("class", [])

        print(answer_text, is_selected)
    print("-----")

This prints:

Which of the following Sustainable Development Goals (SDGs) do the activities described in your COP address? [Select all that apply]
SDG 1: End poverty in all its forms everywhere False
SDG 2: End hunger, achieve food security and improved nutrition and promote sustainable agriculture False
SDG 3: Ensure healthy lives and promote well-being for all at all ages True
SDG 4: Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all False
...

Note the True printed for the selected answers.
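If only the selected goals are needed, the class can also be targeted directly with a CSS selector. The following is a minimal offline sketch of that idea. The markup in SAMPLE_HTML is hypothetical, modelled only on the class names mentioned in the answer, and the function name selected_answers is made up for illustration; the live page's structure may differ. html.parser is the default here only because the sample is small and well-formed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modelled on the class names mentioned in the answer;
# the real page's structure may differ.
SAMPLE_HTML = """
<ul class="questionnaire">
  <li class="question_group">Which SDGs do the activities address?</li>
  <li class="advanced_question">SDG 1: End poverty</li>
  <li class="selected_question">SDG 3: Ensure healthy lives</li>
  <li class="advanced_question">SDG 4: Quality education</li>
  <li class="selected_question">SDG 13: Climate action</li>
</ul>
"""


def selected_answers(html, parser="html.parser"):
    """Return the text of every <li> that carries the selected_question class."""
    soup = BeautifulSoup(html, parser)
    # The child combinator keeps the match to direct children of the list.
    items = soup.select("ul.questionnaire > li.selected_question")
    return [li.get_text(strip=True) for li in items]


if __name__ == "__main__":
    for text in selected_answers(SAMPLE_HTML):
        print(text)
```

For the live page, the same function could be given response.content and "lxml" as the parser, assuming the class names are as described in the answer.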

I also noticed that this code does not work correctly if html.parser is chosen as the parser.
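Because the result depends on the parser, it can help to make the choice explicit and to know which parser was actually used. The sketch below is one possible way to do that, not something taken from the answer: make_soup is a hypothetical helper that tries a preferred parser and falls back to another when the first is not installed (Beautiful Soup raises FeatureNotFound in that case). The answer does not say why html.parser fails on this page; a plausible cause, stated here only as an assumption, is that different parsers repair malformed markup differently, which changes the sibling structure that find_next_siblings relies on.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred="lxml", fallback="html.parser"):
    """Parse html with the preferred parser, falling back if it is not installed.

    Returns (soup, parser_name) so the caller can tell which parser was used.
    """
    try:
        return BeautifulSoup(html, preferred), preferred
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is unavailable.
        return BeautifulSoup(html, fallback), fallback


if __name__ == "__main__":
    # A deliberately malformed fragment: the <li> tags are never closed.
    broken = (
        "<ul><li class='question_group'>Q"
        "<li class='selected_question'>SDG 3"
        "<li class='advanced_question'>SDG 4</ul>"
    )
    soup, used = make_soup(broken)
    first = soup.find("li")
    # How many following <li> siblings are found may depend on the parser.
    print(used, len(first.find_next_siblings("li")))
```

Running the demo once with each parser on a saved copy of the page is a quick way to check whether the parsed tree differs.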

Regarding "python - Scraping a web page that contains ::before", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47561116/
