
python - Unable to access some text located outside the targeted element

Reposted. Author: 行者123. Updated: 2023-12-01 07:58:55

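I wrote a script with scrapy to fetch the answers to different questions from a web page. The problem is that the answers lie outside the elements I'm currently targeting. I know that with BeautifulSoup I could get them using .next_sibling, but I can't find a way to do the same with scrapy.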

website link

The HTML elements look like this:

  <p>
<b>
<span class="blue">
Q:1-The NIST Information Security and Privacy Advisory Board (ISPAB) paper "Perspectives on Cloud Computing and Standards" specifies potential advantages and disdvantages of virtualization. Which of the following disadvantages does it include?
</span>
<br/>
Mark one answer:
</b>
<br/>
<input name="quest1" type="checkbox" value="1"/>
It initiates the risk that malicious software is targeting the VM environment.
<br/>
<input name="quest1" type="checkbox" value="2"/>
It increases overall security risk shared resources.
<br/>
<input name="quest1" type="checkbox" value="3"/>
It creates the possibility that remote attestation may not work.
<br/>
<input name="quest1" type="checkbox" value="4"/>
All of the above
</p>

What I have tried so far:

import requests
from scrapy import Selector

url = "https://www.test-questions.com/csslp-exam-questions-01.php"

res = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
sel = Selector(res)
for item in sel.css("[name^='quest']::text").getall():
    print(item)

When run, the script above prints nothing and raises no error.

One of the expected outputs for the HTML pasted above is:

It initiates the risk that malicious software is targeting the VM environment.

I'm only after a CSS selector solution.

How can I grab the answers to the different questions from that site?

Best answer

The following combination of simple CSS selectors and Python list operations solves this task:

import scrapy
from scrapy.crawler import CrawlerProcess

class QuestionsSpider(scrapy.Spider):
    name = "TestSpider"
    start_urls = ["https://www.test-questions.com/csslp-exam-questions-01.php"]

    def parse(self, response):
        # select <p> tag elements with questions/answers
        questions_p_tags = [p for p in response.css("form p")
                            if '<span class="blue"' in p.extract()]
        for p in questions_p_tags:
            # select question and answer variants inside every <p> tag
            item = dict()
            item["question"] = p.css("span.blue::text").extract_first()
            # following list comprehension - select all text, filter empty text elements
            # and select last 4 text elements as answer variants
            item["variants"] = [variant.strip() for variant in p.css("::text").extract() if variant.strip()][-4:]
            yield item


if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(QuestionsSpider)
    c.start()

Regarding "python - Unable to access some text located outside the targeted element", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55810706/
