
python - Scrapy: check whether a page contains an HTML form element

Reposted. Author: 太空宇宙. Updated: 2023-11-03 20:18:05

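I need a Scrapy script that crawls an entire website and saves only the pages that contain an HTML form tag.

This is my current approach, but it does not work well: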

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'mps'
    allowed_domains = ['some.url.com']
    start_urls = ['https://some.url.com/']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hasForm = response.xpath("//form[@id = 'aspnetForm']/form").extract_first(default='not-found')
        if hasForm == 'not-found':
            pass
        else:
            filename = response.url.split("/")[-2] + '.html'
            with open(filename, 'wb') as f:
                f.write(response.body)
            pass

Update:

I also need to exclude forms that have a specific ID.

Best answer

Example:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'mps'
    allowed_domains = ['some.url.com']
    start_urls = ['https://some.url.com/']

    # Follow every link within the allowed domain and run parse_item on each page.
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # "//form" matches any <form> element anywhere in the page.
        hasForm = response.xpath("//form").extract_first(default='not-found')
        if hasForm != 'not-found':
            # Name the file after the second-to-last segment of the URL.
            page = response.url.split("/")[-2]
            filename = 'test-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)

Regarding "python - Scrapy: check whether a page contains an HTML form element", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58326254/
