
python - Unable to get all HTTP requests in Scrapy


I am trying to crawl a site and check that none of the links to pages within that site are broken. Since no sitemap is available, I am using Scrapy to crawl the site and collect every link on every page, but I cannot get it to output a file containing all the links it finds together with their status codes. The site I am using to test the code is quotes.toscrape.com, and my code is:

from scrapy.spiders import Spider
from mytest.items import MytestItem
from scrapy.http import Request
import re

class MySpider(Spider):
    name = "sample"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        # We stored already crawled links in this list
        crawledLinks = []
        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            if link not in crawledLinks:
                link = "http://quotes.toscrape.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)

I tried adding the following lines after the yield:

item = MytestItem()
item['url'] = link
item['status'] = response.status
yield item

But that gives me a bunch of duplicates and no URLs with a 404 or 301 status. Does anyone know how I can get all of the URLs together with their status codes?

Best answer

By default, Scrapy does not return any unsuccessful requests, but if you set an errback on the request, you can catch them and handle them in one of your own functions.

def parse(self, response):
    # some code
    yield Request(link, self.parse, errback=self.parse_error)

def parse_error(self, failure):
    # log the response as an error
    self.logger.error(repr(failure))
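For the question's goal, the success side can be handled in the same callback: yield one item for the response itself rather than one per extracted link (which is what produced the duplicates), and attach the errback to every follow-up request. This is a minimal sketch, not part of the original answer; it reuses the MytestItem fields from the question and uses response.urljoin in place of the manual string concatenation:

    def parse(self, response):
        # one item per fetched page, recording its final URL and status
        item = MytestItem()
        item['url'] = response.url
        item['status'] = response.status
        yield item

        # follow every link; Scrapy's default duplicate filter drops
        # URLs that have already been scheduled
        for link in response.xpath('//a/@href').extract():
            yield Request(response.urljoin(link), callback=self.parse,
                          errback=self.parse_error)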

The failure argument will contain more information on the exact reason for the failure, since it may be an HTTP error (in which case you can get the response), but it could also be, for example, a DNS lookup error (for which there is no response).

The documentation contains an example of how to use the failure to determine the cause of the error and to access the Response, if one is available:

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

def errback_httpbin(self, failure):
    # log all failures
    self.logger.error(repr(failure))

    # in case you want to do something special for some errors,
    # you may need the failure's type:

    if failure.check(HttpError):
        # these exceptions come from HttpError spider middleware
        # you can get the non-200 response
        response = failure.value.response
        self.logger.error('HttpError on %s', response.url)

    elif failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)

    elif failure.check(TimeoutError, TCPTimedOutError):
        request = failure.request
        self.logger.error('TimeoutError on %s', request.url)
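Applied to the question's spider, the errback itself can turn HTTP failures into items, so 404s and other non-2xx pages end up in the same output as the successful pages. This is a sketch rather than code from the original answer: it assumes a Scrapy version that processes errback output the same way as callback output, and it reuses the MytestItem fields from the question:

    from scrapy.spidermiddlewares.httperror import HttpError

    def parse_error(self, failure):
        if failure.check(HttpError):
            # the HttpError middleware keeps the non-2xx response,
            # so its URL and status code can still be recorded
            response = failure.value.response
            item = MytestItem()
            item['url'] = response.url
            item['status'] = response.status
            yield item
        else:
            # DNS lookup or timeout errors carry no response; just log them
            self.logger.error(repr(failure))

Running the spider with scrapy crawl sample -o links.csv then writes the collected URLs and status codes to a file, which is the output the question asks for.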

Regarding python - Unable to get all HTTP requests in Scrapy, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47251319/
