python - Capturing HTTP status codes with a scrapy spider

I'm new to scrapy. I'm writing a spider that is meant to check a long list of URLs for their server status codes and, where appropriate, the URLs they are redirected to. Importantly, if there is a chain of redirects, I need to know the status code and URL at every hop. I'm using response.meta['redirect_urls'] to capture the URLs, but I'm not sure how to capture the status codes - there doesn't appear to be a response meta key for them.

I realise I may need to write some custom middleware to expose these values, but I'm not quite clear how to record the status code for each hop, nor how to access those values from the spider. I've looked around but can't find an example of anyone doing this. If anyone can point me in the right direction it would be much appreciated.

For example,

items = []
item = RedirectItem()
item['url'] = response.url
item['redirected_urls'] = response.meta['redirect_urls']
item['status_codes'] = #????
items.append(item)
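
For context, RedirectItem is not shown in the question; a minimal sketch of what it is assumed to look like, with just the three fields the snippet populates:

from scrapy.item import Item, Field

class RedirectItem(Item):
    url = Field()              # final URL of the response
    redirected_urls = Field()  # every hop's URL, from response.meta['redirect_urls']
    status_codes = Field()     # every hop's HTTP status - the missing piece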

EDIT - Based on feedback from warawauk and some really proactive help from the IRC channel (freenode #scrappy), I've managed to do this. I think it's a bit hacky, so any suggestions for improvement are welcome:

(1) Disable the default redirect middleware in the settings and add your own:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
    'myproject.middlewares.CustomRedirectMiddleware': 100,
}
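
(The scrapy.contrib.* path above belongs to the older Scrapy releases this question was written against. On Scrapy 1.0+ the built-in middleware moved, so the equivalent setting would look roughly like this - a sketch, assuming the same myproject layout:)

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'myproject.middlewares.CustomRedirectMiddleware': 100,
}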

(2) Create your CustomRedirectMiddleware in your middlewares.py. It inherits from the main redirect middleware class and captures the redirects:

# Imports the snippet needs (the scrapy.contrib path matches the older Scrapy
# release this code was written against; urljoin comes from the Python 2
# standard library of that era).
from urlparse import urljoin

from scrapy.contrib.downloadermiddleware.redirect import RedirectMiddleware
from scrapy.http import HtmlResponse
from scrapy.utils.response import get_meta_refresh


class CustomRedirectMiddleware(RedirectMiddleware):
    """Handle redirection of requests based on response status and meta-refresh html tag"""

    def process_response(self, request, response, spider):
        # Record the status code of this hop before any redirect handling.
        # The meta dict is carried over to the redirected request, so the
        # list accumulates one entry per hop.
        request.meta.setdefault('redirect_status', []).append(response.status)

        if 'dont_redirect' in request.meta:
            return response

        if request.method.upper() == 'HEAD':
            if response.status in [301, 302, 303, 307] and 'Location' in response.headers:
                redirected_url = urljoin(request.url, response.headers['location'])
                redirected = request.replace(url=redirected_url)
                return self._redirect(redirected, request, spider, response.status)
            else:
                return response

        if response.status in [302, 303] and 'Location' in response.headers:
            redirected_url = urljoin(request.url, response.headers['location'])
            redirected = self._redirect_request_using_get(request, redirected_url)
            return self._redirect(redirected, request, spider, response.status)

        if response.status in [301, 307] and 'Location' in response.headers:
            redirected_url = urljoin(request.url, response.headers['location'])
            redirected = request.replace(url=redirected_url)
            return self._redirect(redirected, request, spider, response.status)

        if isinstance(response, HtmlResponse):
            interval, url = get_meta_refresh(response)
            if url and interval < self.max_metarefresh_delay:
                redirected = self._redirect_request_using_get(request, url)
                return self._redirect(redirected, request, spider, 'meta refresh')

        return response

(3) You can now access the list of redirect status codes in your spider with

request.meta['redirect_status']
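
As a rough illustration (not part of the original post), a callback could then pair the hops up; parse and RedirectItem here are assumptions carried over from the earlier snippet, and the accumulated meta is read off the response:

def parse(self, response):
    item = RedirectItem()
    item['url'] = response.url
    # Both keys only exist if at least one redirect happened, hence the defaults.
    item['redirected_urls'] = response.meta.get('redirect_urls', [])
    # redirect_status is filled in by CustomRedirectMiddleware above and also
    # includes the status of the final (non-redirect) response.
    item['status_codes'] = response.meta.get('redirect_status', [])
    yield item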

Best answer

Regarding python - capturing HTTP status codes with a scrapy spider, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/10982417/
