redirect - Scrapy handling of 302 response code


I am scraping a website using a simple CrawlSpider implementation. By default, Scrapy follows a 302 redirect to the target location and ignores the originally requested link. On one particular site I came across a page that 302-redirects to another page. My goal is to log both the original link (the one that responds with 302) and the target location (specified in the HTTP response header), and process them in the parse_item method of the CrawlSpider. Please guide me on how I can achieve this.

I came across solutions that mention using dont_redirect=True or REDIRECT_ENABLED=False, but I do not actually want to ignore the redirects; in fact, I want to take the redirected page into account (i.e. not ignore it) as well.

For example: I visit http://www.example.com/page1, which sends a 302 redirect HTTP response pointing to http://www.example.com/page2. By default, Scrapy ignores page1, follows to page2 and processes that. I want to process both page1 and page2 in parse_item.
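To be clear, here is a minimal sketch of what those solutions do, using the example URL above; the spider name is hypothetical, while REDIRECT_ENABLED and the dont_redirect meta key are Scrapy's documented knobs. This stops at page1 instead of processing both pages:

import scrapy


class NoRedirectSpider(scrapy.Spider):
    name = "noredirect"  # hypothetical name, for illustration only
    # project-wide switch: disable the redirect middleware entirely
    custom_settings = {'REDIRECT_ENABLED': False}
    # let the raw 302 response reach the callback (HttpErrorMiddleware honors this)
    handle_httpstatus_list = [302]

    def start_requests(self):
        # per-request switch: the dont_redirect meta key has the same effect
        yield scrapy.Request('http://www.example.com/page1',
                             meta={'dont_redirect': True},
                             callback=self.parse_item)

    def parse_item(self, response):
        # receives page1 with status 302; page2 is never fetched
        self.logger.debug("status=%d url=%s" % (response.status, response.url))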

EDIT
I am already using handle_httpstatus_list = [500, 404] in the spider's class definition to handle 500 and 404 response codes in parse_item, but the same does not work for 302 if I specify it in handle_httpstatus_list.
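For context, a minimal sketch of the setup described above; the spider name, domain and rule are hypothetical, and in Scrapy 1.0.x adding 302 to handle_httpstatus_list alone has no effect because the redirect middleware consumes the response first (see the answer below):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MyCrawlSpider(CrawlSpider):
    name = "mycrawler"  # hypothetical
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/']
    # 500 and 404 reach parse_item; a 302 added here would not
    handle_httpstatus_list = [500, 404]

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.logger.info("status=%d url=%s" % (response.status, response.url))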

Best Answer

Scrapy 1.0.5 (the latest official release as I write these lines) does not use handle_httpstatus_list in the built-in RedirectMiddleware; see this issue.
As of Scrapy 1.1.0 (1.1.0rc1 is available), the issue is fixed.

Even if you disable redirects, you can still mimic their behavior in your callback by checking the Location header and returning a Request to the redirection target.

Example spider:

$ cat redirecttest.py
import scrapy


class RedirectTest(scrapy.Spider):

    name = "redirecttest"
    start_urls = [
        'http://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip'
    ]
    handle_httpstatus_list = [302]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True, callback=self.parse_page)

    def parse_page(self, response):
        self.logger.debug("(parse_page) response: status=%d, URL=%s" % (response.status, response.url))
        if response.status in (302,) and 'Location' in response.headers:
            self.logger.debug("(parse_page) Location header: %r" % response.headers['Location'])
            yield scrapy.Request(
                response.urljoin(response.headers['Location']),
                callback=self.parse_page)

Console log:
$ scrapy runspider redirecttest.py -s REDIRECT_ENABLED=0
[scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
[scrapy] INFO: Optional features available: ssl, http11
[scrapy] INFO: Overridden settings: {'REDIRECT_ENABLED': '0'}
[scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
[scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
[scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
[scrapy] INFO: Enabled item pipelines:
[scrapy] INFO: Spider opened
[scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
[scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/get> (referer: None)
[redirecttest] DEBUG: (parse_page) response: status=200, URL=http://httpbin.org/get
[scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip> (referer: None)
[redirecttest] DEBUG: (parse_page) response: status=302, URL=https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip
[redirecttest] DEBUG: (parse_page) Location header: 'http://httpbin.org/ip'
[scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/ip> (referer: https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip)
[redirecttest] DEBUG: (parse_page) response: status=200, URL=http://httpbin.org/ip
[scrapy] INFO: Closing spider (finished)
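One caveat if you adapt this callback for Python 3 (the run above is on Python 2): Scrapy stores header values as bytes there, so the Location value should be decoded before joining; a hedged adjustment:

location = response.headers['Location'].decode('utf-8')  # header values are bytes on Python 3
yield scrapy.Request(response.urljoin(location), callback=self.parse_page)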

Note that you need handle_httpstatus_list with 302 in it; otherwise, you will see this kind of log (coming from HttpErrorMiddleware):
[scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip> (referer: None)
[scrapy] DEBUG: Ignoring response <302 https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip>: HTTP status code is not handled or not allowed
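As an alternative to listing codes in handle_httpstatus_list, HttpErrorMiddleware also honors the handle_httpstatus_all request meta key, which lets a response with any status through for that single request; a minimal sketch:

yield scrapy.Request(url, meta={'handle_httpstatus_all': True}, callback=self.parse_page)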

Regarding "redirect - Scrapy handling of 302 response code", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35330707/
