gpt4 book ai didi

python - Scrapy Max重定向问题

转载 作者:行者123 更新时间:2023-12-01 06:01:58 25 4
gpt4 key购买 nike

我只是想抓取一个页面

start_urls = ['https://www.mileageplusshopping.com/shopping/b____alpha.htm']

但是一次又一次的重定向,最后scrapy丢弃

尽管我尝试过

REDIRECT_MAX_TIMES=100

这个设置也是如此,但仍然重定向100次并且scrapy丢弃

如有任何帮助,我们将不胜感激

这是日志。

2012-04-02 20:10:53+0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Discarding <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>: max redirections reached
2012-04-02 20:10:54+0500 [mileageplusshopping] ERROR: Error downloading <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>:
2012-04-02 20:10:54+0500 [mileageplusshopping] INFO: Closing spider (finished)

我使用的是 scrapy 0.14

这是我的设置类

BOT_NAME = 'mall_crawler'
BOT_VERSION = '1.0'

SPIDER_MODULES = ['mall_crawler.spiders']
NEWSPIDER_MODULE = 'mall_crawler.spiders'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8'

RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOAD_DELAY = 1

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0

DOWNLOADER_MIDDLEWARES = {
'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': None,
}

SCHEDULER_ORDER = 'BFO'

最佳答案

我找到了解决方案,所以我想与大家分享

只是因为

HTTPCACHE_ENABLED = True  

实际上start_url是https://www.mileageplusshopping.com/shopping/b____alpha.htm

重定向到https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=错误

重定向到https://x.www.mileageplusshopping.com/shopping/b____alpha.htm

最后重定向到https://www.mileageplusshopping.com/shopping/b____alpha.htm

如果您查看第一个请求和最后一个请求,两者都是相同

这就是为什么最后一个请求它在缓存中发现了这个请求并开始循环,所以如果我们不缓存页面一切都很好。

或者如果我们想缓存页面,我们需要手动处理所有这些。

关于python - Scrapy Max重定向问题,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/9978917/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com