gpt4 book ai didi

python - Scrapy 错误 - HTTP 状态代码未处理或不允许

转载 作者:太空宇宙 更新时间:2023-11-03 13:41:32 28 4
gpt4 key购买 nike

我正在尝试运行蜘蛛但有这个日志:

2015-05-15 12:44:43+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: reviews)
2015-05-15 12:44:43+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-05-15 12:44:43+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'reviews.spiders', 'SPIDER_MODULES': ['reviews.spiders'], 'DOWNLOAD_DELAY': 2, 'BOT_NAME': 'reviews'}
2015-05-15 12:44:43+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-15 12:44:43+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-15 12:44:43+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-15 12:44:43+0100 [scrapy] INFO: Enabled item pipelines:
2015-05-15 12:44:43+0100 [theverge] INFO: Spider opened
2015-05-15 12:44:43+0100 [theverge] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-15 12:44:43+0100 [scrapy] ERROR: Error caught on signal handler: <bound method ?.start_listening of <scrapy.telnet.TelnetConsole instance at 0x105127b48>>
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 1107, in _inlineCallbacks
result = g.send(result)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/core/engine.py", line 77, in start
yield self.signals.send_catch_log_deferred(signal=signals.engine_started)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
return signal.send_catch_log_deferred(*a, **kw)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
*arguments, **named)
--- <exception caught here> ---
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 140, in maybeDeferred
result = f(*args, **kw)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/xlib/pydispatch/robustapply.py", line 54, in robustApply
return receiver(*arguments, **named)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/telnet.py", line 47, in start_listening
self.port = listen_tcp(self.portrange, self.host, self)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/reactor.py", line 14, in listen_tcp
return reactor.listenTCP(x, factory, interface=host)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/posixbase.py", line 495, in listenTCP
p.startListening()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/tcp.py", line 984, in startListening
raise CannotListenError(self.interface, self.port, le)
twisted.internet.error.CannotListenError: Couldn't listen on 127.0.0.1:6073: [Errno 48] Address already in use.

第一个错误开始出现在所有蜘蛛中,但其他蜘蛛仍然工作。 The:“[Errno 48] 地址已在使用中。”然后是:

2015-05-15 12:44:43+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6198
2015-05-15 12:44:44+0100 [theverge] DEBUG: Crawled (403) <GET http://www.theverge.com/reviews> (referer: None)
2015-05-15 12:44:44+0100 [theverge] DEBUG: Ignoring response <403 http://www.theverge.com/reviews>: HTTP status code is not handled or not allowed
2015-05-15 12:44:44+0100 [theverge] INFO: Closing spider (finished)
2015-05-15 12:44:44+0100 [theverge] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 191,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 265,
'downloader/response_count': 1,
'downloader/response_status_count/403': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 5, 15, 11, 44, 44, 136026),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 5, 15, 11, 44, 43, 829689)}
2015-05-15 12:44:44+0100 [theverge] INFO: Spider closed (finished)
2015-05-15 12:44:44+0100 [scrapy] ERROR: Error caught on signal handler: <bound method ?.stop_listening of <scrapy.telnet.TelnetConsole instance at 0x105127b48>>
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 1107, in _inlineCallbacks
result = g.send(result)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/core/engine.py", line 300, in _finish_stopping_engine
yield self.signals.send_catch_log_deferred(signal=signals.engine_stopped)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
return signal.send_catch_log_deferred(*a, **kw)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
*arguments, **named)
--- <exception caught here> ---
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 140, in maybeDeferred
result = f(*args, **kw)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/xlib/pydispatch/robustapply.py", line 54, in robustApply
return receiver(*arguments, **named)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/telnet.py", line 53, in stop_listening
self.port.stopListening()
exceptions.AttributeError: TelnetConsole instance has no attribute 'port'

错误“exceptions.AttributeError: TelnetConsole instance has no attribute 'port'”对我来说是新的...不知道发生了什么,因为我所有其他网站的蜘蛛都运行良好。

谁能告诉我如何解决?

编辑:

重新启动后,此错误消失了。但是仍然不能用这个蜘蛛爬行......现在这里是日志:

2015-05-15 15:46:55+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: reviews)piders_toshub/reviews (spiderDev) $ scrapy crawl theverge 
2015-05-15 15:46:55+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-05-15 15:46:55+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'reviews.spiders', 'SPIDER_MODULES': ['reviews.spiders'], 'DOWNLOAD_DELAY': 2, 'BOT_NAME': 'reviews'}
2015-05-15 15:46:55+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-15 15:46:55+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-15 15:46:55+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-15 15:46:55+0100 [scrapy] INFO: Enabled item pipelines:
2015-05-15 15:46:55+0100 [theverge] INFO: Spider opened
2015-05-15 15:46:55+0100 [theverge] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-15 15:46:55+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-15 15:46:55+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-15 15:46:56+0100 [theverge] DEBUG: Crawled (403) <GET http://www.theverge.com/reviews> (referer: None)
2015-05-15 15:46:56+0100 [theverge] DEBUG: Ignoring response <403 http://www.theverge.com/reviews>: HTTP status code is not handled or not allowed
2015-05-15 15:46:56+0100 [theverge] INFO: Closing spider (finished)
2015-05-15 15:46:56+0100 [theverge] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 191,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 265,
'downloader/response_count': 1,
'downloader/response_status_count/403': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 5, 15, 14, 46, 56, 8769),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 5, 15, 14, 46, 55, 673723)}
2015-05-15 15:46:56+0100 [theverge] INFO: Spider closed (finished)

这个“2015-05-15 15:46:56+0100 [theverge] DEBUG: Ignoring response <403 http://www.theverge.com/reviews >: HTTP status code is not handled or not allowed”很奇怪,因为我使用的是 download_delay = 2 和上周我可以毫无问题地抓取该网站...会发生什么?

最佳答案

Address already in use 会暗示其他东西正在监听那个端口,很可能你正在并行运行另一个蜘蛛?第二个错误只是第一个错误的结果,因为它没有正确实例化端口,现在找不到它来关闭它。

我会建议重新启动以确保没有端口仍在使用,并且只运行一个 spider 以查看它是否正常工作。如果再次发生这种情况,您可以使用 netstat 或类似工具调查哪个应用程序正在使用该端口。

更新:HTTP 错误 403 Forbidden 很可能意味着您因发出过多请求而被网站禁止。要解决这个问题,请使用代理服务器。结帐Scrapy HttpProxyMiddleware .

关于python - Scrapy 错误 - HTTP 状态代码未处理或不允许,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/30258804/

28 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com