
python - Scrapy timing out on certain websites


I tried the following on my own machine:

> scrapy fetch http://google.com/ 

> scrapy fetch http://stackoverflow.com/ 

Both work perfectly, but for some reason www.flyertalk.com does not play well with Scrapy. I keep getting timeout errors (180 seconds):

> scrapy fetch http://www.flyertalk.com/ 

curl, on the other hand, works without any problem:

> curl -s http://www.flyertalk.com/ 

Very strange. Here is the full dump:

2015-11-20 17:35:07 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-20 17:35:07 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-20 17:35:07 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-20 17:35:07 [scrapy] INFO: Enabled item pipelines:
2015-11-20 17:35:07 [scrapy] INFO: Spider opened
2015-11-20 17:35:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:35:07 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6037
2015-11-20 17:36:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:37:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:38:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:38:07 [scrapy] DEBUG: Retrying <GET http://www.flyertalk.com> (failed 1 times): User timeout caused connection failure: Getting http://www.flyertalk.com took longer than 180.0 seconds..
2015-11-20 17:39:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:40:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:41:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:41:07 [scrapy] DEBUG: Retrying <GET http://www.flyertalk.com> (failed 2 times): User timeout caused connection failure: Getting http://www.flyertalk.com took longer than 180.0 seconds..
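
For reference, the 180 seconds in those retry messages is simply Scrapy's default DOWNLOAD_TIMEOUT setting. Lowering it reproduces the failure faster while debugging:

> scrapy fetch http://www.flyertalk.com/ -s DOWNLOAD_TIMEOUT=30

If the fetch still times out even at the lower limit, that suggests the request is being dropped outright rather than the page simply being slow.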

Best Answer

I experimented a bit, and the User-Agent header makes all the difference:

$ scrapy shell http://www.flyertalk.com/ -s USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'
In [1]: response.xpath("//title/text()").extract_first().strip()
Out[1]: u"FlyerTalk - The world's most popular frequent flyer community - FlyerTalk is a living, growing community where frequent travelers around the world come to exchange knowledge and experiences about everything miles and points related."

Without specifying the header, the request just hangs forever. Presumably the site silently drops requests carrying Scrapy's default User-Agent ("Scrapy/VERSION (+http://scrapy.org)"), which plainly identifies the client as a bot.
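
To make the fix permanent instead of passing -s on every invocation, the setting can live on the spider itself. A minimal sketch (the spider name and parse logic here are illustrative, not from the original question):

import scrapy

class FlyerTalkSpider(scrapy.Spider):
    name = "flyertalk"  # hypothetical name, for illustration only
    start_urls = ["http://www.flyertalk.com/"]

    # Per-spider override of the User-Agent sent with every request
    custom_settings = {
        "USER_AGENT": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/46.0.2490.80 Safari/537.36"
        ),
    }

    def parse(self, response):
        # Same title check as the shell session above
        yield {"title": response.xpath("//title/text()").extract_first().strip()}

custom_settings scopes the override to this one spider; setting USER_AGENT in the project's settings.py applies it project-wide instead.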

Regarding "python - Scrapy timing out on certain websites", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33838795/
