
python - How do I connect to an https site with Scrapy via Polipo over TOR?

Reposted. Author: 太空狗. Updated: 2023-10-29 21:11:14

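I am not entirely sure what the problem is here.

I am running Python 2.7.3 and Scrapy 0.16.5.

I created a very simple Scrapy spider to test connecting to my local Polipo proxy, so that I can send requests out over TOR. The basic code of my spider is as follows: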

from scrapy.spider import BaseSpider


class TorSpider(BaseSpider):
    name = "tor"
    allowed_domains = ["check.torproject.org"]
    start_urls = [
        "https://check.torproject.org"
    ]

    def parse(self, response):
        # Python 2 print statement (the question targets Python 2.7.3)
        print response.body

For my proxy middleware, I defined:

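from scrapy.conf import settings  # assumed import; the original snippet omits it

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route every outgoing request through the proxy named in settings
        request.meta['proxy'] = settings.get('HTTP_PROXY')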

In my settings file, HTTP_PROXY is defined as HTTP_PROXY = 'http://localhost:8123'
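For reference, the relevant part of such a settings file might look like the sketch below. The module path 'arachnid.middlewares.ProxyMiddleware' and the order value 410 are assumptions for illustration (the bot name arachnid is taken from the log output); only the HTTP_PROXY value comes from the question.

```python
# settings.py (sketch)

# Local Polipo proxy, as defined in the question
HTTP_PROXY = 'http://localhost:8123'

# Register the custom middleware with the downloader.
# Hypothetical module path and order value; adjust both to your project.
DOWNLOADER_MIDDLEWARES = {
    'arachnid.middlewares.ProxyMiddleware': 410,
}
```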

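Now, if I change the start URL to http://check.torproject.org, everything works without any problem.

If I try to run against https://check.torproject.org, I get a 400 Bad Request error every time (I have also tried other https:// sites, and they all have the same problem):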

2013-07-23 21:36:18+0100 [scrapy] INFO: Scrapy 0.16.5 started (bot: arachnid)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, ProxyMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled item pipelines:
2013-07-23 21:36:18+0100 [tor] INFO: Spider opened
2013-07-23 21:36:18+0100 [tor] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 1 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 2 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Gave up retrying <GET https://check.torproject.org> (failed 3 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Crawled (400) <GET https://check.torproject.org> (referer: None)
2013-07-23 21:36:18+0100 [tor] INFO: Closing spider (finished)
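As background, an HTTP proxy is normally addressed differently for the two schemes: a plain http:// URL is requested with an absolute-form GET, while an https:// URL needs a CONNECT tunnel to the host and port. The sketch below only builds those two request lines so the difference is visible. The helper name proxy_request_line is hypothetical, and this does not establish what caused the 400 in the log above.

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit  # Python 2


def proxy_request_line(url):
    """Return the first request line a client would send to an HTTP proxy."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        # HTTPS via an HTTP proxy: ask the proxy to open a raw tunnel
        port = parts.port or 443
        return "CONNECT %s:%d HTTP/1.1" % (parts.hostname, port)
    # Plain HTTP via a proxy: send the absolute URL in the request line
    target = urlunsplit((parts.scheme, parts.netloc, parts.path or "/",
                         parts.query, ""))
    return "GET %s HTTP/1.1" % target
```

Calling proxy_request_line("https://check.torproject.org") yields a CONNECT line for check.torproject.org:443, whereas the http:// variant yields a GET line carrying the full URL.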

To double-check that my TOR/Polipo setup is working, I ran the following curl command in a terminal, and it connected without any problem: curl --proxy localhost:8123 https://check.torproject.org/
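The same check can be sketched in Python with the standard library, which helps separate proxy problems from Scrapy problems. The function names make_opener and fetch are hypothetical, and the proxy address is the one from the question; nothing here is part of the original post.

```python
try:
    from urllib.request import ProxyHandler, build_opener  # Python 3
except ImportError:
    from urllib2 import ProxyHandler, build_opener  # Python 2

PROXY = "http://localhost:8123"  # local Polipo proxy from the question


def make_opener(proxy=PROXY):
    """Build an opener that sends both http and https traffic via the proxy."""
    return build_opener(ProxyHandler({"http": proxy, "https": proxy}))


def fetch(url, opener=None, timeout=30):
    """Fetch a URL through the proxy and return the raw response body."""
    opener = opener or make_opener()
    return opener.open(url, timeout=timeout).read()
```

With Polipo and TOR running, fetch("https://check.torproject.org/") should return the page body if the proxy chain is healthy.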

Any suggestions as to what is going wrong here?

Best answer

Try

rq.meta['proxy'] = 'http://127.0.0.1:8123'

This works in my case.
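Applied to the middleware from the question, the suggestion might look like the sketch below, which uses 127.0.0.1 instead of localhost. The proxy_url constructor argument and the FakeRequest stand-in are assumptions added so the snippet can be exercised without Scrapy installed; they are not part of the answer.

```python
class ProxyMiddleware(object):
    """Downloader middleware that routes every request through one proxy."""

    def __init__(self, proxy_url='http://127.0.0.1:8123'):
        # Address suggested in the answer: 127.0.0.1 rather than localhost
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy_url
        return None  # let the request continue through the middleware chain


class FakeRequest(object):
    """Hypothetical stand-in exposing only the attributes used above."""

    def __init__(self, url):
        self.url = url
        self.meta = {}
```

For a quick check, create FakeRequest("https://check.torproject.org"), pass it to ProxyMiddleware().process_request(request, None), and inspect request.meta['proxy'].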

Regarding "python - How do I connect to an https site with Scrapy via Polipo over TOR?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/17820824/
