
python - How do I write a DownloadHandler for scrapy that makes requests through socksipy?

Reposted. Author: 太空狗. Updated: 2023-10-29 22:19:16

I'm trying to use scrapy over Tor. I've been trying to work out how to write a DownloadHandler for scrapy that connects through socksipy.

Scrapy's HTTP11DownloadHandler is here: https://github.com/scrapy/scrapy/blob/master/scrapy/core/downloader/handlers/http11.py

Here is an example of creating a custom download handler: https://github.com/scrapinghub/scrapyjs/blob/master/scrapyjs/dhandler.py

Here is the code for creating a SocksiPyConnection class: http://blog.databigbang.com/distributed-scraping-with-multiple-tor-circuits/

import httplib  # Python 2 module; renamed http.client in Python 3

import socks  # provided by SocksiPy


class SocksiPyConnection(httplib.HTTPConnection):
    def __init__(self, proxytype, proxyaddr, proxyport=None, rdns=True,
                 username=None, password=None, *args, **kwargs):
        # Saved in the order socksocket.setproxy() expects.
        self.proxyargs = (proxytype, proxyaddr, proxyport, rdns, username, password)
        httplib.HTTPConnection.__init__(self, *args, **kwargs)

    def connect(self):
        # Open the connection through the SOCKS proxy instead of directly.
        self.sock = socks.socksocket()
        self.sock.setproxy(*self.proxyargs)
        if isinstance(self.timeout, float):
            self.sock.settimeout(self.timeout)
        self.sock.connect((self.host, self.port))

Because of the complexity of the twisted reactor in the scrapy code, I can't figure out how to plug socksipy into it. Any thoughts?

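Please do not answer with privoxy-like alternatives or post answers saying "scrapy doesn't work with socks proxies". I know that, which is why I'm trying to write a custom downloader that makes requests using socksipy.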

Best answer

I was able to make this work using https://github.com/habnabit/txsocksx.

After running pip install txsocksx, I needed to replace scrapy's ScrapyAgent with txsocksx.http.SOCKS5Agent.

I simply copied the code of HTTP11DownloadHandler and ScrapyAgent from scrapy/core/downloader/handlers/http.py, subclassed them, and wrote this code:

# Import paths match the Scrapy layout linked in the question (handlers/http11.py);
# they may differ in other Scrapy versions.
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler, ScrapyAgent
from scrapy.core.downloader.webclient import _parse
from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint
from txsocksx.http import SOCKS5Agent


class TorProxyDownloadHandler(HTTP11DownloadHandler):

    def download_request(self, request, spider):
        """Return a deferred for the HTTP download"""
        agent = ScrapyTorAgent(contextFactory=self._contextFactory, pool=self._pool)
        return agent.download_request(request)


class ScrapyTorAgent(ScrapyAgent):
    def _get_agent(self, request, timeout):
        bindaddress = request.meta.get('bindaddress') or self._bindAddress
        proxy = request.meta.get('proxy')
        if proxy:
            _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
            scheme = _parse(request.url)[0]
            omitConnectTunnel = proxyParams.find('noconnect') >= 0
            if scheme == 'https' and not omitConnectTunnel:
                proxyConf = (proxyHost, proxyPort,
                             request.headers.get('Proxy-Authorization', None))
                return self._TunnelingAgent(reactor, proxyConf,
                    contextFactory=self._contextFactory, connectTimeout=timeout,
                    bindAddress=bindaddress, pool=self._pool)
            else:
                _, _, host, port, proxyParams = _parse(request.url)
                # Connect to the SOCKS proxy itself; SOCKS5Agent relays the request.
                proxyEndpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                    timeout=timeout, bindAddress=bindaddress)
                agent = SOCKS5Agent(reactor, proxyEndpoint=proxyEndpoint)
                return agent

        return self._Agent(reactor, contextFactory=self._contextFactory,
            connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)
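
The branch order in _get_agent decides which kind of agent a request gets, and it can be easier to follow in isolation. The following is only an illustrative sketch in Python 3: choose_agent_kind is a hypothetical helper that is not part of Scrapy or txsocksx, and it uses urllib.parse instead of Scrapy's private _parse, so the 'noconnect' check is an approximation of the original.

```python
from urllib.parse import urlparse


def choose_agent_kind(request_url, proxy_url=None):
    """Mirror the branch order of ScrapyTorAgent._get_agent (illustration only).

    Returns "direct", "tunnel" or "socks5".
    """
    if not proxy_url:
        # No request.meta['proxy']: the stock Agent is used.
        return "direct"
    parsed_proxy = urlparse(proxy_url)
    # Approximation: look for 'noconnect' in everything after the host:port part.
    rest = ";".join([parsed_proxy.path, parsed_proxy.params, parsed_proxy.query])
    omit_connect_tunnel = "noconnect" in rest
    scheme = urlparse(request_url).scheme
    if scheme == "https" and not omit_connect_tunnel:
        # HTTPS with a proxy set: the HTTP CONNECT tunneling agent is used.
        return "tunnel"
    # Everything else with a proxy set goes through SOCKS5Agent.
    return "socks5"


print(choose_agent_kind("http://example.com/", "http://127.0.0.1:9050"))
```

Read this way, the code as written sends https requests with a proxy set through _TunnelingAgent rather than SOCKS5Agent, unless the proxy URL carries 'noconnect'.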

In settings.py, something like this is needed:

DOWNLOAD_HANDLERS = {
    'http': 'crawler.http.TorProxyDownloadHandler'
}

Proxying Scrapy through a socks proxy such as Tor now works.

Regarding "python - How do I write a DownloadHandler for scrapy that makes requests through socksipy?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21839676/
