
python - Running scrapy-splash with rotating proxies

Reposted · Author: 行者123 · Updated: 2023-12-05 07:37:43

I am trying to use Scrapy together with Splash and rotating proxies. Here is my settings.py:

ROBOTSTXT_OBEY = False
BOT_NAME = 'mybot'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'
LOG_LEVEL = 'INFO'
USER_AGENT = 'Mozilla/5.0'

# JSON file pretty formatting
FEED_EXPORT_INDENT = 4

# Suppress dataloss warning messages of scrapy downloader
DOWNLOAD_FAIL_ON_DATALOSS = False
DOWNLOAD_DELAY = 1.25

# Enable or disable spider middlewares
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Enable or disable downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

# Splash settings
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
SPLASH_URL = 'http://localhost:8050'

I am setting ROTATING_PROXY_LIST in my spider:

import re          # module-level imports: extract ip:port pairs
import requests    # fetch the public proxy list

proxy_list = re.findall(r'(\d*\.\d*\.\d*\.\d*\:\d*)\b',
                        requests.get("https://raw.githubusercontent.com/clarketm/proxy-list/master/proxy-list.txt").text)
custom_settings = {'ROTATING_PROXY_LIST': proxy_list}

I start Splash with docker run -p 8050:8050 scrapinghub/splash. Here is how the requests are started:

from scrapy_splash import SplashRequest  # import at module level

def start_requests(self):
    urls = ['http://example-com/page_1.html', 'http://example-com/page_1.html']
    for url in urls:
        yield SplashRequest(url,
                            self.parse_url,
                            headers={'User-Agent': self.user_agent},
                            args={'render_all': 1, 'wait': 0.5})

However, when I run the crawler, I do not see any requests going through Splash. How can I fix this?

Thanks.

Best Answer

I don't think scrapy-rotating-proxies can be used together with Splash. If you want to use a proxy with Splash, try this:

yield SplashRequest(
    'https://ipv4.icanhazip.com/',
    self.parse_response,
    endpoint='execute',
    args={
        'lua_source': self.lua_script,
        'http_method': 'POST',
        'timeout': 60,
        'proxy': 'http://use:pass@Ip:Port'
    },
    errback=self.errback_httpbin)
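
Note that self.lua_script and self.errback_httpbin are not shown in the answer; they are assumed to be attributes of the spider. A minimal sketch of what the Lua script could look like (hypothetical, just enough for the execute endpoint to load the page and return its rendered HTML):

# Hypothetical spider attribute; the actual script is not part of the original answer.
lua_script = """
function main(splash, args)
    -- args holds the values passed in SplashRequest's args, including the URL
    assert(splash:go(args.url))   -- navigate to the requested URL
    assert(splash:wait(0.5))      -- let the page render
    return splash:html()          -- hand the rendered HTML back to Scrapy
end
"""

The 'proxy' value passed in args is a standard Splash HTTP API argument, so Splash applies it to the rendering session without any extra Lua code.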

If you want to use scrapy-rotating-proxies for plain Scrapy requests alongside Splash requests, add another middleware that excludes the Splash requests from proxy rotation.

settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapping_tool.middlewares.ProxiesMiddleware': 400,
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

And the proxies middleware:

import scrapy


class ProxiesMiddleware(object):
    def __init__(self, settings):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # Disable the rotating proxy for anything that is not a FormRequest.
        if not isinstance(request, scrapy.http.request.form.FormRequest):
            request.meta['proxy'] = None
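
The middleware above disables the rotating proxy for every request that is not a FormRequest. If the goal is specifically to keep Splash requests out of the rotation, a variant (a sketch, not from the original answer, registered in DOWNLOADER_MIDDLEWARES the same way) could key off the 'splash' entry that scrapy-splash stores in request.meta; Splash requests would then only use the proxy passed via args as shown earlier:

class SplashAwareProxiesMiddleware:
    """Sketch: skip the rotating proxies for requests routed through Splash."""

    @classmethod
    def from_crawler(cls, crawler):
        return cls()

    def process_request(self, request, spider):
        # scrapy-splash keeps its arguments under request.meta['splash'], so its
        # presence marks a request that will be rendered by Splash.
        if 'splash' in request.meta:
            # An explicit (even None) 'proxy' key makes RotatingProxyMiddleware
            # leave the request alone.
            request.meta['proxy'] = None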

Regarding "python - Running scrapy-splash with rotating proxies", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48378106/
