
python - Can't use proxies one by one until there is a valid response


I've written a script in Python's scrapy to make a proxied request using any of the newly generated proxies produced by the get_proxies() method. I used the requests module to fetch the proxies so that I can reuse them in the script. However, the problem is that the proxy my script picks may not always be a good one, so sometimes it fails to fetch a valid response.

How can I let my script keep trying with different proxies until there is a valid response?

My script so far:

import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.http.request import Request
from scrapy.crawler import CrawlerProcess

class ProxySpider(scrapy.Spider):
    name = "sslproxies"
    check_url = "https://stackoverflow.com/questions/tagged/web-scraping"
    proxy_link = "https://www.sslproxies.org/"

    def start_requests(self):
        proxylist = self.get_proxies()
        random.shuffle(proxylist)
        proxy_ip_port = next(cycle(proxylist))
        print(proxy_ip_port)  # Check which proxy address was picked
        request = scrapy.Request(self.check_url, callback=self.parse,
                                 errback=self.errback_httpbin, dont_filter=True)
        request.meta['proxy'] = "http://{}".format(proxy_ip_port)
        yield request

    def get_proxies(self):
        response = requests.get(self.proxy_link)
        soup = BeautifulSoup(response.text, "lxml")
        proxy = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text])
                 for item in soup.select("table.table tbody tr") if "yes" in item.text]
        return proxy

    def parse(self, response):
        print(response.meta.get("proxy"))  # Compare with the proxy printed earlier to check they match

    def errback_httpbin(self, failure):
        print("Failure: " + str(failure))

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'DOWNLOAD_TIMEOUT': 5,
    })
    c.crawl(ProxySpider)
    c.start()

PS: I'm looking for a solution that follows the approach I've started with here.
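
For a quick standalone sanity check of the scraped proxies before the crawl starts, here is a minimal sketch using the requests module the script already imports (the helper name, test URL, and timeout are illustrative, not part of the original script):

import requests

def is_working(proxy_ip_port, test_url="https://httpbin.org/ip", timeout=5):
    # Illustrative helper: returns True if the proxy answers the test URL in time
    proxies = {"http": "http://" + proxy_ip_port,
               "https": "http://" + proxy_ip_port}
    try:
        requests.get(test_url, proxies=proxies, timeout=timeout)
        return True
    except requests.RequestException:
        return False

Filtering the result of get_proxies() through such a check before shuffling would already remove most dead addresses, though proxies can still die mid-crawl, which is what the accepted answer below handles.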

Best Answer

As you know, an HTTP response needs to pass through all middlewares before it can reach the spider's methods.

This means that only requests sent through valid proxies ever proceed to the spider's callback functions.

To use only valid proxies, we therefore need to check all the proxies first and then pick exclusively from those that passed the check.

When the proxy we chose earlier stops working, we mark it as invalid and select a new one from the remaining valid proxies inside the spider's errback.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http.request import Request

class ProxySpider(scrapy.Spider):
    name = "sslproxies"
    check_url = "https://stackoverflow.com/questions/tagged/web-scraping"
    proxy_link = "https://www.sslproxies.org/"
    current_proxy = ""
    proxies = {}

    def start_requests(self):
        yield Request(self.proxy_link, callback=self.parse_proxies)

    def parse_proxies(self, response):
        for row in response.css("table#proxylisttable tbody tr"):
            if "yes" in row.extract():
                td = row.css("td::text").extract()
                self.proxies["http://{}".format(td[0] + ":" + td[1])] = {"valid": False}

        for proxy in self.proxies.keys():
            yield Request(self.check_url, callback=self.parse, errback=self.errback_httpbin,
                          meta={"proxy": proxy,
                                "download_slot": proxy},
                          dont_filter=True)

    def parse(self, response):
        if "proxy" in response.request.meta.keys():
            # Reaching this parse method means we can mark the current proxy as valid
            self.proxies[response.request.meta["proxy"]]["valid"] = True
            print(response.meta.get("proxy"))
            if not self.current_proxy:
                # The scraper reaches this line on the first valid response
                self.current_proxy = response.request.meta["proxy"]
                # yield Request(next_url, callback=self.parse_next,
                #               meta={"proxy": self.current_proxy,
                #                     "download_slot": self.current_proxy})

    def errback_httpbin(self, failure):
        if "proxy" in failure.request.meta.keys():
            proxy = failure.request.meta["proxy"]
            if proxy == self.current_proxy:
                # The current proxy stopped working after we started using it,
                # so mark it as not valid...
                self.proxies[proxy]["valid"] = False
                for ip_port in self.proxies.keys():
                    # ...and choose a new valid proxy from self.proxies
                    if self.proxies[ip_port]["valid"]:
                        failure.request.meta["proxy"] = ip_port
                        failure.request.meta["download_slot"] = ip_port
                        self.current_proxy = ip_port
                        return failure.request
        print("Failure: " + str(failure))

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'COOKIES_ENABLED': False,
        'DOWNLOAD_TIMEOUT': 10,
        'DOWNLOAD_DELAY': 3,
    })
    c.crawl(ProxySpider)
    c.start()
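
Two design choices in this code are worth noting. Marking every proxy as invalid up front and flipping it to valid only when a response actually arrives means the spider never has to trust the scraped list. Setting the "download_slot" meta key to the proxy address makes Scrapy maintain a separate download slot, and therefore apply DOWNLOAD_DELAY, per proxy instead of per domain.

The same mark-and-retry idea can also be moved out of the spider into a downloader middleware, which keeps the callbacks free of proxy bookkeeping. Below is a minimal sketch, not part of the original answer; the class name, the PROXY_LIST setting, and the random rotation policy are illustrative assumptions:

import random

class RotatingProxyMiddleware:
    """Hypothetical middleware: give each request a proxy, and on a download
    error drop that proxy and reschedule the request with another one."""

    def __init__(self, proxies):
        self.proxies = set(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # "PROXY_LIST" is an assumed custom setting holding "http://ip:port" strings
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def _assign(self, request, proxy):
        request.meta["proxy"] = proxy
        request.meta["download_slot"] = proxy

    def process_request(self, request, spider):
        if "proxy" not in request.meta and self.proxies:
            self._assign(request, random.choice(list(self.proxies)))

    def process_exception(self, request, exception, spider):
        self.proxies.discard(request.meta.get("proxy"))  # drop the failing proxy
        if self.proxies:
            self._assign(request, random.choice(list(self.proxies)))
            request.dont_filter = True
            return request  # Scrapy reschedules the returned request

Enable such a middleware through DOWNLOADER_MIDDLEWARES in the project settings; the spider then needs no errback logic of its own.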

Regarding python - Can't use proxies one by one until there is a valid response, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54801031/
