
python - Proxy pool system for Scrapy that temporarily stops using slow/timed-out proxies

Reposted · Author: 太空狗 · Updated: 2023-10-29 17:51:18

I have been looking around trying to find a decent proxy pooling system for Scrapy, but I can't find anything that does what I need/want.

I'm looking for a solution that provides:

Rotating proxies

  • They should switch between proxies at random, but never pick the same proxy twice in a row. (Scrapoxy has this)

Emulating known browsers

  • Emulate Chrome, Firefox, Internet Explorer, Edge, Safari, etc. (Scrapoxy has this)

Blacklisting slow proxies

  • If a proxy times out or is slow, it should be blacklisted according to a set of rules... (Scrapoxy only blacklists based on the number of instances/starts)

  • If a proxy is slow (takes over x time), it should be marked as Slow, a timestamp should be taken, and a counter should be increased.

  • If a proxy times out, it should be marked as Fail, a timestamp should be taken, and a counter should be increased.
  • If a proxy has no slows for 15 minutes after its last slow, the counter and timestamp should be zeroed and the proxy returns to a fresh state.
  • If a proxy has no fails for 30 minutes after its last fail, the counter and timestamp should be zeroed and the proxy returns to a fresh state.
  • If a proxy is slow 5 times in 1 hour, it should be removed from the pool for 1 hour.
  • If a proxy times out 5 times in 1 hour, it should be blacklisted for 1 hour.
  • If a proxy gets blocked twice in 3 hours, it should be blacklisted for 12 hours and marked as bad.
  • If a proxy gets marked as bad twice in 48 hours, it should notify me (email, Pushbullet... anything).

Does anyone know of any solution like this (the main feature being blacklisting slow/timed-out proxies)...

Best Answer

Since your rotation rules are very specific, you can write your own code. See the code below, which implements some of the rules (you will have to implement the rest):

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import time
import pexpect
from random import shuffle

# this function tests a single proxy
def test_proxy(ip, port, max_timeout=1):
    child = pexpect.spawn("telnet " + ip + " " + str(port))
    time_send_request = time.time()
    try:
        # max_timeout is in seconds
        i = child.expect(["Connected to", "Connection refused"], timeout=max_timeout)
    except pexpect.TIMEOUT:
        i = -1
    if i == 0:
        time_request_ok = time.time()
        return {"status": True, "time_to_answer": time_request_ok - time_send_request}
    else:
        return {"status": False, "time_to_answer": max_timeout}


# this function tests all the current proxies, updates their status and applies your custom rules
def update_proxy_list_status(proxy_list):
    for i in range(len(proxy_list)):
        print("testing proxy " + str(i) + " " + proxy_list[i]["ip"] + ":" + str(proxy_list[i]["port"]))
        proxy_status = test_proxy(proxy_list[i]["ip"], proxy_list[i]["port"])
        proxy_list[i]["status_ok"] = proxy_status["status"]

        print(proxy_status)

        # here is the place to apply your own rules and update the respective proxy dict

        #~ If a proxy is slow (takes over x time) it should be marked as Slow and a timestamp should be taken and a counter should be increased.
        #~ If a proxy times out it should be marked as Fail and a timestamp should be taken and a counter should be increased.
        #~ If a proxy has no slows for 15 minutes after receiving its last slow then the counter & timestamp should be zeroed and the proxy returns to a fresh state.
        #~ If a proxy has no fails for 30 minutes after receiving its last fail then the counter & timestamp should be zeroed and the proxy returns to a fresh state.
        #~ If a proxy is slow 5 times in 1 hour then it should be removed from the pool for 1 hour.
        #~ If a proxy times out 5 times in 1 hour then it should be blacklisted for 1 hour.
        #~ If a proxy gets blocked twice in 3 hours it should be blacklisted for 12 hours and marked as bad.
        #~ If a proxy gets marked as bad twice in 48 hours then it should notify me (email, Pushbullet... anything).

        if proxy_status["status"]:
            # modify the proxy dict with your own rules (adding timestamp, last check time, last up etc.)
            # ...
            pass
        else:
            # modify the proxy dict with your own rules (adding timestamp, last check time, last down etc.)
            # ...
            pass

    return proxy_list


# this function selects a good proxy and does the job
def main():

    # first populate a proxy list | I got these example proxies from http://free-proxy.cz/en/
    proxy_list = [
        {"ip": "167.99.2.12", "port": 8080},  # bad proxy
        {"ip": "167.99.2.17", "port": 8080},
        {"ip": "66.70.160.171", "port": 1080},
        {"ip": "192.99.220.151", "port": 8080},
        {"ip": "142.44.137.222", "port": 80}
        # [...]
    ]

    # this variable keeps track of the last used proxy (to avoid using the same one twice in a row)
    previous_proxy_ip = ""

    the_job = True
    while the_job:

        # here we update each proxy's status
        proxy_list = update_proxy_list_status(proxy_list)

        # we keep only the proxies considered ok
        good_proxy_list = [d for d in proxy_list if d["status_ok"]]

        # here you can shuffle the list
        shuffle(good_proxy_list)

        # select a proxy (not the same as the previous one)
        current_proxy = {}
        for proxy in good_proxy_list:
            if proxy["ip"] != previous_proxy_ip:
                previous_proxy_ip = proxy["ip"]
                current_proxy = proxy
                break

        # use the selected proxy to do the job
        print("the current proxy is: " + str(current_proxy))

        # UPDATE SCRAPY PROXY

        # DO THE SCRAPY JOB
        print("DO MY SCRAPY JOB with the current proxy settings")

        # wait a few seconds
        time.sleep(5)


main()
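For the counter/timestamp rules that the comments above leave unimplemented, here is a minimal sketch of the bookkeeping for the slow/fail/blacklist bullets, assuming one helper object per proxy. The `ProxyHealth` name and its structure are my own illustration, not part of the original answer, and the 3-hour "blocked" and 48-hour "bad" escalation rules are still left to you:

```python
class ProxyHealth:
    """Per-proxy state for the slow/fail rules (illustrative sketch)."""

    SLOW_RESET = 15 * 60   # no slows for 15 min -> back to a fresh state
    FAIL_RESET = 30 * 60   # no fails for 30 min -> back to a fresh state
    WINDOW = 60 * 60       # 5 slows/fails must fall within 1 hour
    BLACKLIST = 60 * 60    # removed/blacklisted for 1 hour

    def __init__(self):
        self.slow_times = []          # timestamps of recent slow responses
        self.fail_times = []          # timestamps of recent timeouts
        self.blacklisted_until = 0.0  # epoch time until which the proxy is out

    def _record(self, events, now, reset_after):
        # zero the counter if the last event is older than the reset delay
        if events and now - events[-1] > reset_after:
            events = []
        events.append(now)
        # keep only events inside the 1-hour window
        events = [t for t in events if now - t <= self.WINDOW]
        if len(events) >= 5:
            # 5 events in 1 hour -> pull the proxy for 1 hour
            self.blacklisted_until = now + self.BLACKLIST
            events = []
        return events

    def record_slow(self, now):
        self.slow_times = self._record(self.slow_times, now, self.SLOW_RESET)

    def record_fail(self, now):
        self.fail_times = self._record(self.fail_times, now, self.FAIL_RESET)

    def usable(self, now):
        return now >= self.blacklisted_until
```

You would keep one `ProxyHealth` in each proxy dict and call `record_slow`/`record_fail` from `update_proxy_list_status`, then filter the pool with `usable(time.time())` instead of only `status_ok`.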
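The `UPDATE SCRAPY PROXY` placeholder can be filled in with a small Scrapy downloader middleware that sets `request.meta["proxy"]`, the standard key read by Scrapy's built-in `HttpProxyMiddleware`. The class name, the `PROXY_POOL` setting, and `select_proxy` below are my own illustration, not from the original answer:

```python
import random

class RotatingProxyMiddleware:
    """Sketch of a downloader middleware that rotates over a proxy pool,
    never picking the same proxy twice in a row."""

    def __init__(self, proxies):
        self.proxies = proxies  # e.g. ["http://167.99.2.17:8080", ...]
        self.previous = None

    @classmethod
    def from_crawler(cls, crawler):
        # read the pool from settings.py, e.g. PROXY_POOL = ["http://...", ...]
        return cls(crawler.settings.getlist("PROXY_POOL"))

    def select_proxy(self):
        # pick at random among all proxies except the previously used one
        candidates = [p for p in self.proxies if p != self.previous]
        choice = random.choice(candidates or self.proxies)
        self.previous = choice
        return choice

    def process_request(self, request, spider):
        # HttpProxyMiddleware honours this meta key for the outgoing request
        request.meta["proxy"] = self.select_proxy()
```

Enable it via `DOWNLOADER_MIDDLEWARES` in `settings.py`; the good-proxy filtering from the loop above would feed the pool this middleware draws from.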

Regarding "python - Proxy pool system for Scrapy that temporarily stops using slow/timed-out proxies", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48910982/

Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号