python - How to break out of a crawl in Scrapy when a certain condition is met

Reposted. Author: 太空宇宙. Updated: 2023-11-03 21:35:12

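For the application I'm developing, I need Scrapy to interrupt a crawl and start crawling again from a specific, arbitrary URL.

The expected behaviour is for Scrapy to go back to a specific URL, which could be supplied as an argument, whenever a particular condition is met.

I'm using a CrawlSpider, but I can't figure out how to achieve this: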

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MyCrawlSpider(CrawlSpider):
    name = 'mycrawlspider'
    initial_url = ""

    def __init__(self, initial_url, *args, **kwargs):
        self.initial_url = initial_url
        domain = "mydomain.com"
        self.start_urls = [initial_url]
        self.allowed_domains = [domain]
        self.rules = (
            Rule(LinkExtractor(allow=[r"^http[s]?://(www.)?" + domain + "/.*"]), callback='parse_item', follow=True),
        )

        super(MyCrawlSpider, self)._compile_rules()

    def parse_item(self, response):
        # some_condition is a placeholder for the real check
        if some_condition is True:
            # force scrapy to go back to home page and recrawl
            print("Should break out")
        else:
            print("Just carry on")

I tried putting

return scrapy.Request(self.initial_url, callback=self.parse_item)

in the branch where some_condition is True, but it had no effect. I'd really appreciate some help; I've already spent hours trying to solve this.
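
One plausible explanation, assuming Scrapy's default duplicate filter is active: the scheduler drops a request for a URL it has already scheduled unless the request is created with dont_filter=True, so the request back to initial_url may never be downloaded. The toy model below is not Scrapy code; ToyRequest and ToyScheduler are hypothetical names used only to illustrate that behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRequest:
    """Stand-in for a crawl request (hypothetical, not scrapy.Request)."""
    url: str
    dont_filter: bool = False


class ToyScheduler:
    """Toy model of a scheduler with a duplicate filter."""

    def __init__(self):
        self.seen = set()   # URLs that passed the duplicate check once
        self.queue = []     # requests waiting to be downloaded

    def enqueue(self, request):
        # Requests flagged dont_filter skip the duplicate check entirely.
        if not request.dont_filter:
            if request.url in self.seen:
                return False  # silently dropped as a duplicate
            self.seen.add(request.url)
        self.queue.append(request)
        return True


scheduler = ToyScheduler()
home = "http://mydomain.com/"

first = scheduler.enqueue(ToyRequest(home))                      # accepted
again = scheduler.enqueue(ToyRequest(home))                      # dropped
forced = scheduler.enqueue(ToyRequest(home, dont_filter=True))   # accepted
print(first, again, forced)  # True False True
```

In the spider itself this would correspond to returning scrapy.Request(self.initial_url, callback=self.parse_item, dont_filter=True). Note that this only re-queues the home page: requests that are already scheduled are still processed, so on its own it does not abandon the crawl in progress.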

Best answer

You can create a custom exception and handle it appropriately, something like this...

Feel free to edit this to use the proper CrawlSpider syntax.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class RestartException(Exception):
    pass


class MyCrawlSpider(CrawlSpider):
    name = 'mycrawlspider'
    initial_url = ""

    def __init__(self, initial_url, *args, **kwargs):
        self.initial_url = initial_url
        domain = "mydomain.com"
        self.start_urls = [initial_url]
        self.allowed_domains = [domain]
        self.rules = (
            Rule(LinkExtractor(allow=[r"^http[s]?://(www.)?" + domain + "/.*"]), callback='parse_item', follow=True),
        )

        super(MyCrawlSpider, self)._compile_rules()

    def parse_item(self, response):
        # some_condition is a placeholder for the real check
        if some_condition is True:
            print("Should break out")
            raise RestartException("We're restarting now")
        else:
            print("Just carry on")


siteName = "http://whatever.com"
crawler = MyCrawlSpider(siteName)
while True:
    try:
        # Placeholder: I don't know how you start this thing, but do that here.
        # run() is a stand-in, not a Scrapy API.
        crawler.run()
        break
    except RestartException as err:
        print(err.args)
        # placeholder attribute: stash whatever the next run needs
        crawler.something = err.args
        continue

print("I'm done!")
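
A caveat on the loop above: as far as I know, Scrapy catches exceptions raised inside callbacks and logs them as spider errors, so an exception raised in parse_item would not normally reach an outer try/except. The sketch below therefore shows only the retry-loop pattern itself, in plain Python. fake_crawl, crawl_with_restarts, restart_once_on_a and max_restarts are hypothetical names that are not part of the answer or of Scrapy, and the exception carries the URL to restart from as an added assumption.

```python
class RestartException(Exception):
    """Signals that the crawl should start over from a given URL."""

    def __init__(self, restart_url):
        super().__init__(restart_url)
        self.restart_url = restart_url


def fake_crawl(start_url, pages, should_restart):
    """Hypothetical stand-in for a crawl: visit pages in order from start_url."""
    visited = []
    for url in pages[pages.index(start_url):]:
        visited.append(url)
        if should_restart(url):
            # Ask the caller to restart from the first page (the "home page").
            raise RestartException(pages[0])
    return visited


def crawl_with_restarts(start_url, pages, should_restart, max_restarts=3):
    """Run fake_crawl, restarting from the URL carried by the exception."""
    restarts = 0
    while True:
        try:
            return fake_crawl(start_url, pages, should_restart), restarts
        except RestartException as err:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up rather than loop forever
            start_url = err.restart_url


pages = ["http://whatever.com/", "http://whatever.com/a", "http://whatever.com/b"]
triggered = []


def restart_once_on_a(url):
    # Condition that fires only the first time page "a" is seen.
    if url.endswith("/a") and not triggered:
        triggered.append(url)
        return True
    return False


visited, restarts = crawl_with_restarts(pages[1], pages, restart_once_on_a)
print(visited, restarts)
```

Here the crawl starts at page "a", hits the condition, and is restarted once from the home page, after which all three pages are visited. The max_restarts cap is an extra safeguard for a condition that keeps firing.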

Regarding "python - How to break out of a crawl in Scrapy when a certain condition is met", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53283663/
