
python - Scrapy: stop a previous parse function based on a condition

Reposted. Author: 太空宇宙. Updated: 2023-11-04 05:37:07

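The spider I am currently developing has a very specific case. The first function, parse_posts_pages, iterates over all pages of a particular forum thread and calls the second function, parse_posts, for each page.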

import scrapy

def parse_posts_pages(self, response):
    thread_id = response.meta['thread_id']
    thread_link = response.meta['thread_link']
    thread_name = response.meta['thread_name']
    # Three numbers are expected; the 2nd is posts per page, the 3rd the total.
    stats = response.xpath('//*[@id="postpagestats_above"]/text()').re(r'(\d+)')
    pages = 1  # fall back to a single page when the stats cannot be parsed
    if len(stats) == 3:
        posts_per_page = int(stats[1])
        total_posts = int(stats[2])
        if posts_per_page > 0:
            post_mod = total_posts % posts_per_page
            pages = total_posts // posts_per_page  # integer division, range() needs an int
            if post_mod > 0: pages += 1

    for page in range(pages, 0, -1):
        cur_page = '' if page == 1 else '/page' + str(page)
        post_page_link = thread_link + cur_page
        # yield, not return: a return here would end the loop after the first request
        yield scrapy.Request(post_page_link, self.parse_posts, meta={'thread_id': thread_id, 'thread_name': thread_name})


def parse_posts(self, response):
    global maxPostIDByThread, executeFullSpider
    thread_id = response.meta['thread_id']
    thread_name = response.meta['thread_name']
    for post in response.xpath('//*[@id="posts"]/li'):
        post_id = post.xpath('@id').re(r'(\d.*)')[0]
        if not executeFullSpider and post_id in maxPostIDByThread:
            break  # <- I need this break to also cancel the for loop in parse_posts_pages
        ...

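The second function contains an if condition. When this condition is true, I need to break out of the current for loop and also out of the for loop in parse_posts_pages, because there is no need to continue paginating.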

Is there a way to stop the for loop in the first function from the second function?

Best answer

As described in the documentation, simply raise CloseSpider.

How can I instruct a spider to stop itself?

Raise the CloseSpider exception from a callback.

from scrapy.exceptions import CloseSpider

def parse_page(self, response):
    # response.body is bytes, so compare against a bytes literal
    if b'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')

http://doc.scrapy.org/en/latest/faq.html#how-can-i-instruct-a-spider-to-stop-itself
http://doc.scrapy.org/en/latest/topics/exceptions.html#scrapy.exceptions.CloseSpider

Note that requests that are still in progress (HTTP request sent, response not yet received) will still be parsed. No new request will be processed though.

https://stackoverflow.com/a/23895143/5041915
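The note above means that page requests already sent are still parsed after CloseSpider is raised, and CloseSpider ends the whole crawl rather than one thread's pagination. If only a single thread should stop paginating, one alternative (not part of the answer, just a sketch) is to request pages one at a time and let parse_posts decide whether to ask for the next one. The helper name `next_page_link` and the idea of carrying `thread_link` and `page` in `meta` are assumptions made for illustration.

```python
def next_page_link(thread_link, page):
    """Return the URL of the page to visit after `page`, or None when done.

    Pages are walked from the last one down to 1, as in parse_posts_pages;
    page 1 has no '/pageN' suffix.
    """
    previous = page - 1
    if previous < 1:
        return None
    return thread_link if previous == 1 else thread_link + '/page' + str(previous)


# Hypothetical use inside parse_posts (meta would also need 'thread_link' and 'page'):
#     if stop_condition_met:
#         return  # no further Request is yielded, so this thread's pagination ends
#     link = next_page_link(response.meta['thread_link'], response.meta['page'])
#     if link:
#         yield scrapy.Request(link, self.parse_posts, meta={...})
```

With this shape, parse_posts_pages would only yield the request for the last page, and each later page is requested only if the previous one did not hit the stop condition. The trade-off is that the pages of one thread are fetched sequentially instead of concurrently.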

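Update: I actually found something interesting when stopping the spider in the main function.

It can happen that a new, valid worker does not get time to start, because the exception is raised faster.

I recommend checking the condition in the callback function and raising the exception as early as possible.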
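Applied to the question's parse_posts, that advice could look roughly like the sketch below. The names `iter_new_posts`, `known_ids` and `full_run` are hypothetical stand-ins for the loop, maxPostIDByThread and executeFullSpider, and the reason string is made up. The fallback class exists only so the snippet runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import CloseSpider
except ImportError:
    # Stand-in with the same shape, only so the sketch runs without Scrapy.
    class CloseSpider(Exception):
        def __init__(self, reason="cancelled"):
            super().__init__()
            self.reason = reason


def iter_new_posts(post_ids, known_ids, full_run=False):
    """Yield post ids in order; raise CloseSpider at the first already-known id.

    Hypothetical helper: known_ids plays the role of maxPostIDByThread and
    full_run the role of executeFullSpider from the question.
    """
    for post_id in post_ids:
        if not full_run and post_id in known_ids:
            # Checked before any other work on this post, so the spider
            # is asked to stop as early as possible.
            raise CloseSpider('reached_known_post')
        yield post_id
```

Inside parse_posts, the loop body would then iterate over `iter_new_posts(ids, maxPostIDByThread, executeFullSpider)` instead of using `break`. Keep in mind that this stops the entire spider, not only the pagination of the current thread.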

Regarding "python - Scrapy: stop a previous parse function based on a condition", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35244392/
