
python - Scrapy CrawlSpider won't quit


I have a problem with a scrapy CrawlSpider: basically, it doesn't quit when a CloseSpider exception is raised, as it should. Below is the code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.exceptions import CloseSpider
from scrapy.linkextractors import LinkExtractor
import re

class RecursiveSpider(CrawlSpider):

    name = 'recursive_spider'
    start_urls = ['https://www.webiste.com/']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    miss = 0
    hits = 0

    def quit(self):
        print("ABOUT TO QUIT")
        raise CloseSpider('limits_exceeded')

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        item['body'] = '\n'.join(response.xpath('//text()').extract())
        try:
            match = re.search(r"[A-za-z]{0,1}edical[a-z]{2}", response.body_as_unicode()).group(0)
        except:
            match = 'NOTHING'

        print("\n")
        print("\n")
        print("\n")
        print("****************************************INFO****************************************")
        if "string" in item['url']:
            print(item['url'])
            print(match)
            print(self.hits)
            self.hits += 10
            if self.hits > 10:
                print("HITS EXCEEDED")
                self.quit()
        else:
            self.miss += 1
            print(self.miss)
            if self.miss > 10:
                print("MISS EXCEEDED")
                self.quit()
        print("\n")
        print("\n")
        print("\n")

The problem is that although I can see it entering the condition, and I can see the exception being raised in the log, the crawler keeps crawling. I run it with:

scrapy crawl recursive_spider

Best Answer

My guess is that this is a case of scrapy simply taking too long to shut down, rather than actually ignoring the exception. The engine won't quit until it has worked through all of its scheduled/sent requests, so I'd suggest lowering the values of the CONCURRENT_REQUESTS / CONCURRENT_REQUESTS_PER_DOMAIN settings and seeing whether that works for you.
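As a rough sketch of that suggestion, those two settings can be lowered per-spider through the custom_settings attribute. The values of 1 below are an assumption, not a recommendation from the answer; tune them to your crawl, and keep the counting/CloseSpider logic from the question inside parse_item:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class RecursiveSpider(CrawlSpider):

    name = 'recursive_spider'
    start_urls = ['https://www.webiste.com/']

    # With fewer requests in flight, the engine has less already-scheduled
    # work to drain after CloseSpider is raised, so shutdown comes sooner.
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    }

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # ... same counting / CloseSpider logic as in the question ...
        yield {'url': response.url}

The same settings can also be overridden for a single run from the command line, e.g. scrapy crawl recursive_spider -s CONCURRENT_REQUESTS=1 -s CONCURRENT_REQUESTS_PER_DOMAIN=1.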

Regarding "python - Scrapy CrawlSpider won't quit", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51658531/
