Scrapy `ReactorNotRestartable`: one class to run two (or more) spiders


I am aggregating daily data with a two-stage crawl in Scrapy. The first stage generates a list of URLs from an index page, and the second stage writes the HTML of each URL in that list to a Kafka topic.

[Figure: Kafka cluster for the Scrapy crawler]
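For reference, the Kafka hand-off in the second stage can live in a Scrapy item pipeline. Below is a minimal sketch, assuming the kafka-python package is installed; the topic name "somesite.pages" and the item field "html" are hypothetical, not from the question:

from kafka import KafkaProducer

class KafkaHtmlPipeline:
    """Hypothetical pipeline: publish each crawled page's HTML to a Kafka topic."""

    def open_spider(self, spider):
        # Assumes a broker on localhost; adjust bootstrap_servers as needed.
        self.producer = KafkaProducer(bootstrap_servers="localhost:9092")

    def process_item(self, item, spider):
        # "somesite.pages" and item["html"] are illustrative names.
        self.producer.send("somesite.pages", item["html"].encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.producer.flush()
        self.producer.close()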

Although the two components of the crawl are related, I would like them to be independent: url_generator will run once a day as a scheduled task, while page_requester will run continuously, processing URLs as they become available. To be "polite", I will tune DOWNLOAD_DELAY so that the crawler comfortably finishes within 24 hours while keeping the load on the site minimal.
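For illustration, the politeness tuning happens in the project settings. A minimal sketch, assuming a hypothetical daily budget of about 40,000 URLs (86,400 s / 40,000 requests ≈ 2 s per request):

# scrapy_somesite/settings.py (excerpt) -- the numbers are assumptions
DOWNLOAD_DELAY = 2                   # ~2 s between requests spreads ~40k URLs over ~22 h
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time, strictly sequential
RANDOMIZE_DOWNLOAD_DELAY = True      # jitter the delay between 0.5x and 1.5x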

I created a CrawlerRunner class with functions that generate the URLs and retrieve the HTML:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy_somesite.spiders.create_urls_spider import CreateSomeSiteUrlList
from scrapy_somesite.spiders.crawl_urls_spider import SomeSiteRetrievePages
from scrapy.utils.project import get_project_settings
import os
import sys

class CrawlerRunner:

    def __init__(self):
        sys.path.append(os.path.join(os.path.curdir, "crawl/somesite"))
        os.environ['SCRAPY_SETTINGS_MODULE'] = 'scrapy_somesite.settings'
        self.settings = get_project_settings()
        log.start()

    def create_urls(self):
        spider = CreateSomeSiteUrlList()
        crawler_create_urls = Crawler(self.settings)
        crawler_create_urls.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler_create_urls.configure()
        crawler_create_urls.crawl(spider)
        crawler_create_urls.start()
        reactor.run()  # blocks until the spider_closed signal stops the reactor

    def crawl_urls(self):
        spider = SomeSiteRetrievePages()
        crawler_crawl_urls = Crawler(self.settings)
        crawler_crawl_urls.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler_crawl_urls.configure()
        crawler_crawl_urls.crawl(spider)
        crawler_crawl_urls.start()
        reactor.run()  # fails: the reactor was already stopped in create_urls()

When I instantiate the class, I can successfully run either function on its own, but unfortunately I cannot run both of them:
from crawl.somesite import crawler_runner

cr = crawler_runner.CrawlerRunner()

cr.create_urls()
cr.crawl_urls()

The second function call raises twisted.internet.error.ReactorNotRestartable when it reaches reactor.run() inside crawl_urls, because a Twisted reactor cannot be started again once it has been stopped.
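The restriction is easy to demonstrate outside Scrapy; a Twisted reactor is a process-wide singleton that can run at most once. A minimal sketch:

from twisted.internet import reactor

reactor.callWhenRunning(reactor.stop)  # stop the event loop as soon as it starts
reactor.run()   # first run: starts and then stops cleanly
reactor.run()   # raises twisted.internet.error.ReactorNotRestartable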

I would like to know whether there is a simple fix for this code (e.g., some way to run two separate Twisted reactors), or whether there is a better way to structure the project.

Best Answer

It is possible to run multiple spiders in a single reactor by keeping the reactor open until all of the spiders have stopped. This is done by keeping a list of all running spiders and not calling reactor.stop() until that list is empty:

import sys
import os
from scrapy.utils.project import get_project_settings
from scrapy_somesite.spiders.create_urls_spider import Spider1
from scrapy_somesite.spiders.crawl_urls_spider import Spider2

from scrapy import signals, log
from twisted.internet import reactor
from scrapy.crawler import Crawler

class CrawlRunner:

    def __init__(self):
        self.running_crawlers = []

    def spider_closing(self, spider):
        log.msg("Spider closed: %s" % spider, level=log.INFO)
        self.running_crawlers.remove(spider)
        if not self.running_crawlers:
            reactor.stop()  # only stop the reactor once the last spider is done

    def run(self):
        sys.path.append(os.path.join(os.path.curdir, "crawl/somesite"))
        os.environ['SCRAPY_SETTINGS_MODULE'] = 'scrapy_somesite.settings'
        settings = get_project_settings()
        log.start(loglevel=log.DEBUG)

        to_crawl = [Spider1, Spider2]

        for spider in to_crawl:
            crawler = Crawler(settings)
            crawler_obj = spider()
            self.running_crawlers.append(crawler_obj)

            # every spider reports back through the same spider_closed handler
            crawler.signals.connect(self.spider_closing, signal=signals.spider_closed)
            crawler.configure()
            crawler.crawl(crawler_obj)
            crawler.start()

        # a single reactor.run() drives all spiders concurrently
        reactor.run()

The class is executed with:
from crawl.somesite.crawl import CrawlRunner

cr = CrawlRunner()
cr.run()

This solution is based on a blog post by Kiran Koduru.
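As a side note, the answer above uses the pre-1.0 Scrapy API (Crawler(settings) plus configure()). On Scrapy 1.0 and later this bookkeeping is built in: scrapy.crawler.CrawlerProcess runs several spiders in a single reactor and stops it once they are all finished. A minimal sketch, reusing Spider1 and Spider2 from above:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(Spider1)   # schedule both spiders in the same reactor
process.crawl(Spider2)
process.start()          # blocks; the reactor stops when all spiders finish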

Regarding Scrapy `ReactorNotRestartable`: one class to run two (or more) spiders, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30970436/
