
scrapy - Running a Scrapy spider in a Celery task


This is not working anymore; the Scrapy API has changed.

The documentation now offers a way to "Run Scrapy from a script", but I get a ReactorNotRestartable error.

My task:

from celery import Task

from twisted.internet import reactor

from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

from .spiders import MySpider


class MyTask(Task):
    def run(self, *args, **kwargs):
        spider = MySpider
        settings = get_project_settings()
        crawler = Crawler(settings)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()

        log.start()
        reactor.run()

Best answer

The Twisted reactor cannot be restarted. The way around this is to have the Celery task fork a new child process for each crawl you want to run, as suggested in the following post:

  • Running Scrapy spiders in a Celery task

  • This gets around the "the reactor cannot be restarted" issue by using the multiprocessing package. The problem is that this workaround is now obsolete with the latest Celery versions, because you then run into another issue where a daemon process cannot spawn child processes. So in order for the workaround to work, you need to go down in the Celery version.

    Yes, the Scrapy API has changed. But with minor modifications (import Crawler instead of CrawlerProcess), you can still get the workaround to work by going down in the Celery version.

    The Celery issue can be found here: Celery Issue #1709


    Here is my updated crawl script that works with the newer Celery versions, using billiard instead of multiprocessing:
    from scrapy.crawler import Crawler
    from scrapy.utils.project import get_project_settings
    from scrapy import signals
    from twisted.internet import reactor
    from billiard import Process

    from myspider import MySpider


    class UrlCrawlerScript(Process):
        def __init__(self, spider):
            Process.__init__(self)
            settings = get_project_settings()
            self.crawler = Crawler(settings)
            self.crawler.configure()
            # stop the reactor once the spider has closed
            self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
            self.spider = spider

        def run(self):
            self.crawler.crawl(self.spider)
            self.crawler.start()
            reactor.run()


    def run_spider(url):
        # each crawl runs in its own billiard child process, so the
        # Twisted reactor is started fresh every time
        spider = MySpider(url)
        crawler = UrlCrawlerScript(spider)
        crawler.start()
        crawler.join()

    Edit: By reading Celery issue #1709, they suggest using billiard instead of multiprocessing in order for the subprocess limitation to be lifted. In other words, we should try billiard and see if it works!


    Edit 2: Yes, by using billiard, my script works with the latest celery build! See my updated script.
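
    For completeness, here is a minimal sketch (not part of the original answer) of how a Celery task could call run_spider. It assumes a Celery app named app, a Redis broker URL, and that the script above lives in a module named crawl; adjust these names to your project.

    from celery import Celery

    # "crawl" is an assumed module name for the billiard-based script above
    from crawl import run_spider

    app = Celery('tasks', broker='redis://localhost:6379/0')  # broker URL is an assumption


    @app.task
    def crawl_url(url):
        # each task invocation runs the spider in its own billiard child
        # process, so the Twisted reactor is started fresh every time
        run_spider(url)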

    Regarding "scrapy - Running a Scrapy spider in a Celery task", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/22116493/
