
python - Scrapy raises ReactorNotRestartable when CrawlerProcess is run twice

Reposted. Author: 行者123. Updated: 2023-12-04 16:45:02

I have some code that looks like this:

from scrapy.crawler import CrawlerProcess

def run(spider_name, settings):
    runner = CrawlerProcess(settings)
    runner.crawl(spider_name)
    runner.start()
    return True

I have two py.test tests that each call run(), and when the second test executes I get the following error:
    runner.start()
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/scrapy/crawler.py:291: in start
    reactor.run(installSignalHandlers=False)  # blocking call
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/twisted/internet/base.py:1242: in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/twisted/internet/base.py:1222: in startRunning
    ReactorBase.startRunning(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <twisted.internet.selectreactor.SelectReactor object at 0x10fe21588>

    def startRunning(self):
        """
        Method called when reactor starts: do some initialization and fire
        startup events.

        Don't call this directly, call reactor.run() instead: it should take
        care of calling this.

        This method is somewhat misnamed. The reactor will not necessarily be
        in the running state by the time this method returns. The only
        guarantee is that it will be on its way to the running state.
        """
        if self._started:
            raise error.ReactorAlreadyRunning()
        if self._startedBefore:
>           raise error.ReactorNotRestartable()
E       twisted.internet.error.ReactorNotRestartable

I get that this reactor is already running, so I can't call runner.start() when the second test runs. But is there some way to reset its state between tests, so they are more isolated and can actually run one after another?
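The traceback makes the mechanism visible: the reactor keeps a permanent _startedBefore flag, so once it has run and stopped it can never be started again in the same process. A minimal stdlib model of that state machine (a toy illustration, not Twisted's actual code) behaves the same way:

```python
class MiniReactor:
    """Toy model of the two flags seen in the traceback above."""

    def __init__(self):
        self._started = False
        self._startedBefore = False

    def start(self):
        if self._started:
            raise RuntimeError("ReactorAlreadyRunning")
        if self._startedBefore:
            raise RuntimeError("ReactorNotRestartable")
        self._started = True

    def stop(self):
        self._started = False
        self._startedBefore = True  # permanent: a restart is impossible


reactor = MiniReactor()
reactor.start()
reactor.stop()       # first crawl finishes
try:
    reactor.start()  # second crawl in the same process
except RuntimeError as exc:
    print(exc)       # ReactorNotRestartable
```

Because _startedBefore is never cleared, no amount of "resetting" between tests helps as long as both crawls share one process.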

Best Answer

If you use CrawlerRunner instead of CrawlerProcess, in combination with pytest-twisted, you should be able to run your tests like this:

Install the Twisted integration for pytest: pip install pytest-twisted

from scrapy.crawler import CrawlerRunner

def _run_crawler(spider_cls, settings):
    """
    spider_cls: Scrapy Spider class
    settings: Scrapy settings
    returns: Twisted Deferred
    """
    runner = CrawlerRunner(settings)
    return runner.crawl(spider_cls)  # return Deferred


def test_scrapy_crawler():
    deferred = _run_crawler(MySpider, settings)

    @deferred.addCallback
    def _success(results):
        """
        After the crawler completes, this function will execute.
        Do your assertions in this function.
        """

    @deferred.addErrback
    def _error(failure):
        raise failure.value

    return deferred

Put plainly, _run_crawler() schedules a crawl in the Twisted reactor and fires callbacks when the crawl finishes. Those callbacks (_success() and _error()) are where you make your assertions. Finally, you must return the Deferred object from _run_crawler() so that the test waits until the crawl is complete. This part with the Deferred is essential and must be done in every test.

Here is an example of how to run multiple crawls and aggregate the results using gatherResults.
from twisted.internet import defer

def test_multiple_crawls():
    d1 = _run_crawler(Spider1, settings)
    d2 = _run_crawler(Spider2, settings)

    d_list = defer.gatherResults([d1, d2])

    @d_list.addCallback
    def _success(results):
        assert True

    @d_list.addErrback
    def _error(failure):
        assert False

    return d_list

I hope this helps; if not, please ask where you are getting stuck.
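As an aside, if you want to keep the original CrawlerProcess-based run(), a commonly used workaround (my own suggestion, not part of the answer above) is to launch each crawl in its own subprocess, so every run gets a brand-new reactor. A minimal sketch; crawl_once here is a hypothetical stand-in for the real crawl call:

```python
import multiprocessing

def _worker(queue, fn, args):
    # Child process: if fn starts a Twisted reactor, the reactor lives
    # and dies here, so the parent can call run_isolated() repeatedly.
    try:
        queue.put(("ok", fn(*args)))
    except Exception as exc:
        queue.put(("err", exc))

def run_isolated(fn, *args):
    """Run fn(*args) in a fresh subprocess and return its result."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_worker, args=(queue, fn, args))
    proc.start()
    status, value = queue.get()
    proc.join()
    if status == "err":
        raise value
    return value

def crawl_once(spider_name):
    # Hypothetical stand-in: in real use this would build a
    # CrawlerProcess(settings), schedule spider_name, and call start().
    return "crawled %s" % spider_name

if __name__ == "__main__":
    # Both calls succeed: each gets its own process and its own reactor.
    print(run_isolated(crawl_once, "spider1"))
    print(run_isolated(crawl_once, "spider2"))
```

The trade-off is process-spawn overhead per test, but it keeps the blocking runner.start() API unchanged.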

For "python - Scrapy raises ReactorNotRestartable when CrawlerProcess is run twice", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48913525/
