
python - CrawlerProcess vs CrawlerRunner

Reprinted. Author: IT老高. Updated: 2023-10-28 20:53:16

The Scrapy 1.x documentation explains that there are two ways to run Scrapy spiders from a script: CrawlerProcess and CrawlerRunner.

What is the difference between the two? When should I use "process" and when should I use "runner"?

Best Answer

Scrapy's documentation does a pretty poor job of giving real-world usage examples of either.

CrawlerProcess assumes that Scrapy is the only thing that will use Twisted's reactor. If you are using threads in Python to run other code, this isn't always true. Let's use this as an example.

from scrapy.crawler import CrawlerProcess
import scrapy

def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
notThreadSafe(3)  # it will get executed when the crawlers stop

Now, as you can see, the function is only executed after the crawlers stop. What if I want the function to run while the crawlers are crawling in the same reactor?

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
import scrapy

def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.callFromThread(notThreadSafe, 3)
reactor.run()  # it will run both crawlers and the code inside the function

The Runner class is not limited to this functionality; you may want some custom setup on your reactor (delays, threads, getPage, custom error reporting, etc.).

Regarding python - CrawlerProcess vs CrawlerRunner, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39706005/
