
python - Is it possible to run another spider from a Scrapy spider?


Right now I have 2 spiders, and what I want to do is:

  1. Spider 1 goes to url1, and if url2 appears, call spider 2 with url2. Also save the content of url1 by using a pipeline.
  2. Spider 2 goes to url2 and does something.

Because of the complexity of both spiders, I would like to keep them separate.

Here is what I tried while running with scrapy crawl:

def parse(self, response):
    p = multiprocessing.Process(
        target=self.testfunc)  # pass the function itself, don't call it
    p.start()
    p.join()

def testfunc(self):
    settings = get_project_settings()
    crawler = CrawlerRunner(settings)
    crawler.crawl(<spidername>, <arguments>)

It loads the settings but does not crawl:

2015-08-24 14:13:32 [scrapy] INFO: Enabled extensions: CloseSpider, LogStats, CoreStats, SpiderState
2015-08-24 14:13:32 [scrapy] INFO: Enabled downloader middlewares: DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, HttpAuthMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-24 14:13:32 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-24 14:13:32 [scrapy] INFO: Spider opened
2015-08-24 14:13:32 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

The documentation has an example of launching a crawl from a script, but what I want to do is launch another spider while using the scrapy crawl command.
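For reference, the documentation pattern I mean looks roughly like this (a sketch with a placeholder SomeSpider, not code from my project): the crawl is driven from a standalone script with CrawlerRunner and an explicit Twisted reactor, instead of by the scrapy crawl command.

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor


class SomeSpider(scrapy.Spider):
    # Placeholder spider just to make the sketch self-contained.
    name = "some_spider"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        self.logger.info("visited %s", response.url)


configure_logging()
runner = CrawlerRunner(get_project_settings())
d = runner.crawl(SomeSpider)           # schedule the crawl, returns a Deferred
d.addBoth(lambda _: reactor.stop())    # stop the reactor once the crawl finishes
reactor.run()                          # blocks until the crawl is done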

EDIT: full code

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from multiprocessing import Process
import scrapy
import os


def info(title):
    # Print which process we are in, to trace the fork.
    print(title)
    print('module name:', __name__)
    if hasattr(os, 'getppid'):  # only available on Unix
        print('parent process:', os.getppid())
    print('process id:', os.getpid())


class TestSpider1(scrapy.Spider):
    name = "test1"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        info('parse')
        a = MyClass()
        a.start_work()


class MyClass(object):

    def start_work(self):
        info('start_work')
        # Run the second crawl in a child process.
        p = Process(target=self.do_work)
        p.start()
        p.join()

    def do_work(self):
        info('do_work')
        settings = get_project_settings()
        runner = CrawlerRunner(settings)
        runner.crawl(TestSpider2)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
        return


class TestSpider2(scrapy.Spider):

    name = "test2"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        info('testspider2')
        return

What I'm hoping for is something like this:

  1. scrapy crawl test1 (and, for example, when response.status_code is 200:)
  2. inside test1, call scrapy crawl test2

Best Answer

I won't go into much depth since this question is really old, but I'll go ahead and drop in this snippet from the official Scrapy docs... you were so close! lol

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished

https://doc.scrapy.org/en/latest/topics/practices.html

Then, using callbacks, you can pass items between your spiders to implement the logic you're talking about.
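For example, one way to do the handoff described in the question (just a sketch; the spider names, the item_scraped signal handler and the start_url argument are choices I'm making here, not the only option) is to run the spiders sequentially with CrawlerRunner: let the first spider yield the url2 values it finds as items, collect them via a signal, and then start the second spider with each of them as an argument.

import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import defer, reactor


class MySpider1(scrapy.Spider):
    name = "spider1"
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # Yield every candidate url2 found on url1 as a plain item;
        # a pipeline can still save url1's content as usual.
        for href in response.css('a::attr(href)').getall():
            yield {'url2': response.urljoin(href)}


class MySpider2(scrapy.Spider):
    name = "spider2"

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url] if start_url else []

    def parse(self, response):
        self.logger.info("spider2 visited %s", response.url)


found_urls = []


def collect_url(item, response, spider):
    # item_scraped handler: remember each url2 that spider1 produced.
    found_urls.append(item['url2'])


configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl_sequentially():
    crawler = runner.create_crawler(MySpider1)
    crawler.signals.connect(collect_url, signal=signals.item_scraped)
    yield runner.crawl(crawler)                  # run spider1 to completion
    for url in found_urls:                       # then spider2 once per url2
        yield runner.crawl(MySpider2, start_url=url)
    reactor.stop()


crawl_sequentially()
reactor.run()

If the second spider's input is already known up front, the CrawlerProcess snippet above is enough on its own; the runner/deferred version only matters when spider2's arguments have to come out of spider1's run.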

Regarding python - Is it possible to run another spider from a Scrapy spider?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32176005/
