
python - Scrapy Nameko DependencyProvider does not crawl pages


I am using Scrapy to build a sample web crawler as a Nameko dependency provider, but it does not crawl any pages. Here is the code:

import scrapy
from scrapy import crawler
from nameko import extensions
from twisted.internet import reactor


class TestSpider(scrapy.Spider):
    name = 'test_spider'
    result = None

    def parse(self, response):
        TestSpider.result = {
            'heading': response.css('h1::text').extract_first()
        }


class ScrapyDependency(extensions.DependencyProvider):

    def get_dependency(self, worker_ctx):
        return self

    def crawl(self, spider=None):
        spider = TestSpider()
        spider.name = 'test_spider'
        spider.start_urls = ['http://www.example.com']
        self.runner = crawler.CrawlerRunner()
        self.runner.crawl(spider)
        d = self.runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
        return spider.result

    def run(self):
        if not reactor.running:
            reactor.run()
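For context, the provider is consumed from a Nameko service in the usual way, as a class attribute; here is a minimal sketch, with an illustrative service class and RPC method:

from nameko.rpc import rpc


class CrawlerService:
    name = 'crawler_service'

    scrapy_dep = ScrapyDependency()

    @rpc
    def get_heading(self):
        # get_dependency() returns the provider itself, so the worker
        # triggers the crawl directly through it.
        return self.scrapy_dep.crawl()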

Here is the log from running the spider through the dependency provider:

Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
Enabled item pipelines:
[]
Spider opened
Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Closing spider (finished)
Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 3, 12, 41, 40, 126088),
'log_count/INFO': 7,
'memusage/max': 59650048,
'memusage/startup': 59650048,
'start_time': datetime.datetime(2017, 9, 3, 12, 41, 40, 97747)}
Spider closed (finished)

As the log shows, it did not crawl a single page, although it was expected to crawl one.

However, if I create a regular CrawlerRunner and crawl the page, I get the expected result {'heading': 'Example Domain'}. Here is the code:

import scrapy
from scrapy import crawler
from twisted.internet import reactor


class TestSpider(scrapy.Spider):
    name = 'test_spider'
    start_urls = ['http://www.example.com']
    result = None

    def parse(self, response):
        TestSpider.result = {'heading': response.css('h1::text').extract_first()}


def crawl():
    runner = crawler.CrawlerRunner()
    runner.crawl(TestSpider)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.run()


if __name__ == '__main__':
    crawl()

This problem has had me stuck for days, and I cannot figure out why the Scrapy crawler fails to crawl pages when used as a Nameko dependency provider. Please point out where I went wrong.

Best Answer

Tarun's comment is correct. Nameko uses Eventlet for concurrency, while Scrapy uses Twisted. Both work in a similar way: a single main thread (the reactor, in Twisted's case) schedules all the other work, as an alternative to the standard Python thread scheduler. Unfortunately, the two systems do not interoperate.

If you really want to integrate Nameko and Scrapy, your best bet is to run Scrapy in a separate process, as suggested in the answers to similar questions on Stack Overflow.
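Here is a minimal sketch of that separate-process approach, using multiprocessing and a queue to hand the result back. The helper names are illustrative, and a production service would want to wait on the child in an Eventlet-friendly way (e.g. via eventlet.tpool) rather than blocking the worker:

import multiprocessing

from scrapy.crawler import CrawlerProcess


def _run_spider(queue):
    # A fresh process gets its own Twisted reactor, so it cannot
    # clash with the Eventlet loop Nameko runs in the parent.
    result = {}

    class _Spider(TestSpider):
        start_urls = ['http://www.example.com']

        def parse(self, response):
            result['heading'] = response.css('h1::text').extract_first()

    process = CrawlerProcess()
    process.crawl(_Spider)
    process.start()  # blocks the child process until the crawl finishes
    queue.put(result)


class ScrapyDependency(extensions.DependencyProvider):

    def get_dependency(self, worker_ctx):
        return self

    def crawl(self):
        queue = multiprocessing.Queue()
        proc = multiprocessing.Process(target=_run_spider, args=(queue,))
        proc.start()
        result = queue.get()  # wait for the child to publish its result
        proc.join()
        return result

Because the child process starts its own reactor via CrawlerProcess, it never touches the event loop in the parent, which is exactly what the incompatibility above requires.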

For python - Scrapy Nameko DependencyProvider does not crawl pages, see the original question on Stack Overflow: https://stackoverflow.com/questions/46023741/
