
python - scrapy-splash: SplashRequest response object differs between invocation by scrapy crawl vs CrawlerProcess

Reposted · Author: 行者123 · Updated: 2023-12-01 08:13:40

I want to use scrapy-splash to obtain the html and a png screenshot of a target page, and I need to be able to invoke it programmatically. According to the Splash documentation, specifying

endpoint='render.json'

and passing the argument

'png': 1

should produce a response object ('scrapy_splash.response.SplashJsonResponse') with a .data attribute containing the decoded JSON data, including a png screenshot of the target page.
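As a minimal sketch of that contract, the dict below is a hypothetical stand-in for response.data; only the base64 encoding of the 'png' field is assumed here:

```python
import base64

# Hypothetical stand-in for the parsed JSON that Splash's render.json
# endpoint returns; the real response.data carries a base64-encoded
# screenshot under the 'png' key.
fake_data = {'png': base64.b64encode(b'\x89PNG\r\n\x1a\n...image bytes...').decode('ascii')}

# Decoding recovers the raw PNG bytes, ready to write to a file.
png_bytes = base64.b64decode(fake_data['png'])
print(png_bytes.startswith(b'\x89PNG'))  # PNG magic number
```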

When the spider (here named 'search') is invoked with

scrapy crawl search

the result is as expected: response.data['png'] contains the png data.

However, when it is invoked via scrapy's CrawlerProcess, a different response object is returned: 'scrapy.http.response.html.HtmlResponse'. This object has no .data attribute.

The code follows:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy_splash import SplashRequest
import base64

RUN_CRAWLERPROCESS = False

if RUN_CRAWLERPROCESS:
    from crochet import setup
    setup()

class SpiderSearch(scrapy.Spider):
    name = 'search'
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'

    def start_requests(self):
        urls = ['https://www.google.com/search?q=test', ]
        splash_args = {
            'html': 1,
            'png': 1,
            'width': 1920,
            'wait': 0.5,
            'render_all': 1,
        }
        for url in urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.json', args=splash_args)

    def parse(self, response):
        print(type(response))
        for result in response.xpath('//div[@class="r"]'):
            url = str(result.xpath('./a/@href').extract_first())
            yield {
                'url': url
            }

        png_bytes = base64.b64decode(response.data['png'])
        with open('google_results.png', 'wb') as f:
            f.write(png_bytes)

        splash_args = {
            'html': 1,
            'png': 1,
            'width': 1920,
            'wait': 2,
            'render_all': 1,
            'html5_media': 1,
        }
        # cue the subsequent url to be fetched (self.parse_result omitted here for brevity)
        yield SplashRequest(url=url, callback=self.parse_result, endpoint='render.json', args=splash_args)

if RUN_CRAWLERPROCESS:
    runner = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'})
    #d = runner.crawl(SpiderSearch)
    #d.addBoth(lambda _: reactor.stop())
    #reactor.run()
    runner.crawl(SpiderSearch)
    runner.start()

To reiterate: setting

RUN_CRAWLERPROCESS = False 

and invoking

scrapy crawl search

yields a response of type

class 'scrapy_splash.response.SplashJsonResponse'

but setting

RUN_CRAWLERPROCESS = True 

and running the script via CrawlerProcess yields a response of type

class 'scrapy.http.response.html.HtmlResponse'

(P.S. I had some trouble with ReactorNotRestartable, so I adopted crochet as described in this post, which seems to have resolved the problem. I admit I don't understand why, but assume it is unrelated...)

Any ideas on how to debug this?

Best Answer

If you run this code as a standalone script, the settings module will never be loaded, and your crawler will not know about the Splash middlewares (these are what add the .data attribute you reference in .parse).
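One way to make that failure mode explicit is to check for the .data attribute before using it and fail loudly. This is only a sketch: FakeHtmlResponse is a hypothetical stand-in for scrapy's HtmlResponse, and check_splash_response is not part of any library:

```python
class FakeHtmlResponse:
    """Stand-in for scrapy.http.response.html.HtmlResponse, which has no .data."""

def check_splash_response(response):
    # SplashJsonResponse exposes Splash's parsed JSON via .data; its
    # absence signals the Splash middlewares were never installed.
    if not hasattr(response, 'data'):
        raise RuntimeError(
            'Expected SplashJsonResponse, got %s: Splash middlewares '
            'are probably not configured.' % type(response).__name__)
    return response.data

try:
    check_splash_response(FakeHtmlResponse())
except RuntimeError as err:
    caught = str(err)
```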

You can load those settings in your script by calling get_project_settings and passing the result to your crawler:

from scrapy.utils.project import get_project_settings

# ...

project_settings = get_project_settings()

process = CrawlerProcess(project_settings)
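Alternatively, if the script does not live inside a Scrapy project at all, the scrapy-splash settings can be passed to CrawlerProcess directly. The middleware paths and priorities below follow the scrapy-splash README; the SPLASH_URL value is an assumption for a Splash instance running locally on its default port:

```python
# Settings that scrapy-splash normally expects in settings.py.
# SPLASH_URL assumes a local Splash container on port 8050 - adjust as needed.
splash_settings = {
    'SPLASH_URL': 'http://localhost:8050',
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    },
    'SPIDER_MIDDLEWARES': {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    },
    'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
}

# Usage in a standalone script:
# process = CrawlerProcess(splash_settings)
# process.crawl(SpiderSearch)
# process.start()
```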

Regarding python - scrapy-splash: SplashRequest response object differs between invocation by scrapy crawl vs CrawlerProcess, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55084220/
