python-3.x - Scrapy: passing custom_settings to a spider from a script using CrawlerProcess.crawl()


I am trying to invoke a spider programmatically from a script, but I am unable to override its settings through the constructor when using CrawlerProcess. Let me illustrate this with the default quotes spider, which scrapes quotes from the official scrapy site (the last code snippet of the official scrapy quotes example spider).

from scrapy import Spider, Request


class QuotesSpider(Spider):
    name = "quotes"

    def __init__(self, somestring, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.somestring = somestring
        self.custom_settings = kwargs

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Here is the script through which I try to run the quotes spider:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def main():
    proc = CrawlerProcess(get_project_settings())

    custom_settings_spider = {
        'FEED_URI': 'quotes.csv',
        'LOG_FILE': 'quotes.log',
    }
    proc.crawl('quotes', 'dummyinput', **custom_settings_spider)
    proc.start()

Best Answer

Scrapy Settings are a bit like Python dicts,
so you can update the settings object before passing it to CrawlerProcess:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def main():
    s = get_project_settings()
    s.update({
        'FEED_URI': 'quotes.csv',
        'LOG_FILE': 'quotes.log',
    })
    proc = CrawlerProcess(s)

    # the settings now travel with `s`, so they no longer need
    # to be passed as keyword arguments to crawl()
    proc.crawl('quotes', 'dummyinput')
    proc.start()
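
As an aside: assigning self.custom_settings inside __init__ (as in the question) happens too late, because Scrapy reads custom_settings from the spider class, via the update_settings() classmethod, before the spider is instantiated. If you only need per-spider settings, here is a minimal sketch of the class-attribute form (an illustration, not the accepted answer's method):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    # Scrapy reads this class attribute before __init__ runs,
    # which is why setting self.custom_settings in __init__ has no effect.
    # Note: LOG_FILE generally still needs to be set before CrawlerProcess
    # is created (as above), since logging is configured at process start.
    custom_settings = {
        'FEED_URI': 'quotes.csv',
    }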

Edit, following OP's comments:

Here is a variation using CrawlerRunner, with a new CrawlerRunner for each crawl, re-configuring logging at each iteration so that each run writes to a different file:
import logging
from twisted.internet import reactor, defer

import scrapy
from scrapy.crawler import CrawlerRunner
# _get_handler is a private Scrapy helper (note the leading underscore);
# it may change between Scrapy versions
from scrapy.utils.log import configure_logging, _get_handler
from scrapy.utils.project import get_project_settings


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        page = getattr(self, 'page', 1)
        yield scrapy.Request('http://quotes.toscrape.com/page/{}/'.format(page),
                             self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }


@defer.inlineCallbacks
def crawl():
    s = get_project_settings()
    for i in range(1, 4):
        s.update({
            'FEED_URI': 'quotes%03d.csv' % i,
            'LOG_FILE': 'quotes%03d.log' % i
        })

        # manually configure logging for LOG_FILE
        configure_logging(settings=s, install_root_handler=False)
        logging.root.setLevel(logging.NOTSET)
        handler = _get_handler(s)
        logging.root.addHandler(handler)

        runner = CrawlerRunner(s)
        yield runner.crawl(QuotesSpider, page=i)

        # reset root handler so the next iteration logs to a new file
        logging.root.removeHandler(handler)
    reactor.stop()


crawl()
reactor.run()  # the script will block here until the last crawl call is finished
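
Running this produces quotes001.csv/quotes001.log through quotes003.csv/quotes003.log, one pair per iteration. Note that in newer Scrapy versions (2.1+) FEED_URI is deprecated in favor of the FEEDS setting, so the snippet may need adjusting there.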

For this question on passing custom_settings from a script to a spider using CrawlerProcess.crawl() (python-3.x, scrapy), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42511814/
