python - How to set crawler parameters from a scrapy spider

Reposted. Author: 太空宇宙. Updated: 2023-11-04 03:00:25

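I am trying to pass a database table parameter, set in a scrapy spider, to a pipeline object, as a follow-up to the question "How to pass parameter to a scrapy pipeline object". Based on the answer to that question, I have: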

from sqlalchemy import Column, Integer, MetaData, Table, Text, create_engine


class StackPipeline(object):  # class name is illustrative; the original snippet omits it

    @classmethod
    def from_crawler(cls, crawler):
        # Here, you get whatever value was passed through the "table" parameter
        settings = crawler.settings
        table = settings.get('table')

        # Instantiate the pipeline with your table
        return cls(table)

    def __init__(self, table):
        _engine = create_engine("sqlite:///data.db")
        _connection = _engine.connect()
        _metadata = MetaData()
        _stack_items = Table(table, _metadata,
                             Column("id", Integer, primary_key=True),
                             Column("detail_url", Text))
        _metadata.create_all(_engine)
        self.connection = _connection
        self.stack_items = _stack_items

My spider looks like:

class my_Spider(Spider):

    name = "my"

    def from_crawler(self, crawler, table='test'):
        pass

    def start_requests(self):
        .....

I added the from_crawler lines based on https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.from_crawler, but now I get:

  File "C:\ENVS\virtalenvs\contact\lib\site-packages\twisted\internet\defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "C:\ENVS\virtalenvs\contact\lib\site-packages\scrapy\crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "C:\ENVS\virtalenvs\contact\lib\site-packages\scrapy\crawler.py", line 71, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "C:\ENVS\virtalenvs\contact\lib\site-packages\scrapy\crawler.py", line 94, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
TypeError: unbound method from_crawler() must be called with My_Spider instance as first argument (got Crawler instance instead)

How can I get this to work?

Edit:

After changing it to a class method, I get:

exceptions.TypeError: __init__() takes exactly 1 argument (2 given)
2016-12-09 15:47:37 [twisted] CRITICAL:
Traceback (most recent call last):
  File "C:\ENVS\virtalenvs\contact\lib\site-packages\twisted\internet\defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "C:\ENVS\virtalenvs\contact\lib\site-packages\scrapy\crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "C:\ENVS\virtalenvs\contact\lib\site-packages\scrapy\crawler.py", line 72, in crawl
    self.engine = self._create_engine()
  File "C:\ENVS\virtalenvs\contact\lib\site-packages\scrapy\crawler.py", line 97, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "C:\ENVS\virtalenvs\contact\lib\site-packages\scrapy\core\engine.py", line 69, in __init__
    self.scraper = Scraper(crawler)
  File "C:\ENVS\virtalenvs\contact\lib\site-packages\scrapy\core\scraper.py", line 71, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "C:\ENVS\virtalenvs\contact\lib\site-packages\scrapy\middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "C:\ENVS\virtalenvs\contact\lib\site-packages\scrapy\middleware.py", line 36, in from_settings
    mw = mwcls.from_crawler(crawler)
  File "C:\ENVS\r2\my\my\pipelines.py", line 30, in from_crawler
    return cls(table_name)
TypeError: __init__() takes exactly 1 argument (2 given)

Best answer

To pass arguments to a running spider (when you call scrapy crawl myspider), you only need to specify them in the shell with the -a option:

scrapy crawl myspider -a arg1=value1

So if you have a spider class:

class MySpider(Spider):
    name = "myspider"

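The arg1 argument is passed to that spider instance and stored as an instance attribute, which means you can use it as self.arg1 anywhere in the class. Values given with -a always arrive as strings: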

from scrapy import Spider


class MySpider(Spider):

    name = "myspider"

    ...

    def some_callback_method(self, response):
        # arg1 was set from `-a arg1=value1`
        print(self.arg1)
        ...

There is no need to define from_crawler in the spider itself.
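As a rough illustration of why no from_crawler override is needed, the sketch below mimics how the base spider constructor turns keyword arguments into attributes. SketchSpider is a hypothetical stand-in written for this example so that it runs without Scrapy installed; it is not Scrapy's actual class, only a simplified model of its argument handling.

```python
class SketchSpider:
    """Hypothetical stand-in for scrapy.Spider, reduced to argument handling."""

    name = "myspider"

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        # Each `-a key=value` pair arrives as a keyword argument and is
        # stored as an instance attribute.
        self.__dict__.update(kwargs)

    def some_callback_method(self, response=None):
        # The argument is readable here without any from_crawler override.
        return self.arg1


# Roughly what `scrapy crawl myspider -a arg1=value1` leads to:
spider = SketchSpider(arg1="value1")
print(spider.some_callback_method())
```

Because command-line values are strings, anything numeric or boolean would need to be converted by the spider itself.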

The pipeline also receives the spider instance, and you are already using it there.
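One way to use that spider instance, sketched under assumptions: the pipeline reads the table name in its open_spider hook and falls back to a default when no -a table=... was given. TableNamePipeline, the "test" fallback, and the "table" key written into the item are all hypothetical choices made for this example. The spider is faked with SimpleNamespace so the snippet runs without Scrapy.

```python
from types import SimpleNamespace


class TableNamePipeline:
    """Sketch of a pipeline that takes its table name from the spider."""

    def open_spider(self, spider):
        # "table" is the attribute set by `-a table=...`;
        # "test" is an assumed fallback, not a Scrapy default.
        self.table_name = getattr(spider, "table", "test")

    def process_item(self, item, spider):
        # A real pipeline would insert the item into self.table_name here;
        # this sketch only records which table would be used.
        item["table"] = self.table_name
        return item


# Stand-in for the spider Scrapy would pass to the pipeline hooks.
fake_spider = SimpleNamespace(name="myspider", table="contacts")
pipeline = TableNamePipeline()
pipeline.open_spider(fake_spider)
result = pipeline.process_item({"detail_url": "https://example.com/1"}, fake_spider)
```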

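Update:

Right now your pipeline is not actually using a spider attribute; it reads a variable from the Scrapy settings. If you want to pass the table name as a spider argument (so that it can be given on the command line with -a), you have to change your pipeline to: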

...
    @classmethod
    def from_crawler(cls, crawler):
        table_name = getattr(crawler.spider, "table")
        return cls(table_name)
...
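The second traceback in the question ("__init__() takes exactly 1 argument (2 given)") suggests that the pipeline's __init__ did not accept the table name at that point, although the question does not show that version of the code. A minimal sketch of the two pieces fitting together follows. StackPipeline and the "test" fallback are hypothetical names for this example, the crawler is faked with SimpleNamespace so the snippet runs without Scrapy, and it assumes crawler.spider is already set when from_crawler runs, as the answer's code relies on.

```python
from types import SimpleNamespace


class StackPipeline:
    """Sketch: from_crawler and __init__ must agree on the arguments."""

    def __init__(self, table_name):
        # __init__ has to accept the value that from_crawler passes to cls(...).
        self.table_name = table_name

    @classmethod
    def from_crawler(cls, crawler):
        # Read the spider argument; "test" is an assumed fallback for runs
        # where `-a table=...` was not supplied.
        table_name = getattr(crawler.spider, "table", "test")
        return cls(table_name)


# Stand-in for the crawler Scrapy would pass in.
fake_crawler = SimpleNamespace(spider=SimpleNamespace(table="contacts"))
pipeline = StackPipeline.from_crawler(fake_crawler)
```

With a real project the equivalent run would be started with scrapy crawl myspider -a table=contacts.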

Regarding "python - How to set crawler parameters from a scrapy spider", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41066481/
