
python - Scrapy: dynamically passing arguments from the command line to the pipeline


I am using Scrapy. I have a spider that begins with:

class For_Spider(Spider):

    name = "for"
    table = 'hello'  # creating dummy attribute. will be overwritten

    def start_requests(self):
        self.table = self.dc  # dc is passed in

And I have the following pipeline:

class DynamicSQLlitePipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Here, you get whatever value was passed through the "table" parameter
        table = getattr(crawler.spider, "table")
        return cls(table)

    def __init__(self, table):
        try:
            db_path = "sqlite:///" + settings.SETTINGS_PATH + "\\data.db"
            db = dataset.connect(db_path)
            table_name = table[0:3]  # FIRST 3 LETTERS
            self.my_table = db[table_name]
        except Exception:
            # the matching except clause is not shown in the question
            raise

When I start the spider with:

scrapy crawl for -a dc=input_string -a records=1
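(For context, a note that is not part of the original question: Scrapy delivers each -a key=value pair to the spider's __init__ as a keyword argument, and the base Spider.__init__ copies those keywords onto the instance, which is where self.dc and self.records come from. A minimal illustration, assuming standard Scrapy behaviour:)

import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"

# Instantiating the spider directly with keyword arguments mirrors what
# "scrapy crawl demo -a dc=input_string -a records=1" does internally.
spider = DemoSpider(dc="input_string", records="1")
print(spider.dc)       # input_string
print(spider.records)  # 1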

After running this repeatedly, and with the help of questions like What is the relationship between the crawler object with spider and pipeline objects?, the execution order appears to be:

1) For_Spider
2) DynamicSQLlitePipeline
3) start_requests

The spider's "table" attribute is passed to the DynamicSQLlitePipeline object through the from_crawler method, which has access to the different components of the Scrapy system. table is initialized to the dummy value 'hello' that I set. After steps 1 and 2 above, execution returns to the spider and start_requests begins. The command-line argument is only picked up in start_requests (where I copy self.dc into self.table), so by then it is too late to set the table name dynamically: the pipeline has already been instantiated.
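(As the edit below suggests, the -a arguments are in fact attached to the spider instance during __init__, which runs before the pipeline is created. A minimal sketch, not code from the original post, of copying the value early enough for from_crawler to see it:)

from scrapy import Spider

class For_Spider(Spider):
    name = "for"

    def __init__(self, dc=None, records=None, *args, **kwargs):
        super(For_Spider, self).__init__(*args, **kwargs)
        self.table = dc  # set before the pipeline's from_crawler runs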

So I don't know whether there is a way to set the pipeline's table name dynamically. How can I do this?

Edit:

elRuLL is right, and his solution works. But when I inspected the spider object in step 1, I did not find any of the arguments listed on the spider. Am I missing them?

>>> Spider.__dict__
mappingproxy({'__module__': 'scrapy.spiders', '__doc__': 'Base class for scrapy spiders. All spiders must inherit from this\n class.\n ', 'name': None, 'custom_settings': None, '__init__': <function Spider.__init__ at 0x00000000047A6D90>, 'logger': <property object at 0x0000000003E0E598>, 'log': <function Spider.log at 0x00000000047A6EA0>, 'from_crawler': <classmethod object at 0x0000000003B28278>, 'set_crawler': <function Spider.set_crawler at 0x00000000047C9048>, '_set_crawler': <function Spider._set_crawler at 0x00000000047C90D0>, 'start_requests': <function Spider.start_requests at 0x00000000047C9158>, 'make_requests_from_url': <function Spider.make_requests_from_url at 0x00000000047C91E0>, 'parse': <function Spider.parse at 0x00000000047C9268>, 'update_settings': <classmethod object at 0x0000000003912C88>, 'handles_request': <classmethod object at 0x0000000003E0B7F0>, 'close': <staticmethod object at 0x0000000004756BA8>, '__str__': <function Spider.__str__ at 0x00000000047C9488>, '__repr__': <function Spider.__str__ at 0x00000000047C9488>, '__dict__': <attribute '__dict__' of 'Spider' objects>, '__weakref__': <attribute '__weakref__' of 'Spider' objects>})
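(A likely explanation, added as a note: Spider.__dict__ above inspects the Spider class itself, while the -a arguments are stored on the spider instance, so they only show up in the instance's __dict__. A hypothetical session, reusing the spider from the question:)

>>> spider = For_Spider(dc='input_string', records='1')
>>> spider.dc
'input_string'
>>> 'dc' in spider.__dict__, 'dc' in Spider.__dict__
(True, False)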

Best Answer

The documentation has an example of how to create pipeline to write in MongoDB.

It uses def open_spider(self, spider): to open the database, and the spider variable gives you access to the spider, so you can get your value:

def open_spider(self, spider):
    table = spider.table

So it could be (similar to the code in the documentation):

class DynamicSQLlitePipeline(object):

    def open_spider(self, spider):
        table = spider.table

        try:
            db_path = "sqlite:///" + settings.SETTINGS_PATH + "\\data.db"
            self.db = dataset.connect(db_path)
            table_name = table[0:3]  # FIRST 3 LETTERS
            self.my_table = self.db[table_name]
            # ... rest ...
        except Exception:
            raise

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        # dataset tables use insert(); insert_one() is the PyMongo method
        # from the MongoDB example this code was adapted from
        self.my_table.insert(dict(item))
        return item
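(For completeness, and not part of the original answer: the pipeline also has to be enabled in the project's settings.py. The module path below is hypothetical and depends on your project layout.)

ITEM_PIPELINES = {
    'myproject.pipelines.DynamicSQLlitePipeline': 300,
}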

Regarding python - Scrapy: dynamically passing arguments from the command line to the pipeline, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47981620/
