
python - Scrapy: getting start_urls from a database via a pipeline

Reposted · Author: 太空宇宙 · Updated: 2023-11-03 14:44:33

Unfortunately I don't have enough reputation to comment, so I have to ask this as a new question, referring to https://stackoverflow.com/questions/23105590/how-to-get-the-pipeline-object-in-scrapy-spider

I have many URLs in a database, and I want to use them as my spider's start_urls. So far, no big problem. However, I don't want the MySQL code inside the spider; I'd rather keep it in the pipeline, and that is where I run into trouble. If I try to hand the pipeline object over to my spider, as in the question referenced above, I only get an AttributeError with the message

'NoneType' object has no attribute 'getUrl'

I think the actual problem is that the function spider_opened is never called (I also inserted a print statement there whose output never appeared on the console). Does anyone know how to get hold of the pipeline object inside the spider?

MySpider.py

def __init__(self):
    self.pipe = None

def start_requests(self):
    url = self.pipe.getUrl()
    yield scrapy.Request(url, callback=self.parse)

Pipelines.py

@classmethod
def from_crawler(cls, crawler):
    pipeline = cls()
    crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
    return pipeline

def spider_opened(self, spider):
    spider.pipe = self

def getUrl(self):
    ...

Best Answer

Scrapy pipelines already provide the expected methods open_spider and close_spider.

Taken from the docs: https://doc.scrapy.org/en/latest/topics/item-pipeline.html#open_spider

open_spider(self, spider)
This method is called when the spider is opened.
Parameters: spider (Spider object) – the spider which was opened

close_spider(self, spider)
This method is called when the spider is closed.
Parameters: spider (Spider object) – the spider which was closed
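As a minimal sketch of what those hooks look like in practice (the class name, the urls_seen attribute, and the item shape are illustrative assumptions, not part of the original question):

```python
class UrlStorePipeline:
    """Minimal item pipeline sketch using the built-in lifecycle hooks."""

    def open_spider(self, spider):
        # Called once when the spider starts; open resources (e.g. a DB) here.
        self.urls_seen = []

    def close_spider(self, spider):
        # Called once when the spider finishes; release resources here.
        self.urls_seen = []

    def process_item(self, item, spider):
        # Called for every scraped item.
        self.urls_seen.append(item.get("url"))
        return item
```

Scrapy calls open_spider and close_spider automatically on every registered pipeline, so no manual signal wiring is needed.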

But your original question doesn't make much sense: why do you want to hand your spider a reference to the pipeline? That seems like a very bad idea.

What you should do instead is open the database and read the URLs in the spider itself:

from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.start_urls = spider.get_urls_from_db()
        return spider

    def get_urls_from_db(self):
        db = ...  # get db cursor here
        urls = ...  # use cursor to pop your urls
        return urls

Regarding python - Scrapy: getting start_urls from a database via a pipeline, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46339263/
