
python - Shut down the Scrapy spider when there are no URLs left to crawl


I have a spider that fetches URLs from a Redis list.

When no URL is found, I want to shut the spider down gracefully. I tried raising the CloseSpider exception, but execution never seems to reach that point:

def start_requests(self):
    while True:
        item = json.loads(self.__pop_queue())
        if not item:
            raise CloseSpider("Closing spider because no more urls to crawl")
        try:
            yield scrapy.http.Request(item['product_url'], meta={'item': item})
        except ValueError:
            continue
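(__pop_queue() itself isn't shown here; for context, a minimal Redis-backed version of such a helper might look like the following sketch, assuming the redis-py client and an illustrative list key:)

import redis

# Hypothetical sketch of the __pop_queue() helper referenced above --
# the original implementation is not shown in the question. Assumes
# redis-py; lpop() returns the next entry, or None when the list is empty.
def __pop_queue(self):
    r = redis.StrictRedis(host='localhost', port=6379, db=0)
    return r.lpop('start_urls')  # illustrative key name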

Even though I raise the CloseSpider exception, I still get the following error:

root@355e42916706:/scrapper# scrapy crawl general -a country=my -a log=file
2017-07-17 12:05:13 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "/scrapper/scrapper/spiders/GeneralSpider.py", line 20, in start_requests
item = json.loads(self.__pop_queue())
File "/usr/local/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
TypeError: expected string or buffer

I also tried catching the TypeError in the same function, but that didn't work either.

Is there a recommended way to handle this?

Thanks

Best Answer

You need to check whether self.__pop_queue() actually returns something before feeding it to json.loads() (or catch the TypeError when calling it), something like this:

def start_requests(self):
    while True:
        item = self.__pop_queue()
        if not item:
            raise CloseSpider("Closing spider because no more urls to crawl")
        try:
            item = json.loads(item)
            yield scrapy.http.Request(item['product_url'], meta={'item': item})
        except (ValueError, TypeError):  # just in case the 'item' is not a string or buffer
            continue
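This works because, in the original version, json.loads() is called before the `if not item` check and outside the try block, so when the queue runs dry and __pop_queue() presumably returns None, the TypeError escapes before CloseSpider can ever be raised. A quick illustration of that failure mode (assuming None is the empty-queue return value):

import json

# Decoding None reproduces the exact error from the traceback above;
# in the original code it fires before the `if not item` check runs.
json.loads(None)
# Python 2.7: TypeError: expected string or buffer

Moving the decode inside the try block, as shown, lets an empty pop raise CloseSpider while a malformed payload is simply skipped.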

Regarding python - shutting down the Scrapy spider when there are no URLs left to crawl, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45143947/
