
scrapy - Delaying requests in Scrapy

Reposted. Author: 行者123. Updated: 2023-12-04 09:41:59

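I want to scrape the same URL repeatedly, with a different delay each time. After researching the problem, it seems the appropriate solution is to put something like the following in parse:

    nextreq = scrapy.Request(url, dont_filter=True)
    d = defer.Deferred()
    delay = 1
    reactor.callLater(delay, d.callback, nextreq)
    yield d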

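However, I have not been able to get this to work. I get the error message ERROR: Spider must return Request, BaseItem, dict or None, got 'Deferred'
I am not familiar with Twisted, so I hope I am just missing something obvious.

Is there a better way to achieve my goal without fighting the framework too much?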

Best answer

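I finally found the answer in an old PR:

    import scrapy
    from twisted.internet import reactor

    def parse(self, response):
        req = scrapy.Request(...)
        delay = 0
        # Hand the request to the engine once `delay` seconds have elapsed.
        reactor.callLater(delay, self.crawler.engine.schedule, request=req, spider=self)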

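However, the spider may go idle and exit before the delayed request is scheduled. Based on the outdated middleware https://github.com/ArturGaspar/scrapy-delayed-requests , this can be worked around with:

    from scrapy import signals
    from scrapy.exceptions import DontCloseSpider

    class ImmortalSpiderMiddleware(object):

        @classmethod
        def from_crawler(cls, crawler):
            s = cls()
            # Call spider_idle whenever the spider runs out of pending requests.
            crawler.signals.connect(s.spider_idle, signal=signals.spider_idle)
            return s

        @classmethod
        def spider_idle(cls, spider):
            # Always raising DontCloseSpider means the spider is never closed for being idle.
            raise DontCloseSpider()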

The last option is to update the middleware by ArturGaspar, which results in:

    from weakref import WeakKeyDictionary

    from scrapy import signals
    from scrapy.exceptions import DontCloseSpider, IgnoreRequest
    from twisted.internet import reactor

    class DelayedRequestsMiddleware(object):
        # Number of delayed requests still waiting on a timer, per spider.
        requests = WeakKeyDictionary()

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
            return ext

        @classmethod
        def spider_idle(cls, spider):
            # Keep the spider open only while delayed requests are pending.
            if cls.requests.get(spider):
                spider.log("delayed requests pending, not closing spider")
                raise DontCloseSpider()

        def process_request(self, request, spider):
            delay = request.meta.pop('delay_request', None)
            if delay:
                self.requests.setdefault(spider, 0)
                self.requests[spider] += 1
                # Re-schedule a copy later and drop the current request.
                reactor.callLater(delay, self.schedule_request, request.copy(),
                                  spider)
                raise IgnoreRequest()

        def schedule_request(self, request, spider):
            spider.crawler.engine.schedule(request, spider)
            self.requests[spider] -= 1
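The bookkeeping that decides whether the spider stays open can be tried in isolation. The following is only a Scrapy-free sketch of that idea, not part of the answer: FakeSpider, pending, mark_delayed, mark_scheduled and should_stay_open are hypothetical names that stand in for the spider object, the requests mapping, and the three middleware methods.

```python
import gc
from weakref import WeakKeyDictionary


class FakeSpider:
    """Hypothetical stand-in for a spider; only used as a dictionary key."""


# Pending delayed-request count per spider, keyed weakly by the spider object.
pending = WeakKeyDictionary()


def mark_delayed(spider):
    # Mirrors process_request: one more request is waiting on a timer.
    pending.setdefault(spider, 0)
    pending[spider] += 1


def mark_scheduled(spider):
    # Mirrors schedule_request: the timer fired and the request was handed back.
    pending[spider] -= 1


def should_stay_open(spider):
    # Mirrors spider_idle: any non-zero count keeps the spider alive.
    return bool(pending.get(spider))


spider = FakeSpider()
mark_delayed(spider)
mark_delayed(spider)
stay_open_while_pending = should_stay_open(spider)
mark_scheduled(spider)
mark_scheduled(spider)
stay_open_after_drain = should_stay_open(spider)

# Dropping the last strong reference lets the weak mapping forget the spider.
del spider
gc.collect()
entries_left = len(pending)
```

Because the mapping holds its keys weakly, the counter does not keep a finished spider object alive, which is presumably why the middleware uses a WeakKeyDictionary rather than a plain dict.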

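The middleware can then be used in parse, for example (when requesting a URL that has already been crawled, also pass dont_filter=True as in the question):

    yield Request(..., meta={'delay_request': 5})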
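The answer does not show how the middleware is enabled. Since it implements process_request(request, spider), it would presumably be registered as a downloader middleware. The fragment below is a sketch under that assumption: the module path myproject.middlewares is hypothetical, and 543 is just an arbitrary priority.

```python
# settings.py (sketch): "myproject.middlewares" is a hypothetical module path;
# adjust it to wherever DelayedRequestsMiddleware actually lives.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.DelayedRequestsMiddleware": 543,
}
```

Note that crawler.engine.schedule is an internal Scrapy API, so its availability and signature may differ between Scrapy versions.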

Regarding scrapy - Delaying requests in Scrapy, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46698333/
