
python - Scrapy: What is the right way to use start_requests()?


My spider is set up like this:

from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class CustomSpider(CrawlSpider):
    name = 'custombot'
    allowed_domains = ['www.domain.com']
    start_urls = ['http://www.domain.com/some-url']
    rules = (
        Rule(SgmlLinkExtractor(allow=r'.*?something/'), callback='do_stuff', follow=True),
    )

    def start_requests(self):
        # start_requests() must return an iterable of Requests
        return [Request('http://www.domain.com/some-other-url', callback=self.do_something_else)]

It goes to /some-other-url and not to /some-url. What is wrong here? Is it that the URLs in start_urls are the ones that get their links extracted and sent through the rules filter, whereas the URLs in start_requests go straight to the item parser and therefore never pass through the rules filter?

Best Answer

From the documentation for start_requests(): overriding start_requests() means that the URLs defined in start_urls are ignored.

This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, the make_requests_from_url() is used instead to create the Requests.
[...]
If you want to change the Requests used to start scraping a domain, this is the method to override.

If you only want to scrape /some-url, remove the start_requests() override. If you want to scrape both, also yield a Request for /some-url inside start_requests(), as in the sketch below.
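
Here is a minimal sketch of that second option, reusing the placeholder domain, URLs, and callback names from the question, along with the same pre-1.0 Scrapy imports that match SgmlLinkExtractor. The key point is that a Request with no explicit callback is handled by CrawlSpider's built-in parse(), which is what applies the rules; a Request with an explicit callback bypasses the rules entirely.

from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class CustomSpider(CrawlSpider):
    name = 'custombot'
    allowed_domains = ['www.domain.com']
    rules = (
        Rule(SgmlLinkExtractor(allow=r'.*?something/'), callback='do_stuff', follow=True),
    )

    def start_requests(self):
        # No callback: the response goes to CrawlSpider's built-in parse(),
        # so the rules (link extraction) run against /some-url.
        yield Request('http://www.domain.com/some-url')
        # Explicit callback: this response skips the rules and goes
        # straight to do_something_else().
        yield Request('http://www.domain.com/some-other-url',
                      callback=self.do_something_else)

    def do_stuff(self, response):
        pass  # rule callback from the question (placeholder)

    def do_something_else(self, response):
        pass  # direct callback from the question (placeholder)

This also explains the behavior the question describes: the original override set a callback on its only request, so nothing ever reached the rules and /some-url was never visited.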

Regarding "python - Scrapy: What is the right way to use start_requests()?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/21701249/
