
python - Scrapy not crawling all pages

Reposted. Author: 太空狗. Updated: 2023-10-30 02:13:29

I am trying to crawl a website in a very basic way, but Scrapy is not crawling all of the links. The scenario is as follows -

main_page.html -> contains links to a_page.html, b_page.html, c_page.html
a_page.html -> contains links to a1_page.html, a2_page.html
b_page.html -> contains links to b1_page.html, b2_page.html
c_page.html -> contains links to c1_page.html, c2_page.html
a1_page.html -> contains a link to b_page.html
a2_page.html -> contains a link to c_page.html
b1_page.html -> contains a link to a_page.html
b2_page.html -> contains a link to c_page.html
c1_page.html -> contains a link to a_page.html
c2_page.html -> contains a link to main_page.html

I am using the following rule in my CrawlSpider -

Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)

But the crawl results are as follows -

DEBUG: Crawled (200) <GET http://localhost/main_page.html> (referer: None)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/a_page.html> (referer: http://localhost/main_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/a1_page.html> (referer: http://localhost/a_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/b_page.html> (referer: http://localhost/a1_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/b1_page.html> (referer: http://localhost/b_page.html)
2011-12-05 09:56:07+0530 [test_spider] INFO: Closing spider (finished)

It does not crawl all of the pages.

Note - I have set the crawl to breadth-first order (BFO), as described in the Scrapy documentation.
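For reference, the breadth-first settings described in the Scrapy FAQ look roughly like this in recent Scrapy releases. The question does not show the values actually used, and the 2011 release may have named these settings differently, so treat this as an assumption rather than the asker's configuration.

```python
# settings.py (sketch): ask the scheduler for breadth-first order (BFO).
# Assumption: values taken from the FAQ of recent Scrapy versions,
# not from the original question.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```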

What am I missing?

Best answer

By default, Scrapy filters out all duplicate requests.

You can work around this with, for example:

yield Request(url="http://test.com", callback=self.callback, dont_filter=True)

dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
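To make the filtering behaviour concrete, here is a minimal pure-Python sketch. It is a toy model, not Scrapy's implementation: ToyScheduler, enqueue, seen and dropped are hypothetical names invented for illustration, and it compares plain URL strings, whereas Scrapy's own filter works on request fingerprints. The page name comes from the question.

```python
from collections import deque


class ToyScheduler:
    """Hypothetical stand-in for a scheduler with a duplicate filter."""

    def __init__(self):
        self.seen = set()      # URLs that have already been scheduled
        self.queue = deque()   # requests waiting to be downloaded
        self.dropped = []      # requests rejected as duplicates

    def enqueue(self, url, dont_filter=False):
        """Schedule url; return False if it was filtered as a duplicate."""
        if not dont_filter and url in self.seen:
            self.dropped.append(url)
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True


scheduler = ToyScheduler()

# First request for a_page.html (linked from main_page.html): accepted.
first = scheduler.enqueue("a_page.html")

# Same URL again (linked from b1_page.html): filtered as a duplicate.
second = scheduler.enqueue("a_page.html")

# Same URL with dont_filter=True: the filter is bypassed.
forced = scheduler.enqueue("a_page.html", dont_filter=True)

print(first, second, forced)
```

In the question's log, the last page crawled is b1_page.html, whose only link points back to the already-crawled a_page.html, so that request is dropped in the same way. Passing dont_filter=True lets it through, but on a link structure with cycles like this one nothing then stops the crawl from revisiting pages indefinitely, which is the loop the documentation warns about.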

See also the Request object documentation.

Regarding python - Scrapy not crawling all pages, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/8381082/
