
python - scrapy - terminate crawling when following an infinite website


Suppose I have a web page like this.

counter.php

<?php
if (isset($_GET['count'])) {
    $count = intval($_GET['count']);
    $previous = $count - 1;
    $next = $count + 1;
?>
    <a href="?count=<?php echo $previous; ?>">< Previous</a>

    Current: <?php echo $count; ?>

    <a href="?count=<?php echo $next; ?>">Next ></a>
<?php
}
?>

This is an "infinite" website, because you can just keep clicking Next to go to the next page (the counter simply increments), or Previous, and so on.

However, if I crawl this page with scrapy like this and follow the links, scrapy will never stop crawling.

Example spider:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

urls = []

class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/counter?count=1']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        urls.append(response.url)

What kind of mechanism can I use to determine whether I am in fact stuck on an infinite website and need to break out of it?

Best Answer

You can keep paginating yourself: if there are no ITEMS on the page, or there is no NEXT PAGE button, that means pagination has ended.

import logging

from scrapy import Request
from scrapy.spiders import CrawlSpider


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['example.com']

    def start_requests(self):
        page = 1
        yield Request("http://example.com/counter?page=%s" % (page),
                      meta={"page": page}, callback=self.parse_item)

    def parse_item(self, response):

        # METHOD 1: check whether items are available on this page
        items = response.css("li.items")

        if items:
            # Now go to the next page
            page = int(response.meta['page']) + 1
            yield Request("http://example.com/counter?page=%s" % (page),
                          meta={"page": page}, callback=self.parse_item)
        else:
            logging.info("%s was last page" % response.url)

        # METHOD 2: check whether this page has a NEXT PAGE button; most websites have one
        nextPage = response.css("a.nextpage")

        if nextPage:
            # Now go to the next page
            page = int(response.meta['page']) + 1
            yield Request("http://example.com/counter?page=%s" % (page),
                          meta={"page": page}, callback=self.parse_item)
        else:
            logging.info("%s was last page" % response.url)
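
Note that the selectors in the answer (li.items, a.nextpage) are placeholders for whatever markup the target site actually uses. On the counter page from the question, neither check would ever fail, because a "Next >" link is always rendered, so an explicit upper bound is the practical way to break out. Below is a minimal sketch, assuming the question's counter URL and a hypothetical max_count cut-off; Scrapy's built-in CLOSESPIDER_PAGECOUNT setting is added as a second safety net:

import scrapy


class CounterSpider(scrapy.Spider):
    name = 'counter'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/counter?count=1']

    # Built-in safety net: close the spider after 100 downloaded pages,
    # even if links keep appearing.
    custom_settings = {'CLOSESPIDER_PAGECOUNT': 100}

    max_count = 50  # hypothetical cut-off chosen for this sketch

    def parse(self, response):
        yield {'url': response.url}

        # The counter page always renders a "Next >" link, so stop following
        # it once the count in the URL reaches the chosen cut-off.
        # (Simple parsing that assumes count= is the only query parameter.)
        count = int(response.url.split('count=')[-1])
        if count < self.max_count:
            next_href = response.xpath('//a[contains(text(), "Next")]/@href').get()
            if next_href:
                yield response.follow(next_href, callback=self.parse)

DEPTH_LIMIT works similarly if you prefer to bound how many link hops the crawler follows from the start URL.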

Regarding python - scrapy - terminate crawling when following an infinite website, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53033631/
