
python - Scrapy stops scraping but keeps crawling


I am trying to scrape different pieces of information from several pages of a website. Up to the sixteenth page everything works: the pages are crawled and scraped, and the data is stored in my database. After the sixteenth page, however, Scrapy stops scraping but keeps crawling. I checked the website and there are 470 pages with more information. The HTML tags are the same, so I don't understand why it stops scraping.

Python:

def url_lister():
    url_list = []
    page_count = 1
    while page_count < 480:
        url = 'https://www.active.com/running?page=%s' % page_count
        url_list.append(url)
        page_count += 1
    return url_list

import scrapy
from scrapy.selector import Selector
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join

# ActiveItem and string are defined elsewhere in the project

class ListeCourse_level1(scrapy.Spider):
    name = 'ListeCAP_ACTIVE'
    allowed_domains = ['www.active.com']
    start_urls = url_lister()

    def parse(self, response):
        selector = Selector(response)
        for uneCourse in response.xpath('//*[@id="lpf-tabs2-a"]/article/div/div/div/a[@itemprop="url"]'):
            loader = ItemLoader(ActiveItem(), selector=uneCourse)
            loader.add_xpath('nom_evenement', './/div[2]/div/h5[@itemprop="name"]/text()')
            loader.default_input_processor = MapCompose(string)
            loader.default_output_processor = Join()
            yield loader.load_item()

Shell:

> 2018-01-23 17:22:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.active.com/running?page=15>
> {'nom_evenement': 'Enniscrone 10k run & 5k run/walk'}
> 2018-01-23 17:22:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=16> (referer: None)
> --------------------------------------------------
> SCRAPING DES ELEMENTS EVENTS
> --------------------------------------------------
> 2018-01-23 17:22:34 [scrapy.extensions.logstats] INFO: Crawled 17 pages (at 17 pages/min), scraped 155 items (at 155 items/min)
> 2018-01-23 17:22:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=17> (referer: None)
> --------------------------------------------------
> SCRAPING DES ELEMENTS EVENTS
> --------------------------------------------------
> 2018-01-23 17:22:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=18> (referer: None)
> --------------------------------------------------
> SCRAPING DES ELEMENTS EVENTS
> --------------------------------------------------
> 2018-01-23 17:22:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=19> (referer: None)

Best Answer

This is probably because only 17 pages contain the content you are looking for, while you instruct Scrapy to visit all 480 pages of the form https://www.active.com/running?page=NNN. A better approach is to check on each page you visit whether there is a next page, and only in that case yield a Request for it.
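One quick way to verify this (my addition, not from the original answer) is to open one of the later pages in scrapy shell and run the spider's own XPath against it; if the diagnosis is right, the selector should come back empty:

$ scrapy shell 'https://www.active.com/running?page=18'
>>> # same XPath the spider uses to find event links
>>> response.xpath('//*[@id="lpf-tabs2-a"]/article/div/div/div/a[@itemprop="url"]')
[]  # an empty list here would confirm the page has no matching content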

So I would refactor your code into something like this (not tested):

class ListeCourse_level1(scrapy.Spider):
    name = 'ListeCAP_ACTIVE'
    allowed_domains = ['www.active.com']
    base_url = 'https://www.active.com/running'
    start_urls = [base_url]

    def parse(self, response):
        selector = Selector(response)
        for uneCourse in response.xpath('//*[@id="lpf-tabs2-a"]/article/div/div/div/a[@itemprop="url"]'):
            loader = ItemLoader(ActiveItem(), selector=uneCourse)
            loader.add_xpath('nom_evenement', './/div[2]/div/h5[@itemprop="name"]/text()')
            loader.default_input_processor = MapCompose(string)
            loader.default_output_processor = Join()
            yield loader.load_item()
        # follow the next page only if a next-page link exists on this page
        if response.xpath('//a[contains(@class, "next-page")]'):
            next_page = response.meta.get('page_number', 1) + 1
            # note: base_url is a class attribute, so it must be accessed via self
            next_page_url = '{}?page={}'.format(self.base_url, next_page)
            yield scrapy.Request(next_page_url, callback=self.parse, meta={'page_number': next_page})
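As a design note, a variant worth considering (my own sketch, not part of the original answer) is to follow the href of the next-page link itself rather than counting page numbers in meta; this avoids hand-building URLs and keeps working if the site changes its URL scheme. It assumes the next-page anchor actually carries a usable href:

import scrapy

class NextPageSketchSpider(scrapy.Spider):
    # hypothetical minimal spider illustrating href-based pagination
    name = 'next_page_sketch'
    allowed_domains = ['www.active.com']
    start_urls = ['https://www.active.com/running']

    def parse(self, response):
        # ... extract items here, as in the spider above ...
        # follow the next-page link's own href, if present;
        # response.follow resolves relative URLs against the current page
        next_href = response.xpath('//a[contains(@class, "next-page")]/@href').get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)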

Regarding "python - Scrapy stops scraping but keeps crawling", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48406852/
