
python - How to stop Scrapy crawling after 10 404 pages?


I have a project that scrapes pages between two numbers. My spider is below. It starts at one number, goes up to another, and scrapes the pages in between.

I want it to stop after 10 consecutive 404 pages, but whatever happens it must save the CSV up to the point where it stopped.

Extra: is it possible to have it write the number at which it stopped to a separate text file?

Here is a sample of my log:

2017-01-25 19:57:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://domain.com/entry/65848514>
{'basligi': [u'murat boz'],
'entry': [u'<a href=https://domain.com/entry/65848514'],
'favori': [u'0'],
'yazari': [u'thrones']}
2017-01-25 19:57:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://domain.com/entry/65848520>
{'basligi': [u'fatih portakal'],
'entry': [u'<a href=https://domain.com/entry/65848520'],
'favori': [u'0'],
'yazari': [u'agamustaf']}
2017-01-25 19:57:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://domain.com/entry/65848525> (referer: None)
2017-01-25 19:57:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://domain.com/entry/65848528> (referer: None)
2017-01-25 19:57:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://domain.com/entry/65848529> (referer: None)
2017-01-25 19:57:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://domain.com/entry/65848527> (referer: None)

And my spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from project.items import ProjectItem
from scrapy import Request

class MySpider(BaseSpider):
    name = "project"
    allowed_domains = ["domain.com"]
    start_urls = ["https://domain.com/entry/%d" % i for i in range(65848505, 75848535)]

    def parse(self, response):
        titles = HtmlXPathSelector(response).select('//li')
        for title in titles:
            item = ProjectItem()
            item['favori'] = title.select("//*[@id='entry-list']/li/@data-favorite-count").extract()
            item['entry'] = ['<a href=https://domain.com%s' % a for a in title.select("//*[@class='entry-date permalink']/@href").extract()]
            item['yazari'] = title.select("//*[@id='entry-list']/li/@data-author").extract()
            item['basligi'] = title.select("//*[@id='topic']/h1/@data-title").extract()

        return item

Best Answer

There are many ways to do this. The simplest is probably to catch the 404 responses in the callback, count them, and raise a CloseSpider exception once the condition is met. For example:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from project.items import ProjectItem
from scrapy import Request
from scrapy.exceptions import CloseSpider

class MySpider(BaseSpider):
    name = "project"
    allowed_domains = ["domain.com"]
    start_urls = ["https://domain.com/entry/%d" % i for i in range(65848505, 75848535)]
    handle_httpstatus_list = [404]  # to catch 404 with callback
    count_404 = 0

    def parse(self, response):
        if response.status == 404:
            self.count_404 += 1
            if self.count_404 == 10:
                # stop spider on condition
                raise CloseSpider('Number of 404 errors exceeded')
            return None
        else:
            # reset on a successful page so only consecutive 404s count
            self.count_404 = 0
            titles = HtmlXPathSelector(response).select('//li')
            for title in titles:
                item = ProjectItem()
                item['favori'] = title.select("//*[@id='entry-list']/li/@data-favorite-count").extract()
                item['entry'] = ['<a href=https://domain.com%s' % a for a in title.select("//*[@class='entry-date permalink']/@href").extract()]
                item['yazari'] = title.select("//*[@id='entry-list']/li/@data-author").extract()
                item['basligi'] = title.select("//*[@id='topic']/h1/@data-title").extract()

            return item
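
For the extra part of the question (recording where the crawl stopped), one possibility, sketched here and not part of the accepted answer, is to remember the id of the last entry that parse() handled (e.g. self.last_id = response.url.rsplit('/', 1)[-1]) and persist it from the spider's closed() hook, which Scrapy calls when the spider shuts down, including after CloseSpider. The attribute name last_id and the file name stop_point.txt are assumptions:

    def closed(self, reason):
        # Scrapy calls this method automatically when the spider finishes
        with open('stop_point.txt', 'w') as f:  # file name is an assumption
            f.write(getattr(self, 'last_id', 'unknown'))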

A more elegant solution would be to write a custom downloader middleware to handle this case.
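
A minimal sketch of what such a middleware could look like, assuming the same "ten consecutive 404s" rule; the class name, threshold and close reason are made up for illustration, and the spider is shut down via crawler.engine.close_spider(), the same call the built-in CloseSpider extension uses:

class Consecutive404Middleware(object):
    """Downloader middleware: close the spider after N consecutive 404s."""

    def __init__(self, crawler, max_404=10):
        self.crawler = crawler
        self.max_404 = max_404      # hypothetical threshold
        self.count_404 = 0

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        if response.status == 404:
            self.count_404 += 1
            if self.count_404 >= self.max_404:
                # graceful shutdown; items scraped so far are still exported
                self.crawler.engine.close_spider(spider, 'too_many_404')
        else:
            self.count_404 = 0      # reset: only consecutive 404s count
        return response

It would then be enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py; the module path used there (e.g. project.middlewares.Consecutive404Middleware) is an assumption about the project layout.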

P.S.: I left start_urls as it is because it was in the question, but generating a list of 10,000,000 links and keeping it in memory is a huge overhead; you should either use a generator for start_urls or override start_requests.
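
A minimal sketch of the start_requests variant, using the bounds from the question; xrange keeps the ids lazy on Python 2 (which this BaseSpider-era code targets), while on Python 3 plain range already is:

    def start_requests(self):
        # yields requests one by one instead of building a 10,000,000-item list
        for i in xrange(65848505, 75848535):
            yield Request("https://domain.com/entry/%d" % i, callback=self.parse)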

Regarding python - How to stop Scrapy crawling after 10 404 pages?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41857860/
