
python - Scrapy CrawlSpider does not crawl the first landing page


I am new to Scrapy and I am working through a scraping exercise using CrawlSpider. The Scrapy framework works fine and follows the relevant links, but I cannot get the CrawlSpider to scrape the very first link (the home page / landing page). Instead, it goes straight to the links selected by the rules and never scrapes the landing page those links sit on. I am not sure how to fix this, since overriding the parse method of a CrawlSpider is not recommended, and toggling follow=True/False does not help either. Here is the code snippet:

class DownloadSpider(CrawlSpider):
    name = 'downloader'
    allowed_domains = ['bnt-chemicals.de']
    start_urls = [
        "http://www.bnt-chemicals.de"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow='prod'), callback='parse_item', follow=True),
    )
    fname = 1

    def parse_item(self, response):
        # Append the URL, crawl depth and page body to a numbered text file
        open(str(self.fname) + '.txt', 'a').write(response.url)
        open(str(self.fname) + '.txt', 'a').write(',' + str(response.meta['depth']))
        open(str(self.fname) + '.txt', 'a').write('\n')
        open(str(self.fname) + '.txt', 'a').write(response.body)
        open(str(self.fname) + '.txt', 'a').write('\n')
        self.fname = self.fname + 1

Best Answer

Just change your callback to parse_start_url and override it:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class DownloadSpider(CrawlSpider):
    name = 'downloader'
    allowed_domains = ['bnt-chemicals.de']
    start_urls = [
        "http://www.bnt-chemicals.de",
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow='prod'), callback='parse_start_url', follow=True),
    )
    fname = 0

    def parse_start_url(self, response):
        self.fname += 1
        fname = '%s.txt' % self.fname

        # Write the URL, crawl depth and page body to a new numbered text file
        with open(fname, 'w') as f:
            f.write('%s, %s\n' % (response.url, response.meta.get('depth', 0)))
            f.write('%s\n' % response.body)
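
This works because CrawlSpider sends the responses for start_urls to parse_start_url rather than to your rule callbacks, so reusing that method as the rule callback covers both the landing page and the matched links. Note that the scrapy.contrib modules and SgmlLinkExtractor used above have since been removed from Scrapy; the following is a minimal sketch of the same idea, assuming a recent Scrapy release where LinkExtractor is the replacement and parse_start_url is still available on CrawlSpider:

# Minimal sketch of the same spider on a modern Scrapy release
# (assumption: a Scrapy version where LinkExtractor replaced SgmlLinkExtractor).
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DownloadSpider(CrawlSpider):
    name = 'downloader'
    allowed_domains = ['bnt-chemicals.de']
    start_urls = ['http://www.bnt-chemicals.de']

    rules = (
        # Pointing the callback at parse_start_url means the landing page and
        # the links matching 'prod' are all handled by the same method.
        Rule(LinkExtractor(allow='prod'), callback='parse_start_url', follow=True),
    )
    fname = 0

    def parse_start_url(self, response):
        self.fname += 1
        # response.body is bytes in modern Scrapy, so write the decoded text instead
        with open('%s.txt' % self.fname, 'w') as f:
            f.write('%s, %s\n' % (response.url, response.meta.get('depth', 0)))
            f.write(response.text)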

Regarding python - Scrapy CrawlSpider does not crawl the first landing page, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/15836062/
