
python - LinkExtractor in Scrapy: pagination and 2-depth links

Reposted · Author: 太空宇宙 · Updated: 2023-11-03 14:32:49

I am trying to understand how the LinkExtractor in Scrapy works. What I want to achieve:

  • Follow the pagination from the start page

  • Scan the pages for URLs and extract all links matching a certain pattern

  • On each page found, follow another link on that page that matches the pattern, and scrape that page

My code:

class ToScrapeMyspider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["myspider.com"]
    start_urls = ["www.myspider.com/category.php?k=766"]
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//link[@rel="next"]/a'), follow=True),
        Rule(LinkExtractor(allow=r"/product.php?p=\d+$"), callback='parse_spider'),
    )

    def parse_spider(self, response):
        Request(allow=r"/product.php?e=\d+$", callback=self.parse_spider2)

    def parse_spider2(self, response):
        # EXTRACT AND PARSE DATA HERE ETC (IS WORKING)
        pass
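One thing worth noting about the allow argument before looking at the XPath error: allow takes a regular expression, so the unescaped . and ? in r"/product.php?p=\d+$" are metacharacters (? makes the preceding p optional), and the pattern never actually matches a real product URL. A quick check with Python's re module (the URL below is a made-up example):

import re

# The question's pattern: '.' matches any character, and '?' makes the
# preceding 'p' optional, so the literal '?p=' in the URL never matches.
broken = r"/product.php?p=\d+$"
# Escaping the metacharacters gives the intended literal match.
fixed = r"/product\.php\?p=\d+$"

url = "https://myspider.com/product.php?p=123"  # hypothetical product URL

print(re.search(broken, url))  # None - the rule never fires
print(re.search(fixed, url))   # a match object

This is also why the accepted answer below escapes the dot (product\.php) and drops the query-string part of the pattern entirely.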

My pagination link looks like this:

<link rel="next" href="https://myspider.com/category.php?k=766&amp;s=100" >

First I get an error from restrict_xpaths:

'str' object has no attribute 'iter'

but I think I have messed something up.
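Independent of that error message, the XPath itself is off: //link[@rel="next"]/a selects an <a> element *inside* the <link> tag, but <link> is a void element and has no children (and, as far as I know, LinkExtractor only considers <a> and <area> tags by default anyway). The mismatch can be illustrated with the standard library's ElementTree on a made-up page fragment:

import xml.etree.ElementTree as ET

# Hypothetical fragment: a <link rel="next"> in the head and an
# <a rel="next"> in the body, both pointing at the next page.
html = (
    "<html><head>"
    "<link rel='next' href='category.php?k=766&amp;s=100' />"
    "</head><body>"
    "<a rel='next' href='category.php?k=766&amp;s=100'>next</a>"
    "</body></html>"
)
root = ET.fromstring(html)

# <link> is a void element, so it has no <a> children to select.
print(root.findall(".//link[@rel='next']/a"))  # []
# Selecting the <a> element directly, as the accepted answer does, works.
print(len(root.findall(".//a[@rel='next']")))  # 1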

Best answer

Finally got it working:

rules = (
    Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@rel="next"]',)), follow=True),
    Rule(LinkExtractor(allow=('product\.php',)), callback='parse_spider'),
)

BASE_URL = 'https://myspider.com/'

def parse_spider(self, response):
    links = response.xpath('//li[@id="id"]/a/@href').extract()
    for link in links:
        absolute_url = self.BASE_URL + link
        yield scrapy.Request(absolute_url, callback=self.parse_spider2)
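One nit on this answer: concatenating self.BASE_URL + link breaks as soon as an href is root-relative ('https://myspider.com/' + '/product.php' gives a double slash). Scrapy responses provide response.urljoin(link) for exactly this, which resolves the href against the page URL the way a browser would; it builds on the standard library's urljoin, shown here with made-up URLs:

from urllib.parse import urljoin

# Hypothetical page URL and hrefs as they might appear in the markup.
page = "https://myspider.com/category.php?k=766"

print(urljoin(page, "product.php?p=123"))
# https://myspider.com/product.php?p=123
print(urljoin(page, "/product.php?p=123"))
# https://myspider.com/product.php?p=123

# Naive concatenation produces a double slash for root-relative links:
print("https://myspider.com/" + "/product.php?p=123")
# https://myspider.com//product.php?p=123

So inside the spider, yield scrapy.Request(response.urljoin(link), callback=self.parse_spider2) is the more robust form.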

Regarding "python - LinkExtractor in Scrapy: pagination and 2-depth links", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47156621/
