python - Parsing the results of additional requests in Scrapy

Reposted · Author: 行者123 · Updated: 2023-12-01 02:52:42

I am trying to scrape lynda.com courses and store their information in a CSV file. Here is my code:

# -*- coding: utf-8 -*-
import scrapy
import itertools


class LyndadevSpider(scrapy.Spider):
    name = 'lyndadev'
    allowed_domains = ['lynda.com']
    start_urls = ['https://www.lynda.com/Developer-training-tutorials']

    def parse(self, response):
        #print(response.url)
        titles = response.xpath('//li[@role="presentation"]//h3/text()').extract()
        descs = response.xpath('//li[@role="presentation"]//div[@class="meta-description hidden-xs dot-ellipsis dot-resize-update"]/text()').extract()
        links = response.xpath('//li[@role="presentation"]/div/div/div[@class="col-xs-8 col-sm-9 card-meta-data"]/a/@href').extract()

        for title, desc, link in itertools.izip(titles, descs, links):
            #print link
            categ = scrapy.Request(link, callback=self.parse2)
            yield {'desc': link, 'category': categ}

    def parse2(self, response):
        #getting categories by storing the navigation info
        item = response.xpath('//ol[@role="navigation"]').extract()
        return item

What I am trying to do here is grab the titles and descriptions from the tutorial list, then navigate to each URL and scrape the category in parse2.

However, I get results like this:

category,desc
<GET https://www.lynda.com/SVN-Subversion-tutorials/SVN-Java-Developers/552873-2.html>,https://www.lynda.com/SVN-Subversion-tutorials/SVN-Java-Developers/552873-2.html
<GET https://www.lynda.com/Java-tutorials/WebSocket-Programming-Java-EE/574694-2.html>,https://www.lynda.com/Java-tutorials/WebSocket-Programming-Java-EE/574694-2.html
<GET https://www.lynda.com/GameMaker-tutorials/Building-Physics-Based-Platformer-GameMaker-Studio-Using-GML/598780-2.html>,https://www.lynda.com/GameMaker-tutorials/Building-Physics-Based-Platformer-GameMaker-Studio-Using-GML/598780-2.html

How do I access the information I want?

Best answer

You need to yield a scrapy.Request from the parse method that handles the start_urls response (instead of yielding a dict). Also, I would rather loop over the course items and extract the information for each course separately.
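For context, the `<GET …>` strings in your CSV are just the `repr()` of the unexecuted `Request` objects: yielding a Request inside a dict hands the feed exporter an object it can only stringify, and Scrapy never schedules it. A minimal illustration of that serialization behaviour, using a hypothetical stand-in class so no Scrapy install is needed:

```python
# Stand-in for scrapy.Request, used only to illustrate the repr() behaviour;
# the real class lives in scrapy.http and prints itself the same way.
class FakeRequest:
    def __init__(self, url):
        self.url = url

    def __repr__(self):
        # Scrapy's Request repr looks like "<GET https://...>"
        return '<GET %s>' % self.url


row = {'desc': 'https://example.com/course',
       'category': FakeRequest('https://example.com/course')}

# A CSV exporter must turn every value into a string, so the Request
# is written out as its repr, not as the data it would have fetched:
print(str(row['category']))  # -> <GET https://example.com/course>
```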

I'm not sure what exactly you mean by category. I assume these are the tags you can see under "Skills covered in this course" at the bottom of a course detail page. But I might be wrong.

Try this code:

# -*- coding: utf-8 -*-
import scrapy


class LyndaSpider(scrapy.Spider):
    name = "lynda"
    allowed_domains = ["lynda.com"]
    start_urls = ['https://www.lynda.com/Developer-training-tutorials']

    def parse(self, response):
        courses = response.css('ul#category-courses div.card-meta-data')
        for course in courses:
            item = {
                'title': course.css('h3::text').extract_first(),
                'desc': course.css('div.meta-description::text').extract_first(),
                'link': course.css('a::attr(href)').extract_first(),
            }
            request = scrapy.Request(item['link'], callback=self.parse_course)
            request.meta['item'] = item
            yield request

    def parse_course(self, response):
        item = response.meta['item']
        #item['categories'] = response.css('div.tags a em::text').extract()
        item['category'] = response.css('ol.breadcrumb li:last-child a span::text').extract_first()
        return item

Regarding "python - Parsing the results of additional requests in Scrapy", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44561712/
