
python - Scrape a list of items and merge them into one property

Reposted. Author: 行者123. Updated: 2023-12-01 07:58:29

My current spider only parses the product properties and never outputs item['title']. How can I combine them into a single item? Example page:

https://universalmotors.ru/motorcycles/lifan/motorcycle-lifan-lf150-13-2017/

My spider:

# -*- coding: utf-8 -*-
from scrapy.spiders import SitemapSpider as CrawlSpider
from ..items import DistPracticalItem


class SitemapSpider(CrawlSpider):
    name = 'sitemap3'
    allowed_domains = ['universalmotors.ru']
    sitemap_urls = ['https://universalmotors.ru/sitemap.xml']
    # sitemap_follow = ['deal']
    # sitemap_rules = [(r'^https?://sz.*deal/[0-8]{1,8}\.html$', 'parse_item')]
    sitemap_rules = [('/motorcycles/', 'parse_item')]

    def parse_item(self, response):
        item = DistPracticalItem()
        # item['name'] = response.xpath('//h1[contains(@class,"good__title")]/text()').extract_first()
        item['title'] = response.css("h1.good__title::text").extract()
        # prop = response.xpath('normalize-space(//tr[@itemprop="additionalProperty"])').extract()
        item['price'] = response.css('div.deal-info span.campaign-price').css('::text').extract_first()
        # item['comments'] = response.css('div.comment div.total').css('::text').extract()
        # return item
        # for item in response.xpath('//tr[@itemprop="additionalProperty"]'):
        for item in response.xpath('//tr[@itemprop="additionalProperty"]'):
            yield {
                'name': item.xpath('normalize-space(./*[@class="label_table"])').extract_first(),
                'value': item.xpath('normalize-space(./*[@class="value_table"])').extract_first(),
                # 'title': response.css("h1.good__title::text").extract()
            }

My goal is to get a list of scraped items, each with its list of properties, like this:

Title of the Item 1| Price 1 | Property 1, Property 2, property 3
Title of the Item 2| Price 2 | Property 1, Property 2, property 3
Title of the Item 3| Price 3 | Property 1, Property 2, property 3

Best answer

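You have to yield the complete item you want to scrape. Your code yields only the properties, one dict per table row, and never the title and price.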

I modified your code and it seems to work as expected. I removed the item import and some comments so it would run on my machine.

from scrapy.spiders import SitemapSpider as CrawlSpider


class SitemapSpider(CrawlSpider):
    name = 'sitemap3'
    allowed_domains = ['universalmotors.ru']
    sitemap_urls = ['https://universalmotors.ru/sitemap.xml']
    sitemap_rules = [('/motorcycles/', 'parse_item')]

    def parse_item(self, response):
        # Build one item per product page.
        item = dict()
        item['title'] = response.css("h1.good__title::text").extract_first()
        item['price'] = response.css('div.deal-info span.campaign-price').css('::text').extract_first()
        item['properties'] = list()
        # Collect every property row into the same item instead of yielding each row.
        for prop in response.xpath('//tr[@itemprop="additionalProperty"]'):
            item['properties'].append(
                {
                    'name': prop.xpath('normalize-space(./*[@class="label_table"])').extract_first(),
                    'value': prop.xpath('normalize-space(./*[@class="value_table"])').extract_first(),
                }
            )
        # Yield once, after the loop, so title, price and properties stay together.
        yield item

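Note that I collect all the information inside the item variable, which is a dict in this example and a DistPracticalItem in your code. If you keep DistPracticalItem, make sure it declares a properties field, since a Scrapy Item rejects keys that were not declared as fields.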

You will end up with the following schema:

{
    'title': string,
    'price': string,
    'properties': list of dicts with 'name' and 'value' as strings
}
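
To get from this schema to the one-line-per-item layout shown in the question, each item can be flattened after scraping. The following is only a minimal sketch: format_item_line and the sample values are hypothetical names and data made up for illustration, not part of the original answer, and joining each property as "name: value" is an assumption about the desired output.

```python
def format_item_line(item):
    """Render one scraped item as 'title | price | name: value, ...'."""
    # extract_first() can return None, so missing values become empty strings.
    title = (item.get('title') or '').strip()
    price = (item.get('price') or '').strip()
    props = ', '.join(
        '{}: {}'.format(p.get('name') or '', p.get('value') or '')
        for p in item.get('properties', [])
    )
    return ' | '.join([title, price, props])


# Hypothetical sample item, shaped like the schema above.
sample = {
    'title': 'Lifan LF150-13',
    'price': '100 000',
    'properties': [
        {'name': 'Engine', 'value': '150 cc'},
        {'name': 'Year', 'value': '2017'},
    ],
}

print(format_item_line(sample))
# prints: Lifan LF150-13 | 100 000 | Engine: 150 cc, Year: 2017
```

A function like this could be called from an item pipeline or applied to the exported feed afterwards. Keeping properties as a list in the item and flattening only at output time leaves the structured data available for other uses.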

I hope that is clear.

The original question, "python - Scrape a list of items and merge them into one property", can be found on Stack Overflow: https://stackoverflow.com/questions/55829359/
