
python - Scrapy duplicate results

Reposted. Author: 行者123. Updated: 2023-12-01 00:56:17

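I am trying to parse various data items for each advert on a page, for example https://www.pistonheads.com/classifieds?Category=used-cars&M=1044&ResultsPerPage=750

My code captures most of the items correctly. However, I have two problems:

  1. The output in the year column is the same for every row. This happens even though the XPath is exactly the same as the one used for the title column, which works correctly.
  2. In my output, every row has a Transmission value. That is wrong, because not every advert has this field populated.

General comments on my code are also appreciated. Perhaps I should be using ItemLoaders for this? (I haven't learned how they work yet.)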

import scrapy
from datetime import date


class SuperScraper(scrapy.Spider):
    name = 'ss22'

    def start_requests(self):
        urls = 'https://www.pistonheads.com/classifieds?Category=used-cars&M=1044&ResultsPerPage=750'
        yield scrapy.Request(urls, callback = self.parse_data)

    def parse_data( self, response ):
        advert = response.xpath( '//*[@class="ad-listing"]')
        title = advert.xpath( './/*[@class="listing-headline"]//h3/text()' ).extract()
        year = advert.xpath( './/*[@class="listing-headline"]//h3/text()' ).extract()
        price = advert.xpath( './/*[@class="price"]/text()' ).extract()
        mileage = advert.xpath( './/*[contains(@class, "flaticon solid gauge-1")]/following-sibling::text()' ).extract()
        mileage = [item.strip() for item in mileage]
        mileage = [item.replace(',','') for item in mileage]
        mileage = [item.replace(' miles','') for item in mileage]
        timestamp = str(date.today()).split('.')[0]
        timestamps = [timestamp for i in range(len(title))]
        model = response.xpath('//head/title/text()').extract()
        model = [item.replace("Used ","") for item in model]
        model = [item.replace(" cars for sale with PistonHeads","") for item in model]
        models = [model for i in range(len(title))]
        transmission = advert.xpath('.//*[contains(@class, "flaticon solid location-pin-4")]/following-sibling::text()').extract()
        transmission = [item.strip() for item in transmission]
        link = advert.xpath( './/*[@class="listing-headline"]/a/@href' ).extract()
        link = ['https:\\www.pistonheads.com' + i for i in link]

        for item in zip(timestamps,link,models,title,year,price,mileage,transmission):
            price_data = {
                'timestamp' : item[0],
                'link' :item[1],
                'model' : item[2],
                'title' : item[3],
                'year' : year[4],
                'price' : item[5],
                'mileage' : item[6],
                'transmission' :item[7]
            }
            yield price_data

Best answer

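  1. You have 'year': year[4], which indexes the whole year list with a constant instead of reading item[4] from the current row. So yes, it will always give you the same value.

  2. Because you have 70 transmission values and 73 items, zip pairs the transmissions with the wrong items. So I suggest doing this: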

import scrapy
from datetime import date


class SuperScraper(scrapy.Spider):
    name = 'ss22'

    def start_requests(self):
        url = 'https://www.pistonheads.com/classifieds?Category=used-cars&M=1044&ResultsPerPage=750'
        yield scrapy.Request(url, callback=self.parse_data)

    def parse_data(self, response):
        # The model comes from the page <title>, so compute it once per page.
        model = response.xpath('//head/title/text()').get('')
        model = model.replace("Used ", "").replace(" cars for sale with PistonHeads", "")
        # Iterate advert by advert so every field is read relative to its own row.
        for row in response.xpath('//*[@class="ad-listing"]'):
            # .get('') returns '' when this advert lacks the field.
            transmission = row.xpath('.//*[contains(@class, "flaticon solid location-pin-4")]/following-sibling::text()').get('')
            mileage = row.xpath('.//*[contains(@class, "flaticon solid gauge-1")]/following-sibling::text()').get('')
            price_data = {
                'timestamp': date.today().isoformat(),
                'link': 'https://www.pistonheads.com' + row.xpath('.//*[@class="listing-headline"]/a/@href').get(''),
                'model': model,
                'title': row.xpath('.//*[@class="listing-headline"]//h3/text()').get('').strip(),
                # Same XPath as 'title', as in the question, so this is the full headline text.
                'year': row.xpath('.//*[@class="listing-headline"]//h3/text()').get(''),
                'price': row.xpath('.//*[@class="price"]/text()').get('').strip(),
                'mileage': mileage.replace(',', '').replace(' miles', '').strip(),
                'transmission': transmission.strip(),
            }
            yield price_data

Here we iterate item by item, so a missing transmission value stays with its own advert and never shifts onto another one.
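
Both problems can be reproduced without Scrapy. The sketch below is only an illustration: the advert data, the field values, and the variable names are made up, and plain lists stand in for the results of the XPath queries. It shows that a constant index repeats one value in every row, and that zip pairs lists by position and stops at the shortest one, so a missing field shifts later values onto the wrong advert.

```python
# Hypothetical stand-ins for the extracted columns (not real PistonHeads data).
adverts = [
    {"title": "Car A", "year": "2015", "transmission": "Manual"},
    {"title": "Car B", "year": "2017", "transmission": None},  # field absent
    {"title": "Car C", "year": "2019", "transmission": "Automatic"},
]

titles = [ad["title"] for ad in adverts]
years = [ad["year"] for ad in adverts]

# Problem 1: indexing the whole list with a constant gives the same value
# for every row; indexing the current tuple gives that row's value.
buggy_years = [years[1] for item in zip(titles, years)]
fixed_years = [item[1] for item in zip(titles, years)]

# Problem 2: column-wise extraction silently skips adverts without the field,
# so the transmission list is shorter than the title list.
transmissions = [ad["transmission"] for ad in adverts if ad["transmission"]]
column_wise = list(zip(titles, transmissions))

# Row-wise extraction keeps each value with its own advert and falls back
# to an empty string, like .get('') in the answer.
row_wise = [(ad["title"], ad["transmission"] or "") for ad in adverts]

print(buggy_years)
print(fixed_years)
print(column_wise)
print(row_wise)
```

In the column-wise result, "Car B" is paired with the transmission that belongs to "Car C", and "Car C" is dropped entirely. The row-wise result keeps all three adverts and leaves the missing value empty.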

Regarding python - Scrapy duplicate results, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56225394/
