
python - EDIT: How do I create a "Nested Loop" that returns an item to the original list in Python and Scrapy


EDIT:

OK, I've spent all day trying to figure this out and unfortunately I still haven't managed it. What I have now is:

import scrapy

from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        yield scrapy.Request(response.url, callback=self.primary_parse)
        yield scrapy.Request(response.url, callback=self.secondary_parse)

    def primary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        itemlist = []
        product = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        price = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()

        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

    def secondary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        itemlist = []
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()

        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

The problem is that I can't seem to get the second parse going... I can only ever get one parse to run.

Is there a way to run both parses, either simultaneously or one after the other?
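A likely reason only one callback runs: by default Scrapy filters out requests to a URL it has already seen, so the second scrapy.Request(response.url, ...) is dropped as a duplicate before its callback ever fires. Passing dont_filter=True to that Request is one way around it. A minimal pure-Python sketch of the filtering behaviour (no Scrapy required):

```python
# Sketch of the scheduler's duplicate filter: it remembers which URLs it
# has seen and silently drops repeats unless dont_filter=True is set.
seen = set()

def schedule(url, dont_filter=False):
    """Return True if the request would actually be scheduled."""
    if not dont_filter and url in seen:
        return False  # duplicate: dropped, its callback never runs
    seen.add(url)
    return True

url = "http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"
print(schedule(url))                    # first request is scheduled
print(schedule(url))                    # second request to the same URL is filtered
print(schedule(url, dont_filter=True))  # dont_filter=True bypasses the filter
```

In the spider itself this would mean e.g. `scrapy.Request(response.url, callback=self.secondary_parse, dont_filter=True)`.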


Original:

I'm slowly getting the hang of this (Python and Scrapy), but I've hit a wall. What I'm trying to do is this:

There is a photography retail site that lists its products like this:

Name of Camera Body
Price

With Such and Such Lens
Price

With Another Such and Such Lens
Price

What I want to do is scrape that information and organize it in a list like the one below (which I could then output to a csv file without much trouble):

product,price
camerabody1,$100
camerabody1+lens1,$200
camerabody1+lens1+lens2,$300
camerabody2,$150
camerabody2+lens1,$200
camerabody2+lens1+lens2,$250
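Setting Scrapy aside for a moment, the target layout is just a loop within a loop over plain lists; a sketch with made-up data matching the rows above:

```python
# Hypothetical stand-ins for the scraped camera bodies and their kit options
bodies = [("camerabody1", "$100"), ("camerabody2", "$150")]
kits = {
    "camerabody1": [("+lens1", "$200"), ("+lens1+lens2", "$300")],
    "camerabody2": [("+lens1", "$200"), ("+lens1+lens2", "$250")],
}

rows = [("product", "price")]
for body, price in bodies:
    rows.append((body, price))              # the body on its own
    for suffix, kit_price in kits[body]:    # then every kit built on that body
        rows.append((body + suffix, kit_price))

for product, price in rows:
    print(f"{product},{price}")
```

The same body-then-kits shape is what the spider needs to reproduce per listing element.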

My current spider code:

import scrapy

from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()
        subproduct = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        subprice = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()

        itemlist = []
        for product, price, subproduct, subprice in zip(product, price, subproduct, subprice):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            item['product'] = product + " " + subproduct.strip().upper()
            item['price'] = subprice.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

This doesn't do what I want, and I'm not sure what to do next. I tried a for loop inside a for loop, but that didn't work either; it just output jumbled results.
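The jumbled output is what zip gives you when the parallel lists are out of step: there is one name and one price per product, but several entries per product in the subproduct lists, so the positions stop corresponding. A toy illustration:

```python
products = ["body1", "body2"]
subproducts = ["body1+lens1", "body1+lens2", "body2+lens1"]  # 3 entries, not 2

# zip pairs purely by position and stops at the shortest list, so body2
# gets matched with one of body1's kits and body2's own kit is dropped:
pairs = list(zip(products, subproducts))
print(pairs)  # [('body1', 'body1+lens1'), ('body2', 'body1+lens2')]
```

This is why the subproducts have to be looped per product element, not zipped across the whole page.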

Also, just for reference, my items.py:

import scrapy

class ArcherItemGeorges(scrapy.Item):
    product = scrapy.Field()
    price = scrapy.Field()
    subproduct = scrapy.Field()
    subprice = scrapy.Field()
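For context, a scrapy.Item behaves like a dict that only accepts the declared fields: assigning an undeclared key raises KeyError. A stand-in class (FakeItem is my name, purely for illustration) mimics that behaviour without importing Scrapy:

```python
class FakeItem(dict):
    """Dict restricted to declared field names, like scrapy.Item."""
    fields = {"product", "price", "subproduct", "subprice"}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)

item = FakeItem()
item["product"] = "CAMERABODY1"
item["price"] = "100"
print(item)  # {'product': 'CAMERABODY1', 'price': '100'}
try:
    item["colour"] = "black"   # undeclared, like a missing scrapy.Field()
except KeyError as e:
    print("rejected:", e)
```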

Any help would be appreciated. I'm doing my best to learn, but as a Python newbie I feel I need some guidance.

Best Answer

As your intuition suggests, the structure of the elements you are scraping calls for a loop within a loop. With your code rearranged a little, you can get a list containing every product together with all of its subproducts.

I have renamed request to product and introduced the subproduct variable for clarity. I imagine the subproduct loop is the one you were trying to work out.

def parse(self, response):
    # Loop all the product elements
    for product in response.xpath('//div[@class="listing-item"]'):
        item = ArcherItemGeorges()
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        item['product'] = product_name
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the raw primary item
        yield item
        # Yield one item per secondary (sub)product
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            # Use a fresh item each time: re-yielding a mutated item can
            # corrupt earlier results if a pipeline keeps a reference to it
            item = ArcherItemGeorges()
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            yield item

Of course, you still need to apply the uppercasing, price cleaning, etc. to the corresponding fields.
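The chain of .replace() calls from the question can be collected into one small helper used by both loops; a sketch (clean_price is my name, and the regex anchors ".00" to the end of the string so it cannot eat digits out of the middle of a price):

```python
import re

def clean_price(raw):
    """'  $1,299.00 ' -> '1299': strip currency, separators, trailing .00."""
    s = raw.strip().replace("$", "").replace(",", "").replace(" ", "")
    return re.sub(r"\.00$", "", s)

print(clean_price("  $1,299.00 "))  # 1299
print(clean_price("$150.00"))       # 150
```

Inside the loop this becomes `item['price'] = clean_price(raw_price)`.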

A brief explanation:

Once the page has been downloaded, the parse method is called with a Response object (the HTML page). From that Response we have to extract/scrape the data in the form of items. In this case we want to return a list of product/price items, and this is where the magic of the yield expression comes into play. You can think of it as an on-demand return that does not finish the function's execution; a function that uses it is known as a generator. Scrapy will keep pulling from the parse generator until it has no more items to yield, and therefore no more items to scrape from the Response.
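The yield behaviour described above can be seen with a plain generator, no Scrapy needed:

```python
def parse_fake():
    """Stands in for a Scrapy parse callback: yields items on demand."""
    for name in ["body1", "body2"]:
        yield {"product": name}             # the primary item
        yield {"product": name + "+lens1"}  # then its subproduct item

# Scrapy drains the generator the same way list() does here:
items = list(parse_fake())
print(items)
```

Each call into the generator resumes the loop where it left off, which is why the primary item and its subproduct items come out interleaved in document order.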

Annotated code:

def parse(self, response):
    # Loop all the product elements, those div elements with a "listing-item" class
    for product in response.xpath('//div[@class="listing-item"]'):
        # Create an empty item container
        item = ArcherItemGeorges()
        # Scrape the primary product name and keep it in a variable for later use
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        # Fill the 'product' field with the product name
        item['product'] = product_name
        # Fill the 'price' field with the scraped primary product price
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the primary product item, the one with the primary name and price
        yield item
        # Now, for each product, we need to loop through all the subproducts
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            # Prepare a fresh item (re-yielding a mutated item can corrupt
            # earlier results if a pipeline keeps a reference to it)
            item = ArcherItemGeorges()
            # The subproduct name is appended to the stored product_name,
            # that is, product + subproduct
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            # And the price field gets the subproduct price
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            # Now yield the composed product + subproduct item
            yield item

For python - EDIT: How do I create a "Nested Loop" that returns an item to the original list in Python and Scrapy, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26281914/
