
python - EDIT: How do I create a "Nested Loop" that returns an item to the original list in Python and Scrapy


EDIT:

OK, I've spent all day trying to figure this out and unfortunately I still haven't managed it. What I have now is:

import scrapy

from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        yield scrapy.Request(response.url, callback=self.primary_parse)
        yield scrapy.Request(response.url, callback=self.secondary_parse)

    def primary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        itemlist = []
        product = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        price = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()

        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

    def secondary_parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')

        itemlist = []
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()

        for product, price in zip(product, price):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

The problem is that I can't seem to get the second parse going... I can only ever get one parse to run.

Is there a way to run both parses, either simultaneously or one after the other?
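A likely reason only one callback runs: by default Scrapy filters out requests to a URL it has already seen, so the second scrapy.Request(response.url, ...) is dropped as a duplicate before its callback ever fires. Passing dont_filter=True to that Request is one way around it. A minimal pure-Python sketch of the filtering behaviour (no Scrapy required):

```python
# Sketch of the scheduler's duplicate filter: it remembers which URLs it
# has seen and silently drops repeats unless dont_filter=True is set.
seen = set()

def schedule(url, dont_filter=False):
    """Return True if the request would actually be scheduled."""
    if not dont_filter and url in seen:
        return False  # duplicate: dropped, its callback never runs
    seen.add(url)
    return True

url = "http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"
print(schedule(url))                    # first request is scheduled
print(schedule(url))                    # second request to the same URL is filtered
print(schedule(url, dont_filter=True))  # dont_filter=True bypasses the filter
```

In the spider itself this would mean e.g. `scrapy.Request(response.url, callback=self.secondary_parse, dont_filter=True)`.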


Original:

I'm slowly getting the hang of this (Python and Scrapy), but I've hit a wall. What I'm trying to do is this:

There is a photography retail site that lists its products like this:

Name of Camera Body
Price

With Such and Such Lens
Price

With Another Such and Such Lens
Price

What I want to do is scrape that information and organize it in a list like the one below (which I could then output to a csv file without much trouble):

product,price
camerabody1,$100
camerabody1+lens1,$200
camerabody1+lens1+lens2,$300
camerabody2,$150
camerabody2+lens1,$200
camerabody2+lens1+lens2,$250
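Setting Scrapy aside for a moment, the target layout is just a loop within a loop over plain lists; a sketch with made-up data matching the rows above:

```python
# Hypothetical stand-ins for the scraped camera bodies and their kit options
bodies = [("camerabody1", "$100"), ("camerabody2", "$150")]
kits = {
    "camerabody1": [("+lens1", "$200"), ("+lens1+lens2", "$300")],
    "camerabody2": [("+lens1", "$200"), ("+lens1+lens2", "$250")],
}

rows = [("product", "price")]
for body, price in bodies:
    rows.append((body, price))              # the body on its own
    for suffix, kit_price in kits[body]:    # then every kit built on that body
        rows.append((body + suffix, kit_price))

for product, price in rows:
    print(f"{product},{price}")
```

The same body-then-kits shape is what the spider needs to reproduce per listing element.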

My current spider code:

import scrapy

from Archer.items import ArcherItemGeorges

class georges_spider(scrapy.Spider):
    name = "GEORGES"
    allowed_domains = ["georges.com.au"]
    start_urls = ["http://www.georges.com.au/index.php/digital-slr-cameras/canon-digital-slr-cameras.html?limit=all"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="listing-item"]')
        product = requests.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()
        price = requests.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()
        subproduct = requests.xpath('.//*[@class="more-views"]/following-sibling::div/a/text()').extract()
        subprice = requests.xpath('.//*[@class="more-views"]/following-sibling::div/text()[2]').extract()

        itemlist = []
        for product, price, subproduct, subprice in zip(product, price, subproduct, subprice):
            item = ArcherItemGeorges()
            item['product'] = product.strip().upper()
            item['price'] = price.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            item['product'] = product + " " + subproduct.strip().upper()
            item['price'] = subprice.strip().replace('$', '').replace(',', '').replace('.00', '').replace(' ', '')
            itemlist.append(item)
        return itemlist

This doesn't do what I want, and I'm not sure what to do next. I tried a for loop inside a for loop, but that didn't work either; it just output jumbled results.
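The jumbled output is what zip gives you when the parallel lists are out of step: there is one name and one price per product, but several entries per product in the subproduct lists, so the positions stop corresponding. A toy illustration:

```python
products = ["body1", "body2"]
subproducts = ["body1+lens1", "body1+lens2", "body2+lens1"]  # 3 entries, not 2

# zip pairs purely by position and stops at the shortest list, so body2
# gets matched with one of body1's kits and body2's own kit is dropped:
pairs = list(zip(products, subproducts))
print(pairs)  # [('body1', 'body1+lens1'), ('body2', 'body1+lens2')]
```

This is why the subproducts have to be looped per product element, not zipped across the whole page.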

Also, just for reference, my items.py:

import scrapy

class ArcherItemGeorges(scrapy.Item):
    product = scrapy.Field()
    price = scrapy.Field()
    subproduct = scrapy.Field()
    subprice = scrapy.Field()
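For context, a scrapy.Item behaves like a dict that only accepts the declared fields: assigning an undeclared key raises KeyError. A stand-in class (FakeItem is my name, purely for illustration) mimics that behaviour without importing Scrapy:

```python
class FakeItem(dict):
    """Dict restricted to declared field names, like scrapy.Item."""
    fields = {"product", "price", "subproduct", "subprice"}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)

item = FakeItem()
item["product"] = "CAMERABODY1"
item["price"] = "100"
print(item)  # {'product': 'CAMERABODY1', 'price': '100'}
try:
    item["colour"] = "black"   # undeclared, like a missing scrapy.Field()
except KeyError as e:
    print("rejected:", e)
```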

Any help would be appreciated. I'm doing my best to learn, but as a Python newbie I feel I need some guidance.

Best Answer

As your intuition suggests, the structure of the elements you are scraping calls for a loop within a loop. With your code rearranged a little, you can get a list containing every product together with all of its subproducts.

I have renamed request to product and introduced the subproduct variable for clarity. I imagine the subproduct loop is the one you were trying to work out.

def parse(self, response):
    # Loop all the product elements
    for product in response.xpath('//div[@class="listing-item"]'):
        item = ArcherItemGeorges()
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        item['product'] = product_name
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the raw primary item
        yield item
        # Yield one item per secondary (sub)product
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            # Use a fresh item each time: re-yielding a mutated item can
            # corrupt earlier results if a pipeline keeps a reference to it
            item = ArcherItemGeorges()
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            yield item

Of course, you still need to apply the uppercasing, price cleaning, etc. to the corresponding fields.
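The chain of .replace() calls from the question can be collected into one small helper used by both loops; a sketch (clean_price is my name, and the regex anchors ".00" to the end of the string so it cannot eat digits out of the middle of a price):

```python
import re

def clean_price(raw):
    """'  $1,299.00 ' -> '1299': strip currency, separators, trailing .00."""
    s = raw.strip().replace("$", "").replace(",", "").replace(" ", "")
    return re.sub(r"\.00$", "", s)

print(clean_price("  $1,299.00 "))  # 1299
print(clean_price("$150.00"))       # 150
```

Inside the loop this becomes `item['price'] = clean_price(raw_price)`.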

A brief explanation:

Once the page has been downloaded, the parse method is called with a Response object (the HTML page). From that Response we have to extract/scrape the data in the form of items. In this case we want to return a list of product/price items, and this is where the magic of the yield expression comes into play. You can think of it as an on-demand return that does not finish the function's execution; a function that uses it is known as a generator. Scrapy will keep pulling from the parse generator until it has no more items to yield, and therefore no more items to scrape from the Response.
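The yield behaviour described above can be seen with a plain generator, no Scrapy needed:

```python
def parse_fake():
    """Stands in for a Scrapy parse callback: yields items on demand."""
    for name in ["body1", "body2"]:
        yield {"product": name}             # the primary item
        yield {"product": name + "+lens1"}  # then its subproduct item

# Scrapy drains the generator the same way list() does here:
items = list(parse_fake())
print(items)
```

Each call into the generator resumes the loop where it left off, which is why the primary item and its subproduct items come out interleaved in document order.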

Annotated code:

def parse(self, response):
    # Loop all the product elements, those div elements with a "listing-item" class
    for product in response.xpath('//div[@class="listing-item"]'):
        # Create an empty item container
        item = ArcherItemGeorges()
        # Scrape the primary product name and keep it in a variable for later use
        product_name = product.xpath('.//*[@class="product-shop"]/h5/a/text()').extract()[0].strip()
        # Fill the 'product' field with the product name
        item['product'] = product_name
        # Fill the 'price' field with the scraped primary product price
        item['price'] = product.xpath('.//*[@class="price-box"]/span/span[@class="price"]/text()').extract()[0].strip()
        # Yield the primary product item, the one with the primary name and price
        yield item
        # Now, for each product, we need to loop through all the subproducts
        for subproduct in product.xpath('.//*[@class="more-views"]/following-sibling::div'):
            # Prepare a fresh item (re-yielding a mutated item can corrupt
            # earlier results if a pipeline keeps a reference to it)
            item = ArcherItemGeorges()
            # The subproduct name is appended to the stored product_name,
            # that is, product + subproduct
            item['product'] = product_name + ' ' + subproduct.xpath('a/text()').extract()[0].strip()
            # And the price field gets the subproduct price
            item['price'] = subproduct.xpath('text()[2]').extract()[0].strip()
            # Now yield the composed product + subproduct item
            yield item

For python - EDIT: How do I create a "Nested Loop" that returns an item to the original list in Python and Scrapy, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26281914/
