python - Scrapy: custom callback not working

Reposted by: 太空宇宙 | Updated: 2023-11-03 16:20:14

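I have no idea why my spider isn't working! I'm by no means a programmer, so please be kind! Haha

Background: I am trying to use Scrapy to scrape information about the books listed on Indigo.

Problem: My code never enters any of my custom callbacks... it only seems to work when I use "parse" as the callback.

If I change the callback in the "rules" section of the code from "parse_books" to "parse", then my method that lists all the links works fine and prints every link I'm interested in. However, the callback set inside that method (pointing to "parse_books") is never called! Oddly, if I rename the "parse" method to something else (e.g. "testmethod") and then rename the "parse_books" method to "parse", the method that scrapes the information into items works just fine!

What I'm trying to achieve: All I want to do is enter a page, say "bestsellers", navigate to the item-level page of every item, and scrape all the book-related information. I seem to have both of these things working, but only independently :/

The code!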

import scrapy
import json
import urllib
from scrapy.http import Request
from urllib import urlencode
import re
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import urlparse



from TEST20160709.items import IndigoItem
from TEST20160709.items import SecondaryItem



item = IndigoItem()
scrapedItem = SecondaryItem()

class IndigoSpider(CrawlSpider):

    protocol='https://'
    name = "site"
    allowed_domains = [
    "chapters.indigo.ca/en-ca/Books",
    "chapters.indigo.ca/en-ca/Store/Availability/"
    ]

    start_urls = [
         'https://www.chapters.indigo.ca/en-ca/books/bestsellers/',
    ]

    #extractor = SgmlLinkExtractor()s

    rules = (
    Rule(LinkExtractor(), follow = True),
    Rule(LinkExtractor(), callback = "parse_books", follow = True),
    )



    def getInventory (self, bookID):
        params ={
       'pid' : bookID,
       'catalog' : 'books'
        }
        yield Request(
            url="https://www.chapters.indigo.ca/en-ca/Store/Availability/?" + urlencode(params),
            dont_filter = True,
            callback = self.parseInventory
        )



    def parseInventory(self,response):
        dataInventory = json.loads(response.body)

        for entry in dataInventory ['Data']:
            scrapedItem['storeID'] = entry['ID']
            scrapedItem['storeType'] = entry['StoreType']
            scrapedItem['storeName'] = entry['Name']
            scrapedItem['storeAddress'] = entry['Address']
            scrapedItem['storeCity'] = entry['City']
            scrapedItem['storePostalCode'] = entry['PostalCode']
            scrapedItem['storeProvince'] = entry['Province']
            scrapedItem['storePhone'] = entry['Phone']
            scrapedItem['storeQuantity'] = entry['QTY']
            scrapedItem['storeQuantityMessage'] = entry['QTYMsg']
            scrapedItem['storeHours'] = entry['StoreHours']
            scrapedItem['storeStockAvailibility'] = entry['HasRetailStock']
            scrapedItem['storeExclusivity'] = entry['InStoreExlusive']

            yield scrapedItem



    def parse (self, response):
        #GET ALL PAGE LINKS
        all_page_links = response.xpath('//ul/li/a/@href').extract()
        for relative_link in all_page_links:
            absolute_link = urlparse.urljoin(self.protocol+"www.chapters.indigo.ca",relative_link.strip())
            absolute_link = absolute_link.split("?ref=",1)[0]
            request = scrapy.Request(absolute_link, callback=self.parse_books)
            print "FULL link: "+absolute_link

            yield Request(absolute_link, callback=self.parse_books)





    def parse_books (self, response):

        for sel in response.xpath('//form[@id="aspnetForm"]/main[@id="main"]'):
            #XML/HTTP/CSS ITEMS
            item['title']= map(unicode.strip, sel.xpath('div[@class="content-wrapper"]/div[@class="product-details"]/div[@class="col-2"]/section[@id="ProductDetails"][@class][@role][@aria-labelledby]/h1[@id="product-title"][@class][@data-auto-id]/text()').extract())
            item['authors']= map(unicode.strip, sel.xpath('div[@class="content-wrapper"]/div[@class="product-details"]/div[@class="col-2"]/section[@id="ProductDetails"][@class][@role][@aria-labelledby]/h2[@class="major-contributor"]/a[contains(@class, "byLink")][@href]/text()').extract())
            item['productSpecs']= map(unicode.strip, sel.xpath('div[@class="content-wrapper"]/div[@class="product-details"]/div[@class="col-2"]/section[@id="ProductDetails"][@class][@role][@aria-labelledby]/p[@class="product-specs"]/text()').extract())
            item['instoreAvailability']= map(unicode.strip, sel.xpath('//span[@class="stockAvailable-mesg negative"][@data-auto-id]/text()').extract())
            item['onlinePrice']= map(unicode.strip, sel.xpath('//span[@id][@class="nonmemberprice__specialprice"]/text()').extract())
            item['listPrice']= map(unicode.strip, sel.xpath('//del/text()').extract())

            aboutBookTemp = map(unicode.strip, sel.xpath('//div[@class="read-more"]/p/text()').extract())
            item['aboutBook']= [aboutBookTemp]

            #Retrieve ISBN Identifier and extract numeric data
            ISBN_parse = map(unicode.strip, sel.xpath('(//div[@class="isbn-info"]/p[2])[1]/text()').extract())
            item['ISBN13']= [elem[11:] for elem in ISBN_parse]
            bookIdentifier = str(item['ISBN13'])
            bookIdentifier = re.sub("[^0-9]", "", bookIdentifier)


            print "THIS IS THE IDENTIFIER:" + bookIdentifier

            if bookIdentifier:
                yield self.getInventory(str(bookIdentifier))

            yield item

Best answer

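The first problem I notice is that your allowed_domains class attribute is broken. It should contain domains (hence the name), not URLs with paths.

The correct value in your case would be: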

allowed_domains = [
    "chapters.indigo.ca",  # subdomain.domain.top_level_domain
]
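
To illustrate why entries with a path can never match, here is a minimal sketch of a hostname-based check. It only approximates the idea of offsite filtering and is not Scrapy's actual implementation; the helper name `is_allowed` is hypothetical, and the sketch uses Python 3 and the standard library, whereas the code in the question is Python 2.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals, or is a subdomain of, an allowed domain.

    Hypothetical helper: it approximates a hostname-based offsite check
    and is not taken from Scrapy.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


url = "https://www.chapters.indigo.ca/en-ca/books/bestsellers/"

# Entries from the question: they contain paths, so no hostname can match them.
broken = [
    "chapters.indigo.ca/en-ca/Books",
    "chapters.indigo.ca/en-ca/Store/Availability/",
]
# Entry from the answer: a bare domain.
fixed = ["chapters.indigo.ca"]

broken_result = is_allowed(url, broken)
fixed_result = is_allowed(url, fixed)
print(broken_result, fixed_result)  # prints: False True
```

Only the host part of the URL is compared, so anything after the first slash in an allowed_domains entry guarantees a mismatch.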

If you check your spider's log, you will see:

DEBUG: Filtered offsite request to 'www.chapters.indigo.ca'

This should not happen.
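
One quick way to confirm the diagnosis is to scan the spider log for that message. The following is a rough sketch in Python 3: the function name `offsite_hosts` and the sample log lines are made up for illustration, and the regular expression is based only on the message text quoted above, so the prefix of a real log line may differ.

```python
import re

# Pattern based on the message quoted in the answer; any timestamp or
# logger prefix in a real log line is simply ignored by search().
OFFSITE_RE = re.compile(r"Filtered offsite request to '([^']+)'")


def offsite_hosts(log_lines):
    """Return the set of hosts that the log reports as filtered offsite."""
    hosts = set()
    for line in log_lines:
        match = OFFSITE_RE.search(line)
        if match:
            hosts.add(match.group(1))
    return hosts


# Sample lines: the first is the message from the answer, the second is
# an invented unrelated line.
sample_log = [
    "DEBUG: Filtered offsite request to 'www.chapters.indigo.ca'",
    "INFO: unrelated line, made up for this example",
]
hosts = offsite_hosts(sample_log)
print(hosts)
```

If the site you are trying to crawl shows up in the result, the allowed_domains setting is the likely culprit.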

This post on python - Scrapy: custom callback not working is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/38555128/
