python - Passing arguments to callback functions with Scrapy so they can be received later crashes


Reposted by: 太空宇宙. Updated: 2023-11-03 13:12:05

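I'm trying to get this spider to work. Each request component works when it is scraped on its own, but the spider crashes when I try to use Scrapy callback functions to receive the arguments later. The goal is to crawl multiple pages and scrape the data, while writing it to an output JSON file in the following format: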

Author | Album | Title | Lyrics

Each piece of data lives on a separate web page, which is why I'm trying to use Scrapy callback functions to collect it.
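As a rough illustration of the target output, one record in the JSON file might have the shape below. This is only a sketch: the values are invented placeholders, and a plain dict stands in for the Scrapy item.

```python
import json

# Hypothetical record; every value here is a placeholder, not scraped data.
record = {
    "author": "Example Artist",
    "album": "Example Album",
    "title": "Example Song",
    "lyrics": ["First line", "Second line"],
}

# Serialize one record as a single JSON line.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Each of the four fields comes from a different page, so the values have to travel along the chain of requests until the last callback can assemble the full record.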

In addition, each of the items above is defined in Scrapy's items.py as:

import scrapy

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    album = scrapy.Field()
    title = scrapy.Field()
    lyrics = scrapy.Field()

The spider code starts here:

import scrapy
import re
import json

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tutorial.items import TutorialItem


# urls class
class DomainSpider(scrapy.Spider):
    name = "domainspider"
    allowed_domains = ['www.domain.com']
    start_urls = [
        'http://www.domain.com',
    ]

    rules = (
        Rule(LinkExtractor(allow='www\.domain\.com/[A-Z][a-zA-Z_/]+$'), 
            'parse', follow=True,
        ),
    )

    # Parsing start here
    # crawling and scraping the links from menu list
    def parse(self, response):
        links = response.xpath('//html/body/nav[1]/div/ul/li/div/a/@href')

        for link in links:
            next_page_link = link.extract()
            if next_page_link:
                next_page = response.urljoin(next_page_link)
                yield scrapy.Request(next_page, callback=self.parse_artist_page)

    # crawling and scraping artist names and links
    def parse_artist_page(self, response):
        artist_links = response.xpath('//*/div[contains(@class, "artist-col")]/a/@href')
        author = response.xpath('//*/div[contains(@class, "artist-col")]/a/text()').extract()

        item = TutorialItem(author=author)

        for link in artist_links:
            next_page_link = link.extract()
            if next_page_link:
                next_page = response.urljoin(next_page_link)
                yield scrapy.Request(next_page, callback=self.parse_album_page)

                request.meta['author'] = item
                yield item
                return

    # crawling and scraping album names and links
    def parse_album_page(self, response):
        album_links = response.xpath('//*/div[contains(@id, "listAlbum")]/a/@href')
        album = response.xpath('//*/div[contains(@class, "album")]/b/text()').extract()

        item = TutorialItem(album=album)

        for link in album_links:
            next_page_link = link.extract()
            if next_page_link:
                next_page = response.urljoin(next_page_link)
                yield scrapy.Request(next_page, callback=self.parse_lyrics_page)

                request.meta['album'] = item
                yield item
                return

    # crawling and scraping titles and lyrics
    def parse_lyrics_page(self, response):
        title = response.xpath('//html/body/div[3]/div/div[2]/b/text()').extract()
        lyrics = map(unicode.strip, response.xpath('//html/body/div[3]/div/div[2]/div[6]/text()').extract())

        item = response.meta['author', 'album']
        item = TutorialItem(author=author, album=album, title=title, lyrics=lyrics)
        yield item

The code crashes when the callback function is called:

request.meta['author'] = item
yield item
return

Can anyone help?
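A note on the crash itself: in the spider above, `request` is never assigned before `request.meta[...]` is used, and `response.meta['author', 'album']` performs a single lookup with a tuple key. Both can be reproduced without Scrapy. The sketch below assumes a plain dict as a stand-in for `response.meta`, and the function names are hypothetical.

```python
# A plain dict stands in for response.meta (an assumption for illustration;
# no Scrapy objects are involved).
meta = {'author': ['Some Artist'], 'album': ['Some Album']}


def assign_before_create():
    """Mimic `request.meta['author'] = item` with no prior `request = ...`."""
    try:
        request.meta['author'] = 'x'  # `request` was never assigned
    except NameError as exc:
        return type(exc).__name__


def tuple_key_lookup(meta):
    """Mimic `response.meta['author', 'album']`."""
    try:
        return meta['author', 'album']  # one lookup with a tuple as the key
    except KeyError as exc:
        return 'KeyError: %r' % (exc.args[0],)


def per_key_lookup(meta):
    """Read each value under its own key."""
    return meta.get('author'), meta.get('album')


print(assign_before_create())
print(tuple_key_lookup(meta))
print(per_key_lookup(meta))
```

The first function reports a `NameError`, the second a `KeyError` for the tuple key, and the third returns both stored values.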

Best Answer

I did find the problem. With the callback functions set up the following way, it now works:

    # crawling and scraping artist names and links
    def parse_artist_page(self, response):
        artist_links = response.xpath('//*/div[contains(@class, "artist-col")]/a/@href')
        author = response.xpath('//*/div[contains(@class, "artist-col")]/a/text()').extract()

        for link in artist_links:
            next_page_link = link.extract()
            if next_page_link:
                next_page = response.urljoin(next_page_link)
                request = scrapy.Request(next_page, callback=self.parse_album_page)
                request.meta['author'] = author
                yield request  # yield rather than return, so every artist link is followed

    # crawling and scraping album names and links
    def parse_album_page(self, response):
        author = response.meta.get('author')

        album_links = response.xpath('//*/div[contains(@id, "listAlbum")]/a/@href')
        album = response.xpath('//*/div[contains(@class, "album")]/b/text()').extract()


        for link in album_links:
            next_page_link = link.extract()
            if next_page_link:
                next_page = response.urljoin(next_page_link)
                request = scrapy.Request(next_page, callback=self.parse_lyrics_page)
                request.meta['author'] = author
                request.meta['album'] = album
                yield request  # yield rather than return, so every album link is followed

    # crawling and scraping song titles and lyrics
    def parse_lyrics_page(self, response):
        author = response.meta.get('author')
        album = response.meta.get('album')

        title = response.xpath('//html/body/div[3]/div/div[2]/b/text()').extract()
        # list comprehension instead of map(unicode.strip, ...), which only runs on Python 2
        lyrics = [line.strip() for line in response.xpath('//html/body/div[3]/div/div[2]/div[6]/text()').extract()]

        item = TutorialItem(author=author, album=album, title=title, lyrics=lyrics)
        yield item

Regarding "python - Passing arguments to callback functions with Scrapy so they can be received later crashes", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41025060/
