python - Scraping multiple pages with scrapy

I am trying to use scrapy to scrape a website that spreads its information over multiple pages.

My code is:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from tcgplayer1.items import Tcgplayer1Item


class MySpider(BaseSpider):
    name = "tcg"
    allowed_domains = ["http://www.tcgplayer.com/"]
    start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[@class='magicCard']")
        for title in titles:
            item = Tcgplayer1Item()
            item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]

            vendor = title.xpath(".//tr[@class='vendor ']")
            item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
            item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
            item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
            item["condition"] = vendor.xpath("normalize-space(.//td[@class='condition']/a/text())").extract()
            item["vendors"] = vendor.xpath("normalize-space(.//td[@class='seller']/a/text())").extract()
            yield item

I am trying to scrape all the pages until it reaches the end... some listings have more pages than others, so it is hard to say exactly where the page numbers end.

Best Answer

The idea is to increment pageNumber until no titles are found. If there are no titles on the page, raise a CloseSpider exception to stop the spider:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from tcgplayer1.items import Tcgplayer1Item


URL = "http://store.tcgplayer.com/magic/journey-into-nyx?pageNumber=%d"


class MySpider(BaseSpider):
    name = "tcg"
    allowed_domains = ["tcgplayer.com"]
    start_urls = [URL % 1]

    def __init__(self):
        self.page_number = 1

    def parse(self, response):
        # Debug output showing which page is being parsed
        # (Python 2 print syntax, matching the BaseSpider-era Scrapy used here)
        print self.page_number
        print "----------"

        sel = Selector(response)
        titles = sel.xpath("//div[@class='magicCard']")
        if not titles:
            # An empty page means we have run past the last page: stop the crawl
            raise CloseSpider('No more pages')

        for title in titles:
            item = Tcgplayer1Item()
            item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]

            vendor = title.xpath(".//tr[@class='vendor ']")
            item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
            item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
            item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
            item["condition"] = vendor.xpath("normalize-space(.//td[@class='condition']/a/text())").extract()
            item["vendors"] = vendor.xpath("normalize-space(.//td[@class='seller']/a/text())").extract()
            yield item

        # Schedule the next page; parse() is the default callback
        self.page_number += 1
        yield Request(URL % self.page_number)

This particular spider will go through all 8 pages of the data, then stop.

Hope that helps.
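
For readers on a current Scrapy release: BaseSpider and manually wrapping the response in Selector are long deprecated. Below is a minimal sketch of the same increment-until-empty idea written against the modern scrapy.Spider API (response.xpath, response.follow). The spider name TcgSpider is made up for illustration, the XPath selectors are copied verbatim from the answer above and may no longer match the live site, and only two of the item fields are shown for brevity:

import scrapy


class TcgSpider(scrapy.Spider):
    name = "tcg_modern"
    allowed_domains = ["tcgplayer.com"]
    start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?pageNumber=1"]

    def parse(self, response):
        titles = response.xpath("//div[@class='magicCard']")
        if not titles:
            # Past the last page: stop yielding requests and the crawl
            # simply ends on its own (no CloseSpider needed)
            return

        for title in titles:
            vendor = title.xpath(".//tr[@class='vendor ']")
            # Plain dicts work as items; the remaining fields from the
            # original answer are omitted for brevity
            yield {
                "cardname": title.xpath(".//li[@class='cardName']/a/text()").get(),
                "price": vendor.xpath("normalize-space(.//td[@class='price']/text())").get(),
            }

        # Build the next page URL by bumping pageNumber in the current URL
        # (assumes pageNumber is the last query parameter, as in the URL above)
        page = int(response.url.split("pageNumber=")[-1])
        next_url = response.url.replace(
            "pageNumber=%d" % page, "pageNumber=%d" % (page + 1)
        )
        yield response.follow(next_url, callback=self.parse)

Since each page only schedules its successor, simply returning when a page comes back empty ends the crawl without needing CloseSpider.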

This "python - Scraping multiple pages with scrapy" question and answer are based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/23897669/
