
python - How to crawl in a desired order, or synchronously, in Scrapy?


Question

I'm trying to create a spider that crawls and scrapes every product from a store and outputs the results to a JSON file. That means entering each category on the main page and scraping every product (name and price only); each category page uses infinite scroll.

My problem is that every time I issue a request after scraping the first page of one category, instead of getting the next batch of items from that same category I get items from the next category, so the output ends up a mess.

What I've already tried

I've already tried messing with the settings, forcing concurrent requests down to one, and setting a different priority for each request.
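
In Scrapy that attempt would look roughly like this (a minimal sketch; setting the values per spider via custom_settings, and the values themselves, are my assumptions):

import scrapy

class PccomSpider(scrapy.Spider):
    # Assumed settings for the "one request at a time" attempt.
    # Downloads happen one by one, but the scheduler hands out the next
    # queued request before the current callback has yielded the
    # category's next-page request, so categories still interleave.
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    }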

I found out about asynchronous crawling, but I don't know how to make the requests go out in order.

Code

import scrapy
from scrapper_pccom.items import ScrapperPccomItem

class PccomSpider(scrapy.Spider):
    name = 'pccom'
    allowed_domains = ['pccomponentes.com']
    start_urls = ['https://www.pccomponentes.com/componentes']

    # Scrapes links for every category from the main page
    def parse(self, response):
        categories = response.xpath('//a[contains(@class,"enlace-secundario")]/@href')
        prio = 20
        for category in categories:
            url = response.urljoin(category.extract())
            yield scrapy.Request(url, self.parse_item_list, priority=prio, cb_kwargs={'prio': prio})
            prio = prio - 1

    # Scrapes products from every page of each category
    def parse_item_list(self, response, prio):
        products = response.xpath('//article[contains(@class,"tarjeta-articulo")]')
        for product in products:
            item = ScrapperPccomItem()
            item['name'] = product.xpath('@data-name').extract()
            item['price'] = product.xpath('@data-price').extract()
            yield item

        # URL of the next page
        next_page = response.xpath('//div[@id="pager"]//li[contains(@class,"c-paginator__next")]//a/@href').extract_first()
        if next_page:
            next_url = response.urljoin(next_page)
            yield scrapy.Request(next_url, self.parse_item_list, priority=prio, cb_kwargs={'prio': prio})

Output vs. expected

What it does: Category 1 page 1 > Category 2 page 1 > Category 3 page 1 > ...

What I want it to do: Cat 1 page 1 > Cat 1 page 2 > Cat 1 page 3 > ... > Cat 2 page 1

Best answer

This is simple:

Build a list of all the categories in all_categories. Then, instead of crawling every link at once, crawl only the first category link; once all the pages of that category have been scraped, send a request for the next category link.

Here is the code. I haven't run it, so there may be some syntax errors, but the logic is what you need:

import scrapy
from scrapper_pccom.items import ScrapperPccomItem

class PccomSpider(scrapy.Spider):
    name = 'pccom'
    allowed_domains = ['pccomponentes.com']
    start_urls = ['https://www.pccomponentes.com/componentes']

    all_categories = []

    def yield_category(self):
        # Pop from the front so categories are crawled in their original order
        if self.all_categories:
            url = self.all_categories.pop(0)
            print("Scraping category %s " % (url))
            return scrapy.Request(url, self.parse_item_list)
        else:
            # Returning None here is fine: Scrapy ignores a yielded None
            print("all done")

    # Scrapes links for every category from the main page
    def parse(self, response):
        categories = response.xpath('//a[contains(@class,"enlace-secundario")]/@href')
        self.all_categories = [response.urljoin(category.extract()) for category in categories]
        yield self.yield_category()

    # Scrapes products from every page of each category
    def parse_item_list(self, response):
        products = response.xpath('//article[contains(@class,"tarjeta-articulo")]')
        for product in products:
            item = ScrapperPccomItem()
            item['name'] = product.xpath('@data-name').extract()
            item['price'] = product.xpath('@data-price').extract()
            yield item

        # URL of the next page
        next_page = response.xpath('//div[@id="pager"]//li[contains(@class,"c-paginator__next")]//a/@href').extract_first()
        if next_page:
            next_url = response.urljoin(next_page)
            yield scrapy.Request(next_url, self.parse_item_list)
        else:
            print("All pages of this category scraped, now scraping next category")
            yield self.yield_category()
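
With this structure, every page of a category is scraped before the request for the next category is even created, which serializes the crawl regardless of concurrency settings. To get the JSON file mentioned in the question, the spider can be run the usual way (the output filename is just an example):

scrapy crawl pccom -o products.json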

Regarding "python - How to crawl in a desired order, or synchronously, in Scrapy?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57808974/
