
python - How to parse multiple child pages, merge/append, and pass back up to the parent?


This is my first Scrapy project, and admittedly one of my first exercises with Python as well. I'm looking for a way to scrape multiple child pages, merge/append their content into a single value, and pass that data back/up to the original parent page. The number of child pages per parent is variable; it can be as low as 1, but will never be 0 (maybe that helps with error handling?). Additionally, child pages can repeat and reappear, since they are not unique to a single parent. I've managed to pass the parent page's metadata down to the corresponding child pages, but I'm struggling to accomplish the reverse.

Here is an example page structure:

Top Level Domain
- Pagination/Index Page #1 (parse recipe links)
    - Recipe #1 (select info & parse ingredient links)
        - Ingredient #1 (select info)
        - Ingredient #2 (select info)
        - Ingredient #3 (select info)
    - Recipe #2
        - Ingredient #1
    - Recipe #3
        - Ingredient #1
        - Ingredient #2
- Pagination/Index Page #2
    - Recipe #N
        - Ingredient #N
        - ...
- Pagination/Index Page #3
- ... continued

The output I'm looking for (per recipe) looks like this:

{
    "recipe_title": "Gin & Tonic",
    "recipe_posted_date": "May 2, 2019",
    "recipe_url": "www.XYZ.com/gandt.html",
    "recipe_instructions": "<block of text here>",
    "recipe_ingredients": ["gin", "tonic water", "lime wedge"],
    "recipe_calorie_total": "135 calories",
    "recipe_calorie_list": ["60 calories", "70 calories", "5 calories"]
}

I'm pulling each ingredient's URL from the corresponding recipe page. I need to extract the calorie count from each ingredient page, merge it with the calorie counts of the other ingredients, and ideally yield a single item. Since an individual ingredient is not exclusive to a single recipe, I need to be able to revisit ingredient pages later in the crawl.

(Note: this isn't a real example, since calorie counts obviously vary with the amount a recipe calls for.)

The code I've posted gets me close to what I'm after, but I have to imagine there is a more elegant way to approach the problem. The posted code successfully passes a recipe's metadata down to the ingredient level, loops over the ingredients, and appends the calorie counts. Because the information is passed downward, I end up yielding at the ingredient level and creating many duplicate recipe items (one per ingredient) until I loop through the last ingredient. At this stage my plan is to add an ingredient index number so that I can somehow keep only the record with the largest ingredient index for each recipe URL, as sketched just below. Before I get to that point, though, I figured I'd ask the pros here for guidance.
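A minimal sketch of that dedup idea as post-crawl processing; the ingredient_index field is hypothetical here, since the spider below does not yet emit it:

# Hypothetical post-processing: for each recipe URL, keep only the duplicate
# item with the highest ingredient_index, i.e. the most complete one.
# 'items' stands for the list of scraped (duplicated) recipe items.
best = {}
for item in items:
    url = item['Recipe_url']
    if url not in best or item['ingredient_index'] > best[url]['ingredient_index']:
        best[url] = item
deduplicated = list(best.values())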

Spider code:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from recipe_scraper.items import RecipeItem

class RecipeSpider(CrawlSpider):
    name = 'Recipe'
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com/recipes/']
    rules = (
        Rule(
            LinkExtractor(
                allow=(),
                restrict_css=('.pagination'),
                unique=True,
            ),
            callback='parse_index_page',
            follow=True,
        ),
    )

    def parse_index_page(self, response):
        print('Processing Index Page.. ' + response.url)
        index_url = response.url
        recipe_urls = response.css('.recipe > a::attr(href)').getall()
        for a in recipe_urls:
            request = scrapy.Request(a, callback=self.parse_recipe_page)
            request.meta['index_url'] = index_url
            yield request

    def parse_recipe_page(self, response):
        print('Processing Recipe Page.. ' + response.url)
        Recipe_url = response.url
        Recipe_title = response.css('.Recipe_title::text').extract()[0]
        Recipe_posted_date = response.css('.Recipe_posted_date::text').extract()[0]
        Recipe_instructions = response.css('.Recipe_instructions::text').extract()[0]
        Recipe_ingredients = response.xpath('//ul[@class="ingredients"]//li[@class="ingredient"]/a/text()').getall()
        Recipe_ingredient_urls = response.xpath('//ul[@class="ingredients"]//li[@class="ingredient"]/a/@href').getall()
        Recipe_calorie_list_append = []
        Recipe_calorie_list = []
        Recipe_calorie_total = []
        Recipe_item = RecipeItem()
        Recipe_item['index_url'] = response.meta["index_url"]
        Recipe_item['Recipe_url'] = Recipe_url
        Recipe_item['Recipe_title'] = Recipe_title
        Recipe_item['Recipe_posted_date'] = Recipe_posted_date
        Recipe_item['Recipe_instructions'] = Recipe_instructions
        Recipe_item['Recipe_ingredients'] = Recipe_ingredients
        Recipe_item['Recipe_ingredient_urls'] = Recipe_ingredient_urls
        Recipe_item['Recipe_ingredient_url_count'] = len(Recipe_ingredient_urls)
        Recipe_calorie_list.clear()
        Recipe_ingredient_url_index = 0
        while Recipe_ingredient_url_index < len(Recipe_ingredient_urls):
            ingredient_request = scrapy.Request(Recipe_ingredient_urls[Recipe_ingredient_url_index], callback=self.parse_ingredient_page, dont_filter=True)
            ingredient_request.meta['Recipe_item'] = Recipe_item
            ingredient_request.meta['Recipe_calorie_list'] = Recipe_calorie_list
            yield ingredient_request
            Recipe_calorie_list_append.append(Recipe_calorie_list)
            Recipe_ingredient_url_index += 1

    def parse_ingredient_page(self, response):
        print('Processing Ingredient Page.. ' + response.url)
        Recipe_item = response.meta['Recipe_item']
        Recipe_calorie_list = response.meta["Recipe_calorie_list"]
        ingredient_url = response.url
        ingredient_calorie_total = response.css('div.calorie::text').getall()
        Recipe_calorie_list.append(ingredient_calorie_total)
        Recipe_item['Recipe_calorie_list'] = Recipe_calorie_list
        yield Recipe_item
        Recipe_calorie_list.clear()

And indeed, my less-than-ideal output looks like this (note the calorie lists):

{
    "recipe_title": "Gin & Tonic",
    "recipe_posted_date": "May 2, 2019",
    "recipe_url": "www.XYZ.com/gandt.html",
    "recipe_instructions": "<block of text here>",
    "recipe_ingredients": ["gin", "tonic water", "lime wedge"],
    "recipe_calorie_total": [],
    "recipe_calorie_list": ["60 calories"]
},
{
    "recipe_title": "Gin & Tonic",
    "recipe_posted_date": "May 2, 2019",
    "recipe_url": "www.XYZ.com/gandt.html",
    "recipe_instructions": "<block of text here>",
    "recipe_ingredients": ["gin", "tonic water", "lime wedge"],
    "recipe_calorie_total": [],
    "recipe_calorie_list": ["60 calories", "70 calories"]
},
{
    "recipe_title": "Gin & Tonic",
    "recipe_posted_date": "May 2, 2019",
    "recipe_url": "www.XYZ.com/gandt.html",
    "recipe_instructions": "<block of text here>",
    "recipe_ingredients": ["gin", "tonic water", "lime wedge"],
    "recipe_calorie_total": [],
    "recipe_calorie_list": ["60 calories", "70 calories", "5 calories"]
}

Best Answer

One solution is to scrape recipes and ingredients separately, as different item types, and then do some post-processing after the crawl finishes, e.g. with regular Python, merging the recipe and ingredient data as needed. This is the most efficient solution.
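A minimal sketch of that post-processing step, assuming both crawls were exported as JSON-lines files; the file names (recipes.jl, ingredients.jl) and the ingredient_url / calories field names are illustrative assumptions, not part of the original code:

import json

# Load both exports (assumed JSON-lines files, one item per line).
with open('recipes.jl') as f:
    recipes = [json.loads(line) for line in f]
with open('ingredients.jl') as f:
    ingredients = [json.loads(line) for line in f]

# Index ingredient items by URL: each shared ingredient is scraped and
# stored once, no matter how many recipes reference it.
calories_by_url = {i['ingredient_url']: i['calories'] for i in ingredients}

# Merge: attach one calorie entry per ingredient URL to each recipe.
for recipe in recipes:
    recipe['recipe_calorie_list'] = [
        calories_by_url[url] for url in recipe['recipe_ingredient_urls']
    ]

with open('recipes_merged.json', 'w') as f:
    json.dump(recipes, f, indent=2)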

Alternatively, instead of yielding requests for all ingredients at once, you can extract the ingredient URLs from the recipe response, yield a request for the first ingredient only, and store the remaining ingredient URLs in the new request's meta, along with the recipe item. When the ingredient response arrives, you parse all the needed information into the recipe, then yield a new request for the next ingredient URL from meta. When no ingredient URLs are left, you yield the completed recipe item.

For example:

from scrapy import Request

def _handle_next_ingredient(self, recipe, ingredient_urls):
    try:
        return Request(
            ingredient_urls.pop(),
            callback=self.parse_ingredient,
            meta={'recipe': recipe, 'ingredient_urls': ingredient_urls},
        )
    except IndexError:
        return recipe

def parse_recipe(self, response):
    recipe = {}
    ingredient_urls = []
    # [Extract needed data into recipe and ingredient URLs into ingredient_urls]
    yield self._handle_next_ingredient(recipe, ingredient_urls)

def parse_ingredient(self, response):
    recipe = response.meta['recipe']
    # [Extend recipe with the information of this ingredient]
    yield self._handle_next_ingredient(recipe, response.meta['ingredient_urls'])

Note, however, that if two or more recipes can share the same ingredient URL, you will have to add dont_filter=True to your requests, repeating requests for the same ingredients multiple times. If ingredient URLs are not recipe-specific, seriously consider the first suggestion instead.
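For that case, a sketch of the only change needed in the snippet above: constructing the Request with dont_filter=True so Scrapy's duplicate filter does not drop repeated ingredient URLs:

return Request(
    ingredient_urls.pop(),
    callback=self.parse_ingredient,
    meta={'recipe': recipe, 'ingredient_urls': ingredient_urls},
    # Bypass Scrapy's dupefilter: the same ingredient URL may need to be
    # fetched once per recipe that uses it.
    dont_filter=True,
)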

Regarding "python - How to parse multiple child pages, merge/append, and pass back up to the parent?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55960550/
