
python - Items vs Item Loaders in Scrapy

Reposted. Author: 太空狗. Updated: 2023-10-29 17:43:53

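I'm new to Scrapy. I know that Items are used to hold scraped data, but I can't understand the difference between Items and Item Loaders. I tried reading some example code, and it uses Item Loaders rather than Items to store the data, and I don't see why. The Scrapy documentation isn't clear enough for me. Can anyone give a simple explanation (ideally with an example) of when to use Item Loaders and what extra facilities they provide over Items?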

Best answer

I really like the official explanation in the docs:

Item Loaders provide a convenient mechanism for populating scraped Items. Even though Items can be populated using their own dictionary-like API, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.

In other words, Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container.

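The last paragraph should answer your question.
Item Loaders are great because they give you many processing shortcuts and let you reuse a lot of code, which keeps everything tidy, clean and easy to understand.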

Compare with an example. Suppose we want to scrape this item:

from scrapy import Item, Field

class MyItem(Item):
    full_name = Field()
    bio = Field()
    age = Field()
    weight = Field()
    height = Field()

The Items-only approach would look like this:

def parse(self, response):
    item = MyItem()  # the container to fill
    full_name = response.xpath("//div[contains(@class,'name')]/text()").extract()
    # e.g. returns an ugly ['John\n', '\n\t ', ' Snow']
    item['full_name'] = ' '.join(i.strip() for i in full_name if i.strip())
    bio = response.xpath("//div[contains(@class,'bio')]/text()").extract()
    item['bio'] = ' '.join(i.strip() for i in bio if i.strip())
    # extract_first(0) falls back to 0 when the node is missing
    age = response.xpath("//div[@class='age']/text()").extract_first(0)
    item['age'] = int(age)
    weight = response.xpath("//div[@class='weight']/text()").extract_first(0)
    item['weight'] = int(weight)
    height = response.xpath("//div[@class='height']/text()").extract_first(0)
    item['height'] = int(height)
    return item

Compare that with the Item Loaders approach:

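# define once in items.py
# Note: newer Scrapy releases also expose these processors through the
# separate itemloaders package (itemloaders.processors).
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Compose, MapCompose, Join, TakeFirst

# strip every fragment, drop the ones that end up empty, then join with spaces
clean_text = Compose(MapCompose(lambda v: v.strip() or None), Join())
# take the first extracted value and convert it to int
to_int = Compose(TakeFirst(), int)

class MyItemLoader(ItemLoader):
    default_item_class = MyItem
    full_name_out = clean_text
    bio_out = clean_text
    age_out = to_int
    weight_out = to_int
    height_out = to_int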

# parse as many different places and times as you want
def parse(self, response):
    loader = MyItemLoader(response=response)
    loader.add_xpath('full_name', "//div[contains(@class,'name')]/text()")
    loader.add_xpath('bio', "//div[contains(@class,'bio')]/text()")
    loader.add_xpath('age', "//div[@class='age']/text()")
    loader.add_xpath('weight', "//div[@class='weight']/text()")
    loader.add_xpath('height', "//div[@class='height']/text()")
    return loader.load_item()

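As you can see, the Item Loader version is much cleaner and easier to extend. Suppose you have 20 or more fields, many of which share the same processing logic: doing that without Item Loaders would be suicide. Item Loaders are awesome and you should use them!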
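
To make it concrete what the two output processors above (clean_text and to_int) do to the raw extracted lists, here is a plain-Python sketch that imitates them without importing Scrapy. The helper names map_compose, compose, join and take_first are hypothetical stand-ins written for this illustration; they only approximate the behaviour of MapCompose, Compose, Join and TakeFirst and are not Scrapy's own implementations. The sketch also drops whitespace-only fragments (the "or None" part) so that the joined text has single spaces.

```python
def map_compose(*funcs):
    """Apply each function to every value in turn, dropping None results."""
    def process(values):
        for func in funcs:
            values = [out for out in (func(v) for v in values) if out is not None]
        return values
    return process


def compose(*funcs):
    """Pipe the whole value through each function, stopping early on None."""
    def process(value):
        for func in funcs:
            if value is None:
                break
            value = func(value)
        return value
    return process


def join(separator=" "):
    """Return a function that joins a list of strings with the separator."""
    return lambda values: separator.join(values)


def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Stand-ins for the two processors shared by the loader's fields.
clean_text = compose(map_compose(lambda v: v.strip() or None), join())
to_int = compose(take_first, int)

print(clean_text(["John\n", "\n\t ", " Snow"]))  # John Snow
print(to_int(["29", "30"]))  # 29
```

The point of the design is that each field only names the pipeline it needs, so adding a twenty-first field costs one line in the loader rather than another copy of the cleaning code in every parse method.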

For more on python - Items vs Item Loaders in Scrapy, see the similar question on Stack Overflow: https://stackoverflow.com/questions/39127256/
