
python - Scrapy: yield items as sub-items in JSON

Reposted. Author: 太空宇宙. Updated: 2023-11-04 10:02:53

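How can I tell Scrapy to split all yielded items into two lists? For example, say I have two main types of items: article and author. I want them in two separate lists. Right now the output JSON I get is: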

[
    {
        "article_title": "foo",
        "article_published": "1.1.1972",
        "author": "John Doe"
    },
    {
        "name": "John Doe",
        "age": 42,
        "email": "foo@example.com"
    }
]

How can I turn it into something like this?

{
    "articles": [
        {
            "article_title": "foo",
            "article_published": "1.1.1972",
            "author": "John Doe"
        }
    ],
    "authors": [
        {
            "name": "John Doe",
            "age": 42,
            "email": "foo@example.com"
        }
    ]
}

The functions that yield these are very simple, along these lines:

    def parse_author(self, response):
        name = response.css('div.author-info a::text').extract_first()
        print("Parsing author: {}".format(name))

        yield {
            'author_name': name
        }

Best answer

Each item reaches the pipeline separately, so with this setup you can handle each one according to its type:

items.py

    import scrapy

    class Article(scrapy.Item):
        title = scrapy.Field()
        published = scrapy.Field()
        author = scrapy.Field()

    class Author(scrapy.Item):
        name = scrapy.Field()
        age = scrapy.Field()

spider.py

    # `items` refers to the project's items.py module, imported elsewhere in the spider.
    def parse(self, response):
        author = items.Author()
        author['name'] = response.css('div.author-info a::text').extract_first()
        print("Parsing author: {}".format(author['name']))
        yield author

        article = items.Article()
        # 'article css' is a placeholder for the real article selector.
        article['title'] = response.css('article css').extract_first()
        print("Parsing article: {}".format(article['title']))
        yield article

pipeline.py

    def process_item(self, item, spider):
        if isinstance(item, items.Author):
            pass  # Do something to authors
        elif isinstance(item, items.Article):
            pass  # Do something to articles
        return item  # pass the item on to any later pipeline
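
To get the two-list JSON from the question, the "do something" branches could append to per-type lists and write them out once the spider closes. This is only a minimal sketch, not part of the original answer: the class name `GroupedJsonPipeline` and the `output.json` path are made up for illustration, and `Article` and `Author` are plain `dict` subclasses standing in for the `scrapy.Item` classes from items.py so the snippet runs on its own.

```python
import json


class Article(dict):
    """Stand-in for the scrapy.Item subclass defined in items.py."""


class Author(dict):
    """Stand-in for the scrapy.Item subclass defined in items.py."""


class GroupedJsonPipeline:
    """Hypothetical pipeline: group items by type, write one JSON object."""

    def __init__(self, path="output.json"):
        # Output path is an assumption; adjust as needed.
        self.path = path

    def open_spider(self, spider):
        # Start with empty lists each time a spider run begins.
        self.groups = {"articles": [], "authors": []}

    def process_item(self, item, spider):
        # Route each item to its list based on its class.
        if isinstance(item, Author):
            self.groups["authors"].append(dict(item))
        elif isinstance(item, Article):
            self.groups["articles"].append(dict(item))
        return item

    def close_spider(self, spider):
        # Write both lists as a single JSON object when the crawl ends.
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.groups, f, ensure_ascii=False, indent=2)
```

A pipeline like this would have to be enabled through the `ITEM_PIPELINES` setting in settings.py. Since it writes the file itself, the regular feed export (for example `-o items.json`) would still produce the flat list alongside it.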

That said, I would suggest this schema instead:

[{
    "title": "foo",
    "published": "1.1.1972",
    "authors": [
        {
            "name": "John Doe",
            "age": 42,
            "email": "foo@example.com"
        },
        {
            "name": "Jane Doe",
            "age": 21,
            "email": "bar@example.com"
        }
    ]
}]

This makes it all a single item.

items.py

    import scrapy

    class Article(scrapy.Item):
        title = scrapy.Field()
        published = scrapy.Field()
        authors = scrapy.Field()

spider.py

    def parse(self, response):
        # Values are hard-coded here for illustration; a real spider would
        # extract them from `response`.
        authors = []
        author = {}
        author['name'] = "John Doe"
        author['age'] = 42
        author['email'] = "foo@example.com"
        print("Parsing author: {}".format(author['name']))
        authors.append(author)

        article = items.Article()
        article['title'] = "foo"
        article['published'] = "1.1.1972"
        print("Parsing article: {}".format(article['title']))
        article['authors'] = authors
        yield article
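
If articles and authors have already been exported as one flat list, as in the question, the nested schema can also be built afterwards. The helper below is a rough sketch under assumptions that are not in the original answer: the function name `nest_authors` is made up, records are told apart by the `name` and `article_title` keys from the question's output, and each article is matched to its author by name.

```python
def nest_authors(records):
    """Attach the full author record to each article, matched by name."""
    # Assumption: any record with a "name" key is an author.
    authors = {r["name"]: r for r in records if "name" in r}

    articles = []
    for r in records:
        # Assumption: any record with an "article_title" key is an article.
        if "article_title" not in r:
            continue
        author_name = r.get("author")
        # Fall back to a name-only entry when no author record matches.
        author = authors.get(author_name, {"name": author_name})
        articles.append({
            "title": r["article_title"],
            "published": r["article_published"],
            "authors": [author],
        })
    return articles


flat = [
    {"article_title": "foo", "article_published": "1.1.1972", "author": "John Doe"},
    {"name": "John Doe", "age": 42, "email": "foo@example.com"},
]
nested = nest_authors(flat)
```

Matching by name only works while author names are unique. A stable identifier such as the author page URL would be a safer key if the site provides one.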

Regarding "python - Scrapy: yield items as sub-items in JSON", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42610814/
