python - Scrapy yield items as sub-items in JSON

Reposted by: 太空宇宙 · Updated: 2023-11-04 10:02:53


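How can I tell Scrapy to split all yielded items into two lists? For example, say I have two main types of items: article and author. I want to have them in two separate lists. Right now I am getting this output JSON: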

[
  {
    "article_title":"foo",
    "article_published":"1.1.1972",
    "author": "John Doe"
  },
  {
    "name": "John Doe",
    "age": 42,
    "email": "foo@example.com"
  }
]

How can I convert it into something like this?

{
  "articles": [
    {
      "article_title": "foo",
      "article_published": "1.1.1972",
      "author": "John Doe"
    }
  ],
  "authors": [
    {
      "name": "John Doe",
      "age": 42,
      "email": "foo@example.com"
    }
  ]
}

My functions that yield these are simple, something like this:

def parse_author(self, response):
    name = response.css('div.author-info a::text').extract_first()
    print("Parsing author: {}".format(name))

    yield {
        'author_name': name
    }

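Best answer

Each item arrives at the pipeline separately, and with this setup the pipeline can handle each one according to its type: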

items.py

import scrapy

class Article(scrapy.Item):
    title = scrapy.Field()
    published = scrapy.Field()
    author = scrapy.Field()

class Author(scrapy.Item):
    name = scrapy.Field()
    age = scrapy.Field()

spider.py

# Method of the spider class; `items` is the project's items module.
def parse(self, response):
    author = items.Author()
    author['name'] = response.css('div.author-info a::text').extract_first()
    print("Parsing author: {}".format(author['name']))
    yield author

    article = items.Article()
    # 'article css' looks like a placeholder; substitute the real title selector.
    article['title'] = response.css('article css').extract_first()
    print("Parsing article: {}".format(article['title']))
    yield article

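pipeline.py

# Method of the pipeline class.
def process_item(self, item, spider):
    if isinstance(item, items.Author):
        pass  # Do something to authors
    elif isinstance(item, items.Article):
        pass  # Do something to articles
    return item  # hand the item on to any later pipeline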
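To get the exact output the question asks for, the "do something" branches could collect items into two lists and write them out once the spider finishes. The following is only a minimal sketch under some assumptions: the class name `GroupedJsonPipeline` and the output path `grouped.json` are made up for illustration, and the items are written as plain dataclasses (which recent Scrapy versions accept as items) so that the example runs without Scrapy installed. With `scrapy.Item` subclasses, `dict(item)` would take the place of `asdict(item)`.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Article:
    title: str
    published: str
    author: str


@dataclass
class Author:
    name: str
    age: int
    email: str


class GroupedJsonPipeline:
    """Collects items by type and writes one JSON object when the spider closes."""

    def __init__(self, path="grouped.json"):  # hypothetical output path
        self.path = path
        self.groups = {"articles": [], "authors": []}

    def process_item(self, item, spider):
        # Route each item to its list based on its type.
        if isinstance(item, Article):
            self.groups["articles"].append(asdict(item))
        elif isinstance(item, Author):
            self.groups["authors"].append(asdict(item))
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once at the end of the crawl: dump both lists together.
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.groups, f, indent=2)
```

A pipeline like this would still need to be enabled through the `ITEM_PIPELINES` setting, for example under a module path such as `myproject.pipelines.GroupedJsonPipeline` (again a hypothetical name). Everything is held in memory until the crawl ends, which is fine for small crawls but may not suit very large ones.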

I would suggest using this structure instead:

[{
    "title": "foo",
    "published": "1.1.1972",
    "authors": [
        {
            "name": "John Doe",
            "age": 42,
            "email": "foo@example.com"
        },
        {
            "name": "Jane Doe",
            "age": 21,
            "email": "bar@example.com"
        }
    ]
}]

This makes it a single item.

items.py

import scrapy

class Article(scrapy.Item):
    title = scrapy.Field()
    published = scrapy.Field()
    authors = scrapy.Field()

spider.py

def parse(self, response):

    authors = []
    author = {}
    author['name'] = "John Doe"
    author['age'] = 42
    author['email'] = "foo@example.com"
    print("Parsing author: {}".format(author['name']))
    authors.append(author)

    article = items.Article()
    article['title'] = "foo"
    article['published'] = "1.1.1972"
    print("Parsing article: {}".format(article['title']))
    article['authors'] = authors
    yield article
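If the articles and authors have already been scraped as separate flat records, as in the question's first JSON output, they can also be folded into this nested shape afterwards. This is a rough sketch, not part of the original answer: the helper name `nest_authors` is hypothetical, and it assumes each article's "author" value matches an author record's "name" exactly.

```python
def nest_authors(records):
    """Attach author records to the articles that reference them by name."""
    # Index author records by name; author records are the ones with a "name" key.
    authors = {r["name"]: r for r in records if "name" in r}
    nested = []
    for r in records:
        if "article_title" not in r:
            continue  # skip anything that is not an article record
        nested.append({
            "title": r["article_title"],
            "published": r["article_published"],
            # Unknown authors are kept as a bare name so nothing is dropped.
            "authors": [authors.get(r["author"], {"name": r["author"]})],
        })
    return nested


flat = [
    {"article_title": "foo", "article_published": "1.1.1972", "author": "John Doe"},
    {"name": "John Doe", "age": 42, "email": "foo@example.com"},
]
nested = nest_authors(flat)
```

Matching on the name alone is fragile when two authors share a name, so a stable key such as the author page URL would probably be a safer join field if one is available.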

This post, "python - Scrapy yield items as sub-items in JSON", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/42610814/
