
apache-beam - Scio: groupByKey doesn't work when using Pub/Sub as collection source

Reposted · Author: 行者123 · Updated: 2023-12-04 02:52:10

I changed the source of the WindowedWordCount example from a text file to Cloud Pub/Sub, as shown below. I published the Shakespeare file's data to Pub/Sub, and it is ingested correctly, but none of the transforms after .groupByKey seem to run.

sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(windowSize) // apply windowing logic
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
  .withWindow[IntervalWindow]
  .swap
  .groupByKey
  .map { s =>
    println("\n\n\n\n\n\n\n This never prints \n\n\n\n\n")
    println(s)
  }

Best Answer

Changing the input from a text file to Pub/Sub makes the PCollection "unbounded". Grouping an unbounded collection by key requires an aggregation trigger to be defined; otherwise the grouping will wait forever. This is mentioned in the Dataflow documentation here: https://cloud.google.com/dataflow/model/group-by-key

Note: Either non-global Windowing or an aggregation trigger is required in order to perform a GroupByKey on an unbounded PCollection. This is because a bounded GroupByKey must wait for all the data with a certain key to be collected; but with an unbounded collection, the data is unlimited. Windowing and/or Triggers allow grouping to operate on logical, finite bundles of data within the unbounded data stream.

If you apply GroupByKey to an unbounded PCollection without setting either a non-global windowing strategy, a trigger strategy, or both, Dataflow will generate an IllegalStateException error when your pipeline is constructed.

Unfortunately, Apache Beam's Python SDK does not seem to support triggers yet, so I'm not sure what the solution would be in Python.

(参见 https://beam.apache.org/documentation/programming-guide/#triggers )
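In Scio, the trigger can be attached where the fixed windows are applied. Below is a minimal sketch of the question's pipeline with an explicit trigger, assuming Scio's WindowOptions API (its trigger, accumulationMode, and allowedLateness fields); sc, psSubscription, and windowSize are the names from the question's code, and the specific trigger shape (early firings every 10 seconds plus a final pane at the watermark) is an illustrative choice, not the only valid one.

```scala
import com.spotify.scio.values.WindowOptions
import org.apache.beam.sdk.transforms.windowing.{
  AfterProcessingTime,
  AfterWatermark,
  IntervalWindow
}
import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
import org.joda.time.Duration

sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(
    windowSize,
    options = WindowOptions(
      // Emit speculative early panes 10 s after the first element arrives,
      // then a final pane when the watermark passes the end of the window.
      trigger = AfterWatermark
        .pastEndOfWindow()
        .withEarlyFirings(
          AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(10))
        ),
      // Each fired pane contains only new elements since the last firing.
      accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
      allowedLateness = Duration.ZERO
    )
  )
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
  .withWindow[IntervalWindow]
  .swap
  .groupByKey
  .map(println)
```

With a trigger in place, each window's grouped results are emitted as panes fire instead of being held back indefinitely, so the transforms after .groupByKey run.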

Regarding "apache-beam - Scio: groupByKey doesn't work when using Pub/Sub as collection source", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44628370/
