
apache-beam - Scio: groupByKey doesn't work when using Pub/Sub as collection source

Reposted · Author: 行者123 · Updated: 2023-12-04 02:52:10

I changed the source of the WindowedWordCount example from a text file to Cloud Pub/Sub, as shown below. I published the Shakespeare file's data to Pub/Sub, and it is ingested correctly, but none of the transforms after .groupByKey seem to run.

sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(windowSize) // apply windowing logic
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
  .withWindow[IntervalWindow]
  .swap
  .groupByKey
  .map { s =>
    println("\n\n\n\n\n\n\n This never prints \n\n\n\n\n")
    println(s)
  }

Best Answer

Changing the input from a text file to Pub/Sub makes the PCollection "unbounded". Grouping an unbounded collection by key requires an aggregation trigger to be defined; otherwise the grouping will wait forever. This is mentioned in the Dataflow documentation here: https://cloud.google.com/dataflow/model/group-by-key

Note: Either non-global Windowing or an aggregation trigger is required in order to perform a GroupByKey on an unbounded PCollection. This is because a bounded GroupByKey must wait for all the data with a certain key to be collected; but with an unbounded collection, the data is unlimited. Windowing and/or Triggers allow grouping to operate on logical, finite bundles of data within the unbounded data stream.

If you apply GroupByKey to an unbounded PCollection without setting either a non-global windowing strategy, a trigger strategy, or both, Dataflow will generate an IllegalStateException error when your pipeline is constructed.

Unfortunately, Apache Beam's Python SDK does not seem to support triggers yet, so I'm not sure what the solution would be in Python.

(参见 https://beam.apache.org/documentation/programming-guide/#triggers )
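In Scio, the trigger can be attached where the fixed windows are applied. Below is a minimal sketch of the question's pipeline with an explicit trigger, assuming Scio's WindowOptions API (its trigger, accumulationMode, and allowedLateness fields); sc, psSubscription, and windowSize are the names from the question's code, and the specific trigger shape (early firings every 10 seconds plus a final pane at the watermark) is an illustrative choice, not the only valid one.

```scala
import com.spotify.scio.values.WindowOptions
import org.apache.beam.sdk.transforms.windowing.{
  AfterProcessingTime,
  AfterWatermark,
  IntervalWindow
}
import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
import org.joda.time.Duration

sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(
    windowSize,
    options = WindowOptions(
      // Emit speculative early panes 10 s after the first element arrives,
      // then a final pane when the watermark passes the end of the window.
      trigger = AfterWatermark
        .pastEndOfWindow()
        .withEarlyFirings(
          AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(10))
        ),
      // Each fired pane contains only new elements since the last firing.
      accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
      allowedLateness = Duration.ZERO
    )
  )
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
  .withWindow[IntervalWindow]
  .swap
  .groupByKey
  .map(println)
```

With a trigger in place, each window's grouped results are emitted as panes fire instead of being held back indefinitely, so the transforms after .groupByKey run.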

Regarding "apache-beam - Scio: groupByKey doesn't work when using Pub/Sub as collection source", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44628370/
