python - Apache Beam job fails when performing session windowing on a large dataset

I'm working on a Python Apache Beam job that uses session windowing on a bounded dataset. It works fine for small datasets, but the job dies when I increase the size of the input data.

The job ID is 2019-06-10_07_28_32-2942508228086251217.

import datetime

import apache_beam as beam
import numpy as np
from apache_beam.io.textio import WriteToText
from apache_beam.transforms import window

elements = (p | 'IngestData' >> beam.io.Read(big_query_source))

elements | 'AddEventTimestamp' >> beam.ParDo(AddTimestampDoFn()) \
    | 'SessionWindow' >> beam.WindowInto(window.Sessions(10 * 60)) \
    | 'CreateTuple' >> beam.Map(lambda row: (row['id'], {'attribute1': row['attribute1'], 'date': row['date']})) \
    | 'GroupById1' >> beam.GroupByKey() \
    | 'AggregateSessions' >> beam.ParDo(AggregateTransactions()) \
    | 'MergeWindows' >> beam.WindowInto(window.GlobalWindows()) \
    | 'GroupById2' >> beam.GroupByKey() \
    | 'MapSessionsToLists' >> beam.Map(lambda x: (x[0], [y for y in x[1]])) \
    | 'BiggestSession' >> beam.ParDo(MaximumSession()) \
    | "PrepForWrite" >> beam.Map(lambda x: x[1].update({"id": x[0]}) or x[1]) \
    | 'WriteResult' >> WriteToText(known_args.output)

The DoFn classes are:

class AddTimestampDoFn(beam.DoFn):
    def process(self, element):
        # Parse the event time from the row and attach it as the element's timestamp.
        date = datetime.datetime.strptime(element['date'][:-4], '%Y-%m-%d %H:%M:%S.%f')
        unix_timestamp = float(date.strftime('%s'))
        yield beam.window.TimestampedValue(element, unix_timestamp)
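
As an aside, '%s' is not a documented strftime directive in Python: it only works where the platform's C library supports it (the Linux workers Dataflow uses do), and it interprets the naive datetime in the worker's local timezone. A more portable sketch of the same conversion, assuming the parsed timestamps are UTC:

import datetime

def to_unix_timestamp(dt):
    # Treat the naive datetime as UTC and return seconds since the epoch,
    # preserving fractional seconds.
    return (dt - datetime.datetime(1970, 1, 1)).total_seconds()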


class AggregateTransactions(beam.DoFn):
    def process(self, element, window=beam.DoFn.WindowParam):
        # element is (id, rows) as produced by the upstream GroupByKey.
        session_count = len(element[1])
        attributes = list(map(lambda row: row['attribute1'], element[1]))
        std = np.std(attributes)

        return [(element[0], {'session_count': session_count,
                              'session_std': std,
                              'window_start': window.start
                                                    .to_utc_datetime()
                                                    .strftime('%d-%b-%Y %H:%M:%S')})]


class MaximumSession(beam.DoFn):
    def process(self, element):
        # Pick the session with the highest session_count for this id.
        sorted_counts = sorted(element[1], key=lambda x: x['session_count'], reverse=True)

        return [(element[0], {'session_count': sorted_counts[0]['session_count'],
                              'session_std': sorted_counts[0]['session_std'],
                              'window_start_time': sorted_counts[0]['window_start']})]
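
Since MaximumSession only needs the single largest session per id, one alternative sketch (an assumption of mine, not part of the original pipeline) is to collapse the GroupById2, MapSessionsToLists and BiggestSession steps into a combiner; CombinePerKey lets the runner pre-combine partial results on each worker, so no single worker has to hold every session for a hot key in memory. Here, aggregated is a hypothetical name for the output of the MergeWindows step:

# Hypothetical replacement for GroupById2 + MapSessionsToLists + BiggestSession.
biggest = aggregated | 'BiggestSession' >> beam.CombinePerKey(
    lambda sessions: max(sessions, key=lambda s: s['session_count']))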

The job fails with the following error: The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:

The specific worker logs on Stackdriver don't hint at anything. I just get a combination of these entries:

processing lull for over 431.44 seconds in state process-msecs in step s5

Refusing to split <dataflow_worker.shuffle.GroupedShuffleRangeTracker object at 0x7f82e970cbd0> at '\n\xaaG\t\x00\x01': proposed split position is out of range

Retry with exponential backoff: waiting for 4.69305060273 seconds before retrying lease_work because we caught exception: SSLError: ('The read operation timed out',)

The rest of the entries are just informational.

The latest memory usage reported by that particular worker was 43413 MB. Since I'm using n1-highmem-32 machines, I don't think memory should be an issue.

On the client side, in the Cloud Shell session from which I triggered the job, I just got a lot of

INFO:oauth2client.transport:Refreshing due to a 401 (attempt 1/2)
INFO:oauth2client.transport:Refreshing due to a 401 (attempt 2/2)
INFO:oauth2client.transport:Refreshing due to a 401 (attempt 1/2)
INFO:oauth2client.transport:Refreshing due to a 401 (attempt 1/2)
INFO:oauth2client.transport:Refreshing due to a 401 (attempt 2/2)
INFO:oauth2client.transport:Refreshing due to a 401 (attempt 2/2)
INFO:oauth2client.transport:Refreshing due to a 401 (attempt 1/2)
INFO:oauth2client.transport:Refreshing due to a 401 (attempt 2/2)

before the job crashed.

Any ideas?

Thanks

Best Answer

By default, Dataflow retries a failed work item 4 times when the pipeline runs in BATCH mode, and indefinitely when it runs in STREAMING mode.

Create dashboards in Stackdriver for the Compute Engine machines the pipeline uses, and analyze how much memory and CPU they consume and how many IO operations take place. After carefully analyzing those factors, raise the pipeline's configuration accordingly, as sketched below.
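
For example, worker resources can be raised through pipeline options. A minimal sketch, assuming the Dataflow runner; the project id, region, and all numbers are hypothetical placeholders to be tuned against what the dashboards show:

from apache_beam.options.pipeline_options import PipelineOptions

# All values below are hypothetical; tune them based on the observed
# memory, CPU and IO of the workers.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',            # hypothetical project id
    region='us-central1',            # hypothetical region
    num_workers=16,                  # initial worker count
    max_num_workers=64,              # autoscaling ceiling
    machine_type='n1-highmem-32',    # memory-heavy workers for large groupings
    disk_size_gb=200,                # larger worker disks
)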

Also make sure every transform works correctly for the data you feed it, and apply exception handling inside the transforms.
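
A minimal sketch of that kind of exception handling, using a dead-letter side output so one bad element is diverted instead of failing the bundle 4 times; the 'failed' tag, the SafeAggregateTransactions name and the grouped input are illustrative assumptions, not part of the original pipeline:

class SafeAggregateTransactions(beam.DoFn):
    def process(self, element, window=beam.DoFn.WindowParam):
        try:
            # ... the original AggregateTransactions logic would go here ...
            yield element
        except Exception as e:
            # Route the offending element to a side output for later inspection.
            yield beam.pvalue.TaggedOutput('failed', (element, str(e)))

results = (grouped
           | 'AggregateSessions' >> beam.ParDo(SafeAggregateTransactions())
                                        .with_outputs('failed', main='sessions'))
sessions, failed = results.sessions, results.failed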

Regarding "python - Apache Beam job fails when performing session windowing on a large dataset", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56531872/
