performance - Spark "first" window function takes much longer than "last"


Reposted by: 行者123  Updated: 2023-12-04 12:53:34

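I am working on a PySpark routine that interpolates the missing values in a configuration table.
Imagine a table of configuration values going from 0 to 50,000. The user specifies a few data points in between (say at 0, 50, 100, 500, 2000, 50000) and we interpolate the rest. My solution follows this blog post quite closely, except that I am not using any UDFs.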
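The interpolation idea described above can be sketched in plain Python, without Spark. This is only an illustration of the arithmetic: the function name `interpolate_config` and the sample values are hypothetical, and it assumes the first and last ranks are among the user-supplied points.

```python
from typing import Dict


def interpolate_config(known: Dict[int, float], lo: int, hi: int) -> Dict[int, float]:
    """Linearly interpolate a value for every rank in [lo, hi].

    `known` maps user-supplied ranks to their values. Assumes both
    `lo` and `hi` are present in `known` (hypothetical helper).
    """
    ranks = sorted(known)
    result = {}
    for rank in range(lo, hi + 1):
        # Nearest known point at or before, and at or after, this rank.
        last_rank = max(r for r in ranks if r <= rank)
        next_rank = min(r for r in ranks if r >= rank)
        if last_rank == next_rank:
            # The rank itself was supplied; this also avoids dividing by zero.
            result[rank] = known[rank]
        else:
            slope = (known[next_rank] - known[last_rank]) / (next_rank - last_rank)
            result[rank] = known[last_rank] + slope * (rank - last_rank)
    return result


# Made-up sample points, not taken from the real config table.
table = interpolate_config({0: 1.0, 4: 3.0, 10: 6.0}, 0, 10)
print(table[2], table[7])
```

The "last known" and "next known" lookups here are what the window functions in the Spark code compute column-wise.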
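While troubleshooting its performance (it takes about 3 minutes), I found that one particular window function takes up nearly all of the time, while everything else I am doing takes only seconds.
Here is the main area of interest, where I use window functions to fill in the previous and next user-supplied configuration values: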

from pyspark.sql import Window, functions as F

# Create partition windows that are required to generate new rows from the ones provided
win_last = Window.partitionBy('PORT_TYPE', 'loss_process').orderBy('rank').rowsBetween(Window.unboundedPreceding, 0)
win_next = Window.partitionBy('PORT_TYPE', 'loss_process').orderBy('rank').rowsBetween(0, Window.unboundedFollowing)

# Join back in the provided config table to populate the "known" scale factors
df_part1 = (df_scale_factors_template
  .join(df_users_config, ['PORT_TYPE', 'loss_process', 'rank'], 'leftouter')
  # Add computed columns that can lookup the prior config and next config for each missing value
  .withColumn('last_rank', F.last( F.col('rank'),         ignorenulls=True).over(win_last))
  .withColumn('last_sf',   F.last( F.col('scale_factor'), ignorenulls=True).over(win_last))
).cache()
debug_log_dataframe(df_part1 , 'df_part1') # Force a .count() and time Part1

df_part2 = (df_part1
  .withColumn('next_rank', F.first(F.col('rank'),         ignorenulls=True).over(win_next))
  .withColumn('next_sf',   F.first(F.col('scale_factor'), ignorenulls=True).over(win_next))
).cache()
debug_log_dataframe(df_part2 , 'df_part2') # Force a .count() and time Part2

df_part3 = (df_part2
  # Implements standard linear interpolation: y = y1 + ((y2-y1)/(x2-x1)) * (x-x1)
  .withColumn('scale_factor', 
              F.when(F.col('last_rank')==F.col('next_rank'), F.col('last_sf')) # Handle div/0 case
              .otherwise(F.col('last_sf') + ((F.col('next_sf')-F.col('last_sf'))/(F.col('next_rank')-F.col('last_rank'))) * (F.col('rank')-F.col('last_rank'))))
  .select('PORT_TYPE', 'loss_process', 'rank', 'scale_factor')
).cache()
debug_log_dataframe(df_part3, 'df_part3', explain=True) # Force a .count() and time Part3

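The above used to be a single chained DataFrame statement, but I have since split it into 3 parts so that I could isolate the part that takes so long. The results are: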

Part 1: Generated 8 columns and 300006 rows in 0.65 seconds

Part 2: Generated 10 columns and 300006 rows in 189.55 seconds

Part 3: Generated 4 columns and 300006 rows in 0.24 seconds

Why do my calls to first() over Window.unboundedFollowing take so much longer than last() over Window.unboundedPreceding?

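Some notes to head off questions/concerns:

debug_log_dataframe is just a helper function that forces the DataFrame execution/cache with a .count() and times it to produce the logs above.

We are actually operating on 6 config tables of 50001 rows at once (hence the partitioning and the row count).

As a sanity check, I ruled out the effects of cache() by explicitly calling unpersist() before timing subsequent runs. I am quite confident in the measurements above.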
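The helper mentioned in the notes is not shown in the post. A minimal sketch of what such a timing helper could look like follows. The body is an assumption; only the name, the `explain` argument, and the log wording come from the post. It relies only on `count()`, `columns`, and `explain()`, which a PySpark DataFrame provides, so a small stand-in class (hypothetical) is used to keep the sketch runnable without a Spark session.

```python
import time


def debug_log_dataframe(df, name, explain=False):
    """Force evaluation of `df` with count(), time it, and log the result.

    Assumed implementation; the original helper's body is not in the post.
    """
    start = time.perf_counter()
    rows = df.count()  # An action, so the lazy plan (and any cache) is materialised.
    elapsed = time.perf_counter() - start
    message = (f"{name}: Generated {len(df.columns)} columns "
               f"and {rows} rows in {elapsed:.2f} seconds")
    print(message)
    if explain:
        df.explain()  # Print the physical plan for inspection.
    return message


class FakeDataFrame:
    """Hypothetical stand-in exposing the three members the helper uses."""
    columns = ["PORT_TYPE", "loss_process", "rank", "scale_factor"]

    def count(self):
        return 300006

    def explain(self):
        print("== Physical Plan == (stub)")


log_line = debug_log_dataframe(FakeDataFrame(), "df_part3", explain=True)
```

Timing a `count()` measures the whole lineage up to that DataFrame, which is why the post caches each part before timing the next.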

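Physical plan:
To help answer this question, I called explain() on the result of part 3 to confirm, among other things, that caching was having the desired effect. Here it is, annotated to highlight the problem areas: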

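The only differences I can see are:

The first two calls (to last) show RunningWindowFunction, whereas the calls for the next_* columns (to first) just read Window.

Part 1 has a *(3) next to it, but Part 2 does not.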

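Some things I have tried:

I tried further splitting part 2 into separate DataFrames. The result is that each first statement takes half of the total time (~98 seconds).

I tried reversing the order in which I generate these columns (e.g. placing the calls to first before the calls to last), but it makes no difference. Whichever DataFrame ends up containing the calls to first is the slow one.

I feel I have done as much digging as I can, and I am hoping a Spark expert can take a look and see where this time is going.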

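Best answer

A solution that doesn't answer the question
While trying various things to speed up my routine, it occurred to me to rewrite my use of first() as a use of last() over the reversed sort order.
So, rewriting this: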

win_next = (Window.partitionBy('PORT_TYPE', 'loss_process')
  .orderBy('rank').rowsBetween(0, Window.unboundedFollowing))

df_part2 = (df_part1
  .withColumn('next_rank', F.first(F.col('rank'),         ignorenulls=True).over(win_next))
  .withColumn('next_sf',   F.first(F.col('scale_factor'), ignorenulls=True).over(win_next))
)

as this:

win_next = (Window.partitionBy('PORT_TYPE', 'loss_process')
  .orderBy(F.desc('rank')).rowsBetween(Window.unboundedPreceding, 0))

df_part2 = (df_part1
  .withColumn('next_rank', F.last(F.col('rank'),         ignorenulls=True).over(win_next))
  .withColumn('next_sf',   F.last(F.col('scale_factor'), ignorenulls=True).over(win_next))
)

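Much to my amazement, this actually solved the performance problem, and now the entire DataFrame is generated in just 3 seconds. I am pleased, but still vexed.
As I somewhat predicted, the query plan now includes a new SORT step before creating the next two columns, and they have changed from Window to RunningWindowFunction, like the first two. Here is the new plan (with the code no longer split into 3 separate cached parts, since that was only done to troubleshoot performance):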

As for the question:

Why do my calls to first() over Window.unboundedFollowing take so much longer than last() over Window.unboundedPreceding?

For academic reasons, I would still like someone to answer it.
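One hypothesis that fits the timings, offered as a sketch rather than a confirmed answer: a frame that starts at unboundedPreceding can be evaluated as a single running pass, while a frame that ends at unboundedFollowing may be recomputed from scratch for every row. As far as I know, the comments in open-source Spark's window frame code describe the unbounded-following frame as roughly O(n * (n - 1) / 2) for that reason. The plain-Python model below only illustrates the two strategies and counts row visits. It is not Spark's implementation, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

Values = List[Optional[float]]


def running_last_non_null(values: Values) -> Tuple[Values, int]:
    """Model of last(ignorenulls) over (unboundedPreceding, currentRow):
    one forward pass that carries the latest non-null value along."""
    out, state, visits = [], None, 0
    for v in values:
        visits += 1
        if v is not None:
            state = v
        out.append(state)
    return out, visits


def naive_first_non_null_following(values: Values) -> Tuple[Values, int]:
    """Model of first(ignorenulls) over (currentRow, unboundedFollowing)
    when every row re-reads its whole frame (assumed worst case)."""
    out, visits, n = [], 0, len(values)
    for i in range(n):
        first = None
        for j in range(i, n):  # The whole remaining frame is visited each time.
            visits += 1
            if first is None and values[j] is not None:
                first = values[j]
        out.append(first)
    return out, visits


def reversed_running_first(values: Values) -> Tuple[Values, int]:
    """The workaround: sort descending, take a running last, flip back."""
    out, visits = running_last_non_null(values[::-1])
    return out[::-1], visits


sample = [1.0, None, None, 4.0, None, 6.0]
slow_result, slow_visits = naive_first_non_null_following(sample)
fast_result, fast_visits = reversed_running_first(sample)
print(slow_result == fast_result, slow_visits, fast_visits)
```

Under this model a partition of 50001 rows would need 50001 * 50002 / 2, about 1.25 billion, row visits per column for the slow form versus 50001 for the running form, which is at least the right order of magnitude for minutes versus seconds.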

Regarding performance - Spark "first" window function takes much longer than "last", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/69308560/
