
Spark UI stage and SQL reporting different task time for single partition

Reposted · Author: bug小助手 · Updated: 2023-10-26 21:07:06



Stages tab


I am trying to analyze an apparent bottleneck in my Stages UI:


[Screenshot: stage event timeline showing the single-partition bottleneck]


According to the event timeline, I have a single skewed partition taking 3 minutes to compute. This partition corresponds to task 116665.


Is it actually skew?




The input sizes are all well distributed, so this doesn't look like data skew. It's only the output of this one task that is strangely much larger.


I have no idea why this is, and resolving it has been my focus, as I don't want a single partition bottlenecking the stage for 3 minutes. The same pattern appears in a few other stages spread across my different applications, so I would like to understand it.
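As a sanity check on whether this is really skew, the per-task durations shown in the Stages tab can be compared against the stage median: a max/median ratio far above 1 points to a genuine straggler rather than a display artifact. A minimal pure-Python sketch (the duration figures below are hypothetical, not taken from my actual stage):

```python
from statistics import median

def skew_ratio(durations_s: list[float]) -> float:
    """Ratio of the slowest task to the median task duration.

    A ratio near 1 means the tasks are balanced; a large ratio means
    one task (one partition) dominates the stage's wall time.
    """
    return max(durations_s) / median(durations_s)

# Hypothetical stage: 199 balanced ~5 s tasks plus one 180 s straggler,
# mimicking the 3-minute partition described above.
durations = [5.0] * 199 + [180.0]
print(skew_ratio(durations))  # -> 36.0
```

The same check works on per-task output sizes instead of durations, which is how I concluded the inputs are balanced but the output is not.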


SQL tab


While investigating this, I opened the SQL tab and looked up task 116665:


[Screenshot: SQL tab showing all 3 appearances of task 116665]


You can see in the top left that the task appears only 3 times, and I have captured all 3 appearances in the screenshot above.


Problem:


Task 116665 is involved in 2 steps: an Exchange and a ShuffleHashJoin. These take only 7.3 and 37.4 seconds, respectively, for a combined 44.7 seconds, not the 3 minutes stated in the Stages tab.


However, the WholeStageCodegen descriptor on the far right shows 3.9 minutes for stage 3252 and task 116665, while the Stages tab shows only 3.7 minutes for stage 3252. And if I add up every occurrence of stage 3252 in the SQL tab, it actually comes to over 5 minutes, well over 3.7 minutes.




Summary



  1. The Stages tab shows a single partition taking 3 minutes, but the SQL tab shows the same task needing only 44.7 seconds, roughly a 4x difference!



  2. The Stages tab indicates a duration of 3.7 minutes, the SQL tab 3.9 minutes. If I add up all the times in the SQL tab for stage 3252, it actually comes out to 5.5 minutes.




It seems the UI isn't consistent, as if there's a display error.


My instinct was to assume the WholeStageCodegen time was an aggregate that summed up all the parallel processing; however, the Stages tab shows that a full 3 minutes of it comes from a single partition. So it evidently comes from a single, non-parallelized task, and yet the task time differs greatly between the Stages and SQL tabs.
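A third, independent reading is available from Spark's monitoring REST API, which serves the same event data that backs both tabs, so per-task wall times for stage 3252 can be pulled and summed directly. A sketch, assuming the driver UI is reachable on port 4040 and using placeholder app and attempt IDs; the parsing helper is my own, not part of Spark's API:

```python
import json
from urllib.request import urlopen

def task_durations_ms(stage_attempt: dict) -> dict[int, int]:
    """Map taskId -> duration (ms) from a stage-attempt payload as
    returned by /api/v1/applications/<app-id>/stages/<stage-id>/<attempt>."""
    return {int(tid): t["duration"]
            for tid, t in stage_attempt["tasks"].items()}

# Hypothetical call against a live driver (app id is a placeholder):
# url = ("http://localhost:4040/api/v1/applications/app-20231026/"
#        "stages/3252/0?details=true")
# attempt = json.load(urlopen(url))
# durations = task_durations_ms(attempt)
# print(durations.get(116665))          # wall time of the suspect task
# print(sum(durations.values()) / 6e4)  # total task-minutes in the stage

# Offline demo on a minimal payload of the same shape:
sample = {"tasks": {"116665": {"duration": 180000},
                    "116666": {"duration": 5000}}}
print(task_durations_ms(sample))
```

Whichever number this returns for task 116665 would tell me which tab's figure reflects the actual recorded task duration.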


Question


It seems there are display disparities here that can't be attributed to parallelism. How can this happen? Which figure should I trust?


Notes


The code is essentially ~140 lines of a series of left and inner joins written in Spark SQL.


Spark 3.2.0

