apache-spark - Spark: how to start understanding the execution plan below


I am trying to understand the physical plan below, and I have a few questions about it.

== Physical Plan ==
*(13) Project [brochure_click_uuid#32, brochure_id#88L, page#36L, duration#188L]
+- *(13) BroadcastHashJoin [brochure_click_uuid#32], [brochure_click_uuid#87], Inner, BuildRight
   :- *(13) HashAggregate(keys=[brochure_click_uuid#32, page#36L], functions=[sum(duration#142L)])
   :  +- Exchange hashpartitioning(brochure_click_uuid#32, page#36L, 200)
   :     +- *(11) HashAggregate(keys=[brochure_click_uuid#32, page#36L], functions=[partial_sum(duration#142L)])
   :        +- Union
   :           :- *(5) Project [brochure_click_uuid#32, page#36L, CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END AS duration#142L]
   :           :  +- *(5) Filter ((isnotnull(event#34) && NOT (event#34 = EXIT_VIEW)) && isnotnull(CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END))
   :           :     +- Window [lead(date_time#48, 1, null) windowspecdefinition(brochure_click_uuid#32, date_time#48 ASC NULLS FIRST, specifiedwindowframe(RowFrame, 1, 1)) AS _we0#143], [brochure_click_uuid#32], [date_time#48 ASC NULLS FIRST]
   :           :        +- *(4) Sort [brochure_click_uuid#32 ASC NULLS FIRST, date_time#48 ASC NULLS FIRST], false, 0
   :           :           +- Exchange hashpartitioning(brochure_click_uuid#32, 200)
   :           :              +- Union
   :           :                 :- *(1) Project [brochure_click_uuid#32, cast(date_time#33 as timestamp) AS date_time#48, page#36L, event#34]
   :           :                 :  +- *(1) Filter isnotnull(brochure_click_uuid#32)
   :           :                 :     +- *(1) FileScan json [brochure_click_uuid#32,date_time#33,event#34,page#36L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/e..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint>
   :           :                 :- *(2) Project [brochure_click_uuid#6, cast(date_time#7 as timestamp) AS date_time#20, page#10L, event#8]
   :           :                 :  +- *(2) Filter isnotnull(brochure_click_uuid#6)
   :           :                 :     +- *(2) FileScan json [brochure_click_uuid#6,date_time#7,event#8,page#10L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/p..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint>
   :           :                 +- *(3) Project [brochure_click_uuid#60, cast(date_time#61 as timestamp) AS date_time#74, page#64L, event#62]
   :           :                    +- *(3) Filter isnotnull(brochure_click_uuid#60)
   :           :                       +- *(3) FileScan json [brochure_click_uuid#60,date_time#61,event#62,page#64L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/e..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint>
   :           +- *(10) Project [brochure_click_uuid#32, (page#36L + 1) AS page#166L, CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END AS duration#142L]
   :              +- *(10) Filter ((((isnotnull(event#34) && isnotnull(page_view_mode#37)) && NOT (event#34 = EXIT_VIEW)) && (page_view_mode#37 = DOUBLE_PAGE_MODE)) && isnotnull(CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END))
   :                 +- Window [lead(date_time#48, 1, null) windowspecdefinition(brochure_click_uuid#32, date_time#48 ASC NULLS FIRST, specifiedwindowframe(RowFrame, 1, 1)) AS _we0#143], [brochure_click_uuid#32], [date_time#48 ASC NULLS FIRST]
   :                    +- *(9) Sort [brochure_click_uuid#32 ASC NULLS FIRST, date_time#48 ASC NULLS FIRST], false, 0
   :                       +- Exchange hashpartitioning(brochure_click_uuid#32, 200)
   :                          +- Union
   :                             :- *(6) Project [brochure_click_uuid#32, cast(date_time#33 as timestamp) AS date_time#48, page#36L, page_view_mode#37, event#34]
   :                             :  +- *(6) Filter isnotnull(brochure_click_uuid#32)
   :                             :     +- *(6) FileScan json [brochure_click_uuid#32,date_time#33,event#34,page#36L,page_view_mode#37] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/e..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint,page_view_mode:string>
   :                             :- *(7) Project [brochure_click_uuid#6, cast(date_time#7 as timestamp) AS date_time#20, page#10L, page_view_mode#11, event#8]
   :                             :  +- *(7) Filter isnotnull(brochure_click_uuid#6)
   :                             :     +- *(7) FileScan json [brochure_click_uuid#6,date_time#7,event#8,page#10L,page_view_mode#11] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/p..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint,page_view_mode:string>
   :                             +- *(8) Project [brochure_click_uuid#60, cast(date_time#61 as timestamp) AS date_time#74, page#64L, page_view_mode#65, event#62]
   :                                +- *(8) Filter isnotnull(brochure_click_uuid#60)
   :                                   +- *(8) FileScan json [brochure_click_uuid#60,date_time#61,event#62,page#64L,page_view_mode#65] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/e..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint,page_view_mode:string>
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, true]))
      +- *(12) Project [brochure_id#88L, brochure_click_uuid#87]
         +- *(12) Filter isnotnull(brochure_click_uuid#87)
            +- *(12) FileScan json [brochure_click_uuid#87,brochure_id#88L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/b..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,brochure_id:bigint>

I have the following questions:

  1. Which is the head and which is the tail, i.e., where do I start reading and how do I traverse further?
  2. What are the numbers at the start of each line, e.g. (13), (11), (5)?
  3. Some lines start with +- and some with :-. What is the difference, and when is +- printed and when is :- printed before a line?
  4. What is the meaning of the cascading lines, for example the following:


:        +- Union
:           :- *(5) Project [brochure_click_uuid#32, page#36L, CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END AS duration#142L]
:           :  +- *(5) Filter ((isnotnull(event#34) && NOT (event#34 = EXIT_VIEW)) && isnotnull(CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END))
:           :     +- Window [lead(date_time#48, 1, null) windowspecdefinition(brochure_click_uuid#32, date_time#48 ASC NULLS FIRST, specifiedwindowframe(RowFrame, 1, 1)) AS _we0#143], [brochure_click_uuid#32], [date_time#48 ASC NULLS FIRST]
:           :        +- *(4) Sort [brochure_click_uuid#32 ASC NULLS FIRST, date_time#48 ASC NULLS FIRST], false, 0
:           :           +- Exchange hashpartitioning(brochure_click_uuid#32, 200)

  5. There are vertical lines formed with : connecting two lines. What is the meaning of these lines? How are the two steps they connect related to each other?

--- Update after the answer ---

So, for the query plan above (or the smaller ones you mentioned):

  1. How do I work out the number of jobs (if possible), the stages of each job, and the steps that make up each stage?
  2. Does a parent node together with all its children form one stage? And since, as you mentioned, an operator can have multiple children at the same level, does that mean multiple stages lead into the parent?
  3. Finally, you mentioned at the start of your answer that there are many file scans. Is that because the RDD/DataFrame gets recomputed?

Please explain in as much detail as possible. I'm a newbie 😊 but I'm trying hard to learn.

Best Answer

Let me try to answer your questions one by one:

Which is head and which is tail, i.e. where to start and traverse further.

A query plan has a tree structure, so the right question is what is the root and what are the leaves. The leaf nodes are the most deeply nested ones; in your case they are the FileScan json nodes, and there are several of them. You start reading at the leaves and work your way up until you reach the root at the top of the plan, which in your case is the first Project operator.
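
To make this concrete on something small, here is a minimal, self-contained sketch (the data and column names are illustrative, not taken from the question); the printed plan has the same tree shape, just smaller:

import org.apache.spark.sql.SparkSession

object ExplainDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("explain-demo").getOrCreate()
    import spark.implicits._

    // Two tiny DataFrames standing in for the JSON sources in the question.
    val clicks    = Seq(("u1", 1L), ("u2", 2L)).toDF("brochure_click_uuid", "page")
    val brochures = Seq(("u1", 10L)).toDF("brochure_click_uuid", "brochure_id")

    // Read the printed plan bottom-up: the two scan leaves (here
    // LocalTableScan, in your plan FileScan json) are the most indented
    // lines, and the join is the root at the top.
    clicks.join(brochures, "brochure_click_uuid").explain()

    spark.stop()
  }
}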

What are those numbers at the start of each line, e.g. (13), (11), (5)?

It is the codegenStageId. During the physical planning phase, Spark generates Java code for the operators in the plan. To quote the Spark source code directly:

The codegenStageCounter generates ID for codegen stages within a query plan. This ID is used to help differentiate between codegen stages. It is included as a part of the explain output for physical plans. The ID makes it obvious that not all adjacent codegen'd plan operators are of the same codegen stage.

Also, the asterisk * means that Spark generated code for that operator.
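
If you are curious what that generated code looks like, Spark ships a debugging helper; a minimal sketch, assuming a DataFrame named df is already defined (available since Spark 2.x):

// Prints the generated Java code for each whole-stage-codegen subtree;
// each subtree corresponds to one *(N) id in the explain output.
import org.apache.spark.sql.execution.debug._
df.debugCodegen()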

Some lines have +- at the start and some have :-. What is the difference, and when does +- get printed and when does :- get printed before a line?

Some operators have more than one child, for example Union, BroadcastHashJoin, or SortMergeJoin (there are others). In that case, the children of such an operator are displayed in the plan like this:

Union
:- Project ...
:  +- here can be child of project
:
+- Project ...
   +- here can be child of project

So this plan means that both Project operators are children of the Union operator; they are at the same level in the tree.
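
You can reproduce that shape in a couple of lines, e.g. in spark-shell, where a SparkSession named spark already exists (the column name is illustrative):

// Both children of the Union are printed at the same depth: the first
// is prefixed with ":-" and the last with "+-"; the ":" column keeps
// the first child's branch visible while its subtree is printed.
val a = spark.range(3).toDF("page")
val b = spark.range(3).toDF("page")
a.union(b).explain()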

What's the meaning of the cascading lines?

These cascading lines

+- Project
   +- Filter
      +- Window

simply mean that the Filter is a child of the Project, the Window is a child of the Filter, and so on. It is a tree, and it ends at the leaf nodes, which have no children. In your plan the leaves are the FileScan json nodes.
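
Since the plan really is a tree, you can also walk it programmatically instead of reading the text; a sketch, again assuming a DataFrame named df:

// executedPlan is the root node of the physical plan tree.
val plan = df.queryExecution.executedPlan
println(plan.numberedTreeString)                       // the same tree, with node numbers
plan.collectLeaves().foreach(l => println(l.nodeName)) // the leaves, e.g. FileScan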

There are vertical lines formed with : connecting two lines. What is the meaning of these lines? How are the two connected steps related to each other?

As I explained above, the vertical lines formed with : connect operators that sit at the same level of the tree, i.e., children of the same multi-child operator.

The original question can be found on Stack Overflow: https://stackoverflow.com/questions/58048841/
