scala - val vs def performance on a Spark DataFrame

Reposted by: 行者123 · Updated: 2023-12-04 00:27:55


The following code raises a question about performance; imagine it at large scale, of course:

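import org.apache.spark.sql.functions.{col, when}
// In spark-shell, `sc` and spark.implicits._ (needed for toDF) are already in scope.

val df = sc.parallelize(Seq(
   ("r1", 1, 1),
   ("r2", 6, 4),
   ("r3", 4, 1),
   ("r4", 1, 2)
   )).toDF("ID", "a", "b")

// For every column except ID, emit 1 when the value equals 1 (else 0), then sum the indicators.
val ones = df.schema.map(c => c.name).drop(1).map(x => when(col(x) === 1, 1).otherwise(0)).reduce(_ + _)

// or (use one definition or the other, not both)

def ones = df.schema.map(c => c.name).drop(1).map(x => when(col(x) === 1, 1).otherwise(0)).reduce(_ + _)

df.withColumn("ones", ones).explain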

Here are the two physical plans, one using def and one using val. They are identical:

 == Physical Plan == **def**
 *(1) Project [_1#760 AS ID#764, _2#761 AS a#765, _3#762 AS b#766, (CASE WHEN (_2#761 = 1) THEN 1 ELSE 0 END + CASE WHEN (_3#762 = 1) THEN 1 ELSE 0 END) AS ones#770]
 +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._1, true, false) AS _1#760, assertnotnull(input[0, scala.Tuple3, true])._2 AS _2#761, assertnotnull(input[0, scala.Tuple3, true])._3 AS _3#762]
   +- Scan[obj#759]


 == Physical Plan == **val**
 *(1) Project [_1#780 AS ID#784, _2#781 AS a#785, _3#782 AS b#786, (CASE WHEN (_2#781 = 1) THEN 1 ELSE 0 END + CASE WHEN (_3#782 = 1) THEN 1 ELSE 0 END) AS ones#790]
 +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._1, true, false) AS _1#780, assertnotnull(input[0, scala.Tuple3, true])._2 AS _2#781, assertnotnull(input[0, scala.Tuple3, true])._3 AS _3#782]
   +- Scan[obj#779]

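So, there is the discussion on:

val vs def performance.

Then:

I see no difference in the .explains. OK.
From elsewhere: val evaluates when defined, def - when called.
I am assuming that it makes no difference whether a val or def is used here, as it is essentially within a loop and there is a reduce. Is this correct?
Will df.schema.map(c => c.name).drop(1) be executed per dataframe row? There is of course no need. Does Catalyst optimize this?
If the above is true, in that the statement is executed every time for the columns to process, how can we make that piece of code occur just once? Should we make a val: val ones = df.schema.map(c => c.name).drop(1)?
val and def are more than just Scala here; Spark components are involved too.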

To the downvoter: I am asking because the following is perfectly clear, but my val involves more than the code below, and the code below is not iterated:

var x = 2 // using var as I need to change it to 3 later
val sq = x*x // evaluates right now
x = 3 // no effect! sq is already evaluated
println(sq)

Best answer

There are two core concepts at play here: Spark DAG creation and evaluation, and Scala's val vs def definitions. The two are orthogonal.

I see no difference in the .explains

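You see no difference because, from Spark's perspective, the query is the same. Whether you store the graph in a val or create it each time with a def makes no difference to the analyzer.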

From elsewhere: val evaluates when defined, def - when called.

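These are Scala semantics. A val is an immutable reference that is evaluated once, at the declaration site. A def is a method definition, and if you allocate a new DataFrame inside it, a new one is created every time you call it. For example: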

def ones = 
  df
   .schema
   .map(c => c.name)
   .drop(1)
   .map(x => when(col(x) === 1, 1).otherwise(0))
   .reduce(_ + _)

val firstCall = ones
val secondCall = ones

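The code above will build two separate DAGs over the DF: each call to ones evaluates the method body again and returns a new, equivalent expression.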
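
The difference in evaluation can be seen without Spark at all. Below is a minimal sketch in plain Scala; the counter `evaluations` and the helper `build` are hypothetical names introduced only to count how often the right-hand side runs, standing in for the construction of the `ones` expression.

```scala
// Counts how many times the right-hand side is evaluated (illustration only).
var evaluations = 0

// Hypothetical stand-in for building the `ones` expression.
def build(): Seq[String] = {
  evaluations += 1
  Seq("ID", "a", "b").drop(1)
}

val onesVal = build() // evaluated once, right here
def onesDef = build() // evaluated again on every reference

val a = onesVal
val b = onesVal
val afterVal = evaluations // reading a val does not re-run build()

val c = onesDef
val d = onesDef
val afterDef = evaluations // each reference to onesDef ran build() again

println(s"evaluations after val reads: $afterVal, after def calls: $afterDef")
```

Both forms yield equal values; only the number of times the builder runs differs, which is consistent with the identical physical plans shown in the question.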

I am assuming that it makes no difference whether a val or def is used here as it essentially within a loop and there is a reduce. Is this correct?

I am not sure which loop you are referring to, but see my answer above for the difference between the two.

Will df.schema.map(c => c.name).drop(1) be executed per dataframe row? There is of course no need. Does Catalyst optimize this?

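No. df.schema.map(c => c.name).drop(1) is ordinary Scala collection code that runs once, on the driver, while the expression is being built. The drop(1) there removes the first column name (ID) from the list of names; it does not drop a row of data. Spark then applies the resulting Column expression across the whole data frame.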
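
To make the distinction concrete, here is a toy model in plain Scala. It is not Spark's API: the type `Row` and the names `columnNames`, `nameLookups` and `onesExpr` are assumptions made up for this sketch. The column-name handling runs once while the expression is built, and only the resulting function is applied to each row.

```scala
// Toy stand-in for a row: column name -> value. Not Spark's Row type.
type Row = Map[String, Any]

var nameLookups = 0 // how many times the column-name list was computed

// Mirrors df.schema.map(c => c.name).drop(1): plain collection code.
def columnNames(schema: Seq[String]): Seq[String] = {
  nameLookups += 1
  schema.drop(1) // drops the first column name ("ID"), not the first row
}

val schema = Seq("ID", "a", "b")

// Build the expression once: one indicator per remaining column, summed with reduce.
val onesExpr: Row => Int =
  columnNames(schema)
    .map(name => (row: Row) => if (row(name) == 1) 1 else 0)
    .reduce((f, g) => (row: Row) => f(row) + g(row))

val rows: Seq[Row] = Seq(
  Map("ID" -> "r1", "a" -> 1, "b" -> 1),
  Map("ID" -> "r2", "a" -> 6, "b" -> 4),
  Map("ID" -> "r3", "a" -> 4, "b" -> 1),
  Map("ID" -> "r4", "a" -> 1, "b" -> 2)
)

// Only the already-built expression runs per row.
val result = rows.map(onesExpr)
println(s"ones per row: $result, name lookups: $nameLookups")
```

In this sketch the name list is computed a single time no matter how many rows are processed, which is the behaviour the question was asking about.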

If the above is true in that the statement is executed every time for the columns to process, how can we make that piece of code occur just once? Should we make a val of val ones = df.schema.map(c => c.name).drop(1)

It occurs only once per data frame (and in your example there is exactly one).

Regarding scala - val vs def performance on a Spark DataFrame, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/54857469/
