scala - What are untyped Scala UDFs and typed Scala UDFs? What are the differences between them?

Reposted by: 行者123 · Updated: 2023-12-03 23:42:11

I have been using Spark 2.4 for a while and only started switching to Spark 3.0 in the last few days. After switching to Spark 3.0, running udf((x: Int) => x, IntegerType) produces this error:

Caused by: org.apache.spark.sql.AnalysisException: You're using untyped Scala UDF, which does not have the input type information. Spark may blindly pass null to the Scala closure with primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. To get rid of this error, you could:
1. use typed Scala UDF APIs(without return type parameter), e.g. `udf((x: Int) => x)`
2. use Java UDF APIs, e.g. `udf(new UDF1[String, Integer] { override def call(s: String): Integer = s.length() }, IntegerType)`, if input types are all non primitive
3. set spark.sql.legacy.allowUntypedScalaUDF to true and use this API with caution;

Spark itself proposes the solutions, and after googling for a while I ended up on the Spark Migration Guide page:

In Spark 3.0, using org.apache.spark.sql.functions.udf(AnyRef, DataType) is not allowed by default. Remove the return type parameter to automatically switch to typed Scala udf is recommended, or set spark.sql.legacy.allowUntypedScalaUDF to true to keep using it. In Spark version 2.4 and below, if org.apache.spark.sql.functions.udf(AnyRef, DataType) gets a Scala closure with primitive-type argument, the returned UDF returns null if the input values is null. However, in Spark 3.0, the UDF returns the default value of the Java type if the input value is null. For example, val f = udf((x: Int) => x, IntegerType), f($"x") returns null in Spark 2.4 and below if column x is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.

source: Spark Migration Guide

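I noticed that my usual way of using the functions.udf API, udf(AnyRef, DataType), is called an untyped Scala UDF, while the proposed solution, udf(AnyRef), is called a typed Scala UDF.

To my understanding, the first one looks more strictly typed than the second: the first has its output type explicitly defined and the second does not, hence my confusion about why it is called untyped.

Also, the function passed to udf, (x: Int) => x, clearly has its input type defined, yet Spark claims: You're using untyped Scala UDF, which does not have the input type information?

Is my understanding correct? Even after more intensive searching, I still cannot find any material explaining what an untyped Scala UDF is and what a typed Scala UDF is.
So my questions are: what are they, and what are the differences between them?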

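Best answer

With a typed Scala UDF, the UDF knows the types of the columns passed as arguments, whereas with an untyped Scala UDF, the UDF does not know the types of the columns passed as arguments.
When a typed Scala UDF is created, the types of the columns passed as arguments and of the UDF's output are inferred from the function's argument and return types, whereas when an untyped Scala UDF is created, there is no type inference at all, for either the arguments or the output.
What can be confusing is that, when creating a typed UDF, the types are inferred from the function rather than passed explicitly as arguments. To be more explicit, you can write the typed UDF creation as follows: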

import org.apache.spark.sql.functions.udf
val my_typed_udf = udf[Int, Int]((x: Int) => x)

Now, let's look at the two points you raised.

To my understanding, the first one (eg udf(AnyRef, DataType)) looks more strictly typed than the second one (eg udf(AnyRef)) where the first one has its output type explicitly defined and the second one does not, hence my confusion on why it's called UnTyped.

According to the Spark functions scaladoc, the signature of the udf function that transforms a function into a UDF is actually, for the first one:

def udf(f: AnyRef, dataType: DataType): UserDefinedFunction

And for the second one:

def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction

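So the second one is actually more typed than the first, because the second takes into account the type of the function passed as argument, whereas the first erases the type of the function.
That is why the first one requires you to define the return type: Spark needs this information but cannot infer it from the function passed as argument, because its return type is erased. In the second one, the return type is inferred from the function passed as argument.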
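The difference between the two signatures can be sketched in plain Scala without Spark. The snippet below is only an illustration under stated assumptions: ErasureSketch, describeUntyped and describeTyped are hypothetical names, and ClassTag stands in for the TypeTag used by the real udf so that the example needs nothing beyond the standard library.

```scala
import scala.reflect.ClassTag

// Hypothetical stand-ins for the two udf overloads; this is not Spark code.
object ErasureSketch {
  // Shaped like udf(f: AnyRef, dataType: DataType): the parameter is only an
  // AnyRef, so the signature gives this method no argument or return type.
  def describeUntyped(f: AnyRef): String =
    "unknown => unknown"

  // Shaped like udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): the
  // compiler infers A1 and RT from the function and supplies evidence for both.
  // Assumption: ClassTag replaces TypeTag to avoid a scala-reflect dependency.
  def describeTyped[RT: ClassTag, A1: ClassTag](f: A1 => RT): String =
    s"${implicitly[ClassTag[A1]]} => ${implicitly[ClassTag[RT]]}"
}
```

Calling ErasureSketch.describeTyped((x: Int) => x) yields "Int => Int", while describeUntyped accepts the same function but has nothing to report about it.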

Also the function got passed to udf, which is (x:Int) => x, clearly has its input type defined but Spark claiming You're using untyped Scala UDF, which does not have the input type information?

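What matters here is not the function itself, but how Spark creates a UDF from it.
In both cases, the function to be transformed into a UDF has its input and return types defined, but those types are erased and not taken into account when the UDF is created with udf(AnyRef, DataType).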
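The consequence named in the error message, a null argument showing up as the default value of the Java type, can be reproduced outside Spark. The sketch below is not Spark's actual invocation path; it only assumes a caller that holds an erased reference to an Int => Int closure and passes null to it. NullToDefaultSketch and its members are hypothetical names.

```scala
// Plain-Scala illustration; this is not how Spark itself invokes a UDF.
object NullToDefaultSketch {
  val identityInt: Int => Int = (x: Int) => x

  // A caller that lost the argument type can only treat the closure as Any => Any.
  val erased: Any => Any = identityInt.asInstanceOf[Any => Any]

  // Passing null forces an unboxing to a primitive Int, which cannot hold null,
  // so the closure body sees the default value 0.
  def callWithNull(): Any = erased(null)
}
```

On the JVM, NullToDefaultSketch.callWithNull() returns 0 rather than null, which matches the behaviour the Spark 3.0 error message warns about for primitive-type arguments.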

The original question and answer can be found on Stack Overflow: https://stackoverflow.com/questions/65121888/
