scala - Spark SQL nested withColumn


Reposted by: 行者123 Updated: 2023-12-02 01:17:40


I have a DataFrame with multiple columns, some of which are structs. Something like this:

root
 |-- foo: struct (nullable = true)
 |    |-- bar: string (nullable = true)
 |    |-- baz: string (nullable = true)
 |-- abc: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- def: struct (nullable = true)
 |    |    |    |-- a: string (nullable = true)
 |    |    |    |-- b: integer (nullable = true)
 |    |    |    |-- c: string (nullable = true)

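I want to apply a UserDefinedFunction to the baz field so that baz is replaced by a function of baz, but I can't figure out how to do it. Here is an example of the desired output (note that baz is now an integer):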

root
 |-- foo: struct (nullable = true)
 |    |-- bar: string (nullable = true)
 |    |-- baz: integer (nullable = true)
 |-- abc: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- def: struct (nullable = true)
 |    |    |    |-- a: string (nullable = true)
 |    |    |    |-- b: integer (nullable = true)
 |    |    |    |-- c: string (nullable = true)

It looks like DataFrame.withColumn only works on top-level columns, not on nested ones. I'm using Scala for this.

Can someone help me with this?

Thanks

Best answer

That's easy: just use a dot to select a field inside the nested struct, e.g. $"foo.baz":

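// Assumes a SparkSession named `spark` is in scope, as in spark-shell.
import org.apache.spark.sql.functions._
import spark.implicits._

case class Foo(bar: String, baz: String)
case class Record(foo: Foo)

val df = Seq(
  Record(Foo("Hi", "There"))
).toDF()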


df.printSchema

root
 |-- foo: struct (nullable = true)
 |    |-- bar: string (nullable = true)
 |    |-- baz: string (nullable = true)


val myUDF = udf((s: String) => {
  // do something with s
  s.toUpperCase
})


df
.withColumn("udfResult",myUDF($"foo.baz"))
.show

+----------+---------+
|       foo|udfResult|
+----------+---------+
|[Hi,There]|    THERE|
+----------+---------+

If you want to add the result of the UDF to the existing struct foo, i.e. to get:

root
 |-- foo: struct (nullable = false)
 |    |-- bar: string (nullable = true)
 |    |-- baz: string (nullable = true)
 |    |-- udfResult: string (nullable = true)

there are two options:

With withColumn:

df
.withColumn("udfResult",myUDF($"foo.baz"))
.withColumn("foo",struct($"foo.*",$"udfResult"))
.drop($"udfResult")

With select:

df
.select(struct($"foo.*",myUDF($"foo.baz").as("udfResult")).as("foo"))
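
One thing to watch with the select variant: select returns only the columns you list, so on a DataFrame like the one in the question (which also has abc) the other top-level columns would be dropped. A possible sketch that carries them along is below. SelectKeepOthersSketch, addToFoo and the extra id column are hypothetical names for this example, and the rebuilt foo ends up as the last column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct, udf}

object SelectKeepOthersSketch {
  // Same idea as myUDF in the answer, made null-safe.
  val myUDF = udf((s: String) => Option(s).map(_.toUpperCase).orNull)

  // Rebuilds foo with an extra field and keeps every other top-level column.
  def addToFoo(df: DataFrame): DataFrame = {
    // Backticks keep column names that contain dots from being parsed as paths.
    val others = df.columns.filter(_ != "foo").map(name => col(s"`$name`"))
    val newFoo = struct(col("foo.*"), myUDF(col("foo.baz")).as("udfResult")).as("foo")
    df.select((others :+ newFoo): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("select-keep-others-sketch")
      .getOrCreate()
    import spark.implicits._

    // id is an extra top-level column that should survive the select.
    val df = Seq((1, "Hi", "There"))
      .toDF("id", "bar", "baz")
      .select($"id", struct($"bar", $"baz").as("foo"))

    val result = addToFoo(df)
    result.printSchema()
    result.show()

    spark.stop()
  }
}
```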

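EDIT: To replace an existing attribute in the struct with the result of the UDF: unfortunately, this does not work, because withColumn treats "foo.baz" as a literal top-level column name and adds a new column instead of updating the nested field: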

df
.withColumn("foo.baz",myUDF($"foo.baz"))

But this works:

// get all columns except foo.baz
val structCols = df.select($"foo.*")
    .columns
    .filter(_!="baz")
    .map(name => col("foo."+name))

df.withColumn(
    "foo",
    struct((structCols:+myUDF($"foo.baz").as("baz")):_*)
)
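
If you are on Spark 3.1 or later, Column.withField may be a shorter route to the same result: it adds a field to a struct or replaces an existing field of that name. The sketch below is not part of the original answer. WithFieldSketch, bazLength and replaceBaz are hypothetical names, and the string-length UDF is just a stand-in that turns baz into an integer, as the question wanted.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct, udf}

object WithFieldSketch {
  // Hypothetical UDF: length of the string, or null when the input is null.
  val bazLength = udf((s: String) => Option(s).map(_.length))

  // Replaces foo.baz with the UDF result; requires Spark 3.1+ for withField.
  def replaceBaz(df: DataFrame): DataFrame =
    df.withColumn("foo", col("foo").withField("baz", bazLength(col("foo.baz"))))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("with-field-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Hi", "There"))
      .toDF("bar", "baz")
      .select(struct($"bar", $"baz").as("foo"))

    val result = replaceBaz(df)
    result.printSchema()
    result.show()

    spark.stop()
  }
}
```

On older Spark versions withField is not available, so rebuilding the struct as shown above remains the way to go.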

Regarding scala - Spark SQL nested withColumn, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44831789/


