apache-spark - Transforming the output struct with Spark higher-order functions

Reposted. Author: 行者123. Updated: 2023-12-04 01:02:38

How can I use Spark higher-order functions to transform the structs in an array into a differently shaped struct?

Dataset:

case class Foo(thing1:String, thing2:String, thing3:String)
case class Baz(foo:Foo, other:String)
case class Bar(id:Int, bazes:Seq[Baz])
import spark.implicits._
val df = Seq(Bar(1, Seq(Baz(Foo("first", "second", "third"), "other"), Baz(Foo("1", "2", "3"), "else")))).toDF
df.printSchema
df.show(false)

I want to concatenate thing1, thing2, and thing3, but keep the other attribute of each element in bazes.

A simple identity transform:

scala> df.withColumn("cleaned", expr("transform(bazes, x -> x)")).printSchema
root
|-- id: integer (nullable = false)
|-- bazes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- foo: struct (nullable = true)
| | | |-- thing1: string (nullable = true)
| | | |-- thing2: string (nullable = true)
| | | |-- thing3: string (nullable = true)
| | |-- other: string (nullable = true)
|-- cleaned: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- foo: struct (nullable = true)
| | | |-- thing1: string (nullable = true)
| | | |-- thing2: string (nullable = true)
| | | |-- thing3: string (nullable = true)
| | |-- other: string (nullable = true)

only copies the elements over unchanged.

The desired concatenation:

 df.withColumn("cleaned", expr("transform(bazes, x -> concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3))")).printSchema

unfortunately drops all the values of the other column:

 +---+----------------------------------------------------+-------------------------------+
|id |bazes |cleaned |
+---+----------------------------------------------------+-------------------------------+
|1 |[[[first, second, third], other], [[1, 2, 3], else]]|[first::second::third, 1::2::3]|
+---+----------------------------------------------------+-------------------------------+

How can these be preserved? Trying to keep them as a tuple:

df.withColumn("cleaned", expr("transform(bazes, x -> (concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3), x.other))")).printSchema

fails with:

.AnalysisException: cannot resolve 'named_struct('col1', concat(namedlambdavariable().`foo`.`thing1`, '::', namedlambdavariable().`foo`.`thing2`, '::', namedlambdavariable().`foo`.`thing3`), NamePlaceholder(), namedlambdavariable().`other`)' due to data type mismatch: Only foldable string expressions are allowed to appear at odd position, got: NamePlaceholder; line 1 pos 22;

Edit

Desired output:

  • A new column containing:

    [[first::second::third, other], [1::2::3, else]]

where the other column is preserved

Best answer

You can achieve the desired output as follows. foo and other sit at the same level of each array element, so other cannot be reached through foo. Access it separately as x.other and wrap both values in struct(...).

scala>  df.withColumn("cleaned", expr("transform(bazes, x -> struct(concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3),cast(x.other as string)))")).show(false)
+---+----------------------------------------------------+------------------------------------------------+
|id |bazes |cleaned |
+---+----------------------------------------------------+------------------------------------------------+

printSchema

scala>  df.withColumn("cleaned", expr("transform(bazes, x -> struct(concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3),cast(x.other as string)))")).printSchema
root
|-- id: integer (nullable = false)
|-- bazes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- foo: struct (nullable = true)
| | | |-- thing1: string (nullable = true)
| | | |-- thing2: string (nullable = true)
| | | |-- thing3: string (nullable = true)
| | |-- other: string (nullable = true)
|-- cleaned: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- col1: string (nullable = true)
| | |-- col2: string (nullable = true)
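
The schema above shows the default field names col1 and col2. If named fields are preferred, one option is to build the struct with named_struct and pass the names as string literals. The following is a minimal sketch, not part of the original answer: the field name things, the object name NamedStructSketch, and the local SparkSession settings are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

case class Foo(thing1: String, thing2: String, thing3: String)
case class Baz(foo: Foo, other: String)
case class Bar(id: Int, bazes: Seq[Baz])

object NamedStructSketch {
  // Adds a "cleaned" column: an array of structs with explicitly named fields.
  // 'things' is a hypothetical field name; 'other' mirrors the source column.
  def clean(df: DataFrame): DataFrame =
    df.withColumn("cleaned", expr(
      """transform(bazes, x -> named_struct(
        |  'things', concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3),
        |  'other', x.other))""".stripMargin))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("named-struct-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Bar(1, Seq(
      Baz(Foo("first", "second", "third"), "other"),
      Baz(Foo("1", "2", "3"), "else")))).toDF

    val cleaned = clean(df)
    cleaned.printSchema()
    cleaned.show(false)

    spark.stop()
  }
}
```

Because the field names are plain string literals, they satisfy the "foldable string expressions" requirement quoted in the error message from the question.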

Let me know if you have any further questions about this.
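
On Spark 3.0 or later, the same transformation can also be written without a SQL string, using the Column-based transform function from org.apache.spark.sql.functions. This is a sketch under that version assumption and is not taken from the original answer. The field name things and the object name ColumnApiSketch are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat, lit, struct, transform}

case class Foo(thing1: String, thing2: String, thing3: String)
case class Baz(foo: Foo, other: String)
case class Bar(id: Int, bazes: Seq[Baz])

object ColumnApiSketch {
  // Requires Spark 3.0+, where functions.transform accepts a Column => Column lambda.
  def clean(df: DataFrame): DataFrame =
    df.withColumn("cleaned", transform(col("bazes"), x => {
      val foo = x.getField("foo")
      struct(
        // Same concatenation as the SQL version, with "::" as the separator.
        concat(
          foo.getField("thing1"), lit("::"),
          foo.getField("thing2"), lit("::"),
          foo.getField("thing3")).as("things"), // hypothetical field name
        x.getField("other").as("other"))
    }))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("column-api-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Bar(1, Seq(
      Baz(Foo("first", "second", "third"), "other"),
      Baz(Foo("1", "2", "3"), "else")))).toDF

    clean(df).printSchema()
    clean(df).show(false)

    spark.stop()
  }
}
```

The aliases passed to struct give the resulting fields their names, so no cast is needed to keep other alongside the concatenated value.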

For "apache-spark - Transforming the output struct with Spark higher-order functions", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57943560/
