
arrays - Spark Scala: Convert Array of Struct column to String column

Reprinted. Author: 行者123. Updated: 2023-12-02 07:23:07

I have a column whose type is an array, inferred from a JSON file. I want to convert the array<struct> to a string, so that I can keep this array column as-is in Hive and export it to an RDBMS as a single column.

temp.json

{"properties":{"items":[{"invoicid":{"value":"923659"},"job_id":
{"value":"296160"},"sku_id":
{"value":"312002"}}],"user_id":"6666","zip_code":"666"}}

Processing:

scala> val temp = spark.read.json("s3://check/1/temp1.json")
temp: org.apache.spark.sql.DataFrame = [properties: struct<items:
array<struct<invoicid:struct<value:string>,job_id:struct<value:string>,sku_id:struct<value:string>>>, user_id: string ... 1 more field>]

scala> temp.printSchema
root
|-- properties: struct (nullable = true)
| |-- items: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- invoicid: struct (nullable = true)
| | | | |-- value: string (nullable = true)
| | | |-- job_id: struct (nullable = true)
| | | | |-- value: string (nullable = true)
| | | |-- sku_id: struct (nullable = true)
| | | | |-- value: string (nullable = true)
| |-- user_id: string (nullable = true)
| |-- zip_code: string (nullable = true)


scala> temp.select("properties").show
+--------------------+
| properties|
+--------------------+
|[WrappedArray([[9...|
+--------------------+


scala> temp.select("properties.items").show
+--------------------+
| items|
+--------------------+
|[[[923659],[29616...|
+--------------------+


scala> temp.createOrReplaceTempView("tempTable")

scala> spark.sql("select properties.items from tempTable").show
+--------------------+
| items|
+--------------------+
|[[[923659],[29616...|
+--------------------+

How can I get a result like this:

+-----------------------------------------------------------------------------------------+
| items |
+-----------------------------------------------------------------------------------------+
[{"invoicid":{"value":"923659"},"job_id":{"value":"296160"},"sku_id":{"value":"312002"}}] |
+-----------------------------------------------------------------------------------------+

i.e. get the array element values without any change.

Best answer

to_json is the function you are looking for. Note that the snippet below also uses get_json_object, so both need to be imported (and $-notation requires spark.implicits._, which is already in scope in spark-shell):

import org.apache.spark.sql.functions.{get_json_object, to_json}

val df = spark.read.json(sc.parallelize(Seq("""
{"properties":{"items":[{"invoicid":{"value":"923659"},"job_id":
{"value":"296160"},"sku_id":
{"value":"312002"}}],"user_id":"6666","zip_code":"666"}}""")))


df
.select(get_json_object(to_json($"properties"), "$.items").alias("items"))
.show(false)
+-----------------------------------------------------------------------------------------+
|items |
+-----------------------------------------------------------------------------------------+
|[{"invoicid":{"value":"923659"},"job_id":{"value":"296160"},"sku_id":{"value":"312002"}}]|
+-----------------------------------------------------------------------------------------+
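As an alternative sketch (not from the original answer): in Spark 2.2+, to_json also accepts array<struct> columns directly, so you can serialize properties.items in one step and skip the get_json_object extraction. Assuming a local SparkSession for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_json

object ItemsToJson {
  def main(args: Array[String]): Unit = {
    // Local session for demonstration only
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("items-to-json")
      .getOrCreate()
    import spark.implicits._

    val df = spark.read.json(Seq(
      """{"properties":{"items":[{"invoicid":{"value":"923659"},"job_id":{"value":"296160"},"sku_id":{"value":"312002"}}],"user_id":"6666","zip_code":"666"}}"""
    ).toDS)

    // to_json serializes the nested array<struct> column straight to a JSON string,
    // preserving the element structure exactly as it appeared in the source file
    df.select(to_json($"properties.items").alias("items")).show(false)

    spark.stop()
  }
}
```

The resulting "items" column is a plain string, so it can be stored in Hive and exported to an RDBMS as a single column, which is what the question asks for.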

Regarding arrays - Spark Scala: Convert Array of Struct column to String column, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44326954/
