
python - PySpark: convert a struct field inside an array to a string

Reposted. Author: 行者123. Updated: 2023-12-01 01:15:58

I have a dataframe with the following schema:

|-- order: string (nullable = true)
|-- travel: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- place: struct (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- address: string (nullable = true)
| | | |-- latitude: double (nullable = true)
| | | |-- longitude: double (nullable = true)
| | |-- distance_in_kms: float (nullable = true)
| | |-- estimated_time: struct (nullable = true)
| | | |-- seconds: long (nullable = true)
| | | |-- nanos: integer (nullable = true)

I want to take the seconds from estimated_time, convert it to a string, append an "s", and replace estimated_time with the new string value. For example, { "seconds": "988", "nanos": "102" } would become "988s", so the schema would change to

|-- order: string (nullable = true)
|-- travel: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- place: struct (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- address: string (nullable = true)
| | | |-- latitude: double (nullable = true)
| | | |-- longitude: double (nullable = true)
| | |-- distance_in_kms: float (nullable = true)
| | |-- estimated_time: string (nullable = true)

How can I do this in PySpark?

To give a more concrete example, I want to transform this DF (visualized as JSON)

{
  "order": "c-331",
  "travel": [
    {
      "place": {
        "name": "A place",
        "address": "The address",
        "latitude": 0.0,
        "longitude": 0.0
      },
      "distance_in_kms": 1.0,
      "estimated_time": {
        "seconds": 988,
        "nanos": 102
      }
    }
  ]
}

into

{
  "order": "c-331",
  "travel": [
    {
      "place": {
        "name": "A place",
        "address": "The address",
        "latitude": 0.0,
        "longitude": 0.0
      },
      "distance_in_kms": 1.0,
      "estimated_time": "988s"
    }
  ]
}

Best Answer

You can do this with the following PySpark functions:

  • withColumn lets you create a new column. We will use it to extract "estimated_time"
  • concat concatenates string columns
  • lit creates a column from a given string literal

See the example below:

from pyspark.sql import functions as F

j = '{"order":"c-331","travel":[{"place":{"name":"A place","address":"The address","latitude":0.0,"longitude":0.0},"distance_in_kms":1.0,"estimated_time":{"seconds":988,"nanos":102}}]}'
df = spark.read.json(sc.parallelize([j]))

# The following command creates a new column called estimated_time2, which contains
# the values of travel.estimated_time.seconds concatenated with an 's'.
bla = df.withColumn(
    'estimated_time2',
    F.concat(df.travel.estimated_time.seconds[0].cast("string"), F.lit("s"))
)

# Unfortunately it is currently not possible to use withColumn to replace a member
# of a struct. Therefore the following select rebuilds 'travel', replacing
# 'estimated_time' with the previously created column estimated_time2.
bla = bla.select(
    "order",
    F.array(
        F.struct(
            bla.travel.distance_in_kms[0].alias("distance_in_kms"),
            bla.travel.place[0].alias("place"),
            bla.estimated_time2.alias("estimated_time"),
        )
    ).alias("travel"),
)

bla.show(truncate=False)
bla.printSchema()

This is the output:

+-----+------------------------------------------+ 
|order|travel |
+-----+------------------------------------------+
|c-331|[[1.0,[The address,0.0,0.0,A place],988s]]|
+-----+------------------------------------------+


root
|-- order: string (nullable = true)
|-- travel: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- distance_in_kms: double (nullable = true)
| | |-- place: struct (nullable = true)
| | | |-- address: string (nullable = true)
| | | |-- latitude: double (nullable = true)
| | | |-- longitude: double (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- estimated_time: string (nullable = true)

Regarding python - PySpark: convert a struct field inside an array to a string, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54343635/
