arrays - How to properly explode fields in JSON with Spark SQL

Reposted · Author: 行者123 · Updated: 2023-12-04 08:35:56
I am extracting data with spark.sql() for performance. However, I have this incredibly nested JSON that I cannot get the data out of.
Here is the schema of the JSON:

root
|-- httpStatus: long (nullable = true)
|-- httpStatusMessage: string (nullable = true)
|-- response: struct (nullable = true)
| |-- body: struct (nullable = true)
| | |-- dataProviders: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- dataProviderId: long (nullable = true)
| | | | |-- drivers: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- driverFirstName: string (nullable = true)
| | | | | | |-- driverId: long (nullable = true)
| | | | | | |-- driverLastName: string (nullable = true)
| | | | | | |-- driverRef: string (nullable = true)
| | | | | | |-- totalDistance: double (nullable = true)
| | | | | | |-- vehicles: array (nullable = true)
| | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | |-- deviceId: long (nullable = true)
| | | | | | | | |-- deviceRef: string (nullable = true)
| | | | | | | | |-- trips: array (nullable = true)
| | | | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | | | |-- averageSpeed: double (nullable = true)
| | | | | | | | | | |-- tripDistanceTravelled: double (nullable = true)
| | | | | | | | | | |-- tripDuration: double (nullable = true)
| | | | | | | | | | |-- tripId: string (nullable = true)
| | | | | | | | | | |-- tripStart: struct (nullable = true)
| | | | | | | | | | | |-- heading: double (nullable = true)
| | | | | | | | | | | |-- latitude: double (nullable = true)
| | | | | | | | | | | |-- longitude: double (nullable = true)
| | | | | | | | | | | |-- mileage: double (nullable = true)
| | | | | | | | | | | |-- speed: double (nullable = true)
| | | | | | | | | | | |-- timestamp: string (nullable = true)
| | | | | | | | | | |-- tripStop: struct (nullable = true)
| | | | | | | | | | | |-- heading: double (nullable = true)
| | | | | | | | | | | |-- latitude: double (nullable = true)
| | | | | | | | | | | |-- longitude: double (nullable = true)
| | | | | | | | | | | |-- mileage: double (nullable = true)
| | | | | | | | | | | |-- speed: double (nullable = true)
| | | | | | | | | | | |-- timestamp: string (nullable = true)
| | | | | | | | |-- vehicleId: long (nullable = true)
| | | | | | | | |-- vehicleRef: string (nullable = true)
| |-- header: struct (nullable = true)
| | |-- accelUnit: string (nullable = true)
| | |-- date: string (nullable = true)
| | |-- distanceUnit: string (nullable = true)
| | |-- fleetId: long (nullable = true)
| | |-- fleetName: string (nullable = true)
| | |-- gpsUnit: string (nullable = true)
| | |-- speedUnit: string (nullable = true)
|-- timestamp: string (nullable = true)
I have been trying to explode these fields to reach the most deeply nested ones, but I cannot get past the ArrayType columns.
Here is a sample of my code:
json_df = spark.read.json('/user/myuser/drivers_directory/driverRates.json')

json_df.printSchema()

json_df.show()
+----------+-----------------+--------------------+-------------------+
|httpStatus|httpStatusMessage| response| timestamp|
+----------+-----------------+--------------------+-------------------+
| 200| success|[[[[14, [[Eric, 1...|2020-11-11T19:46:01|
+----------+-----------------+--------------------+-------------------+

body_df = json_df.select('response.*')
body_df.show()

json_df.select('response.*').select('body.*').show()
+--------------------+
| dataProviders|
+--------------------+
|[[14, [[Eric, 100...|
+--------------------+


json_df.select('response.*').select('body.*').select('dataProviders.dataProviderId').show()
+--------------+
|dataProviderId|
+--------------+
| [14]|
+--------------+
However, doing this for every field is tedious, and it is terrible for performance. (Note that selecting a struct field through an array of structs, as in `dataProviders.dataProviderId`, yields an array of that field's values, which is why the result shows `[14]`.)
I have been trying to solve it all with spark.sql(), but I keep getting errors about StructType and ArrayType. I want something like this:
json_df.createOrReplaceTempView('driver_dictionary')

final_driver_df = spark.sql("""select
  httpStatus as status
, httpStatusMessage as message
, timestamp as time
from driver_dictionary
lateral view explode(response) as r
""")
The problem I am running into is trying to explode body and the data beneath it. With a lateral view I get a StructType error in one form and an ArrayType error in another. Some help would be appreciated.

Best answer

What I was looking for was this:

drivers_exploded_df = spark.sql('''select
  httpStatus
, httpStatusMessage
, response.header.*
, dataProviders.dataProviderId
, drivers.driverId
, drivers.driverRef
, drivers.driverFirstName
, drivers.driverLastName
, timestamp
from driver_dictionary
lateral view outer explode(response.body.dataProviders) providers_tbl as dataProviders
lateral view outer explode(dataProviders.drivers) dataProviders_drivers as drivers''')

Regarding "arrays - How to properly explode fields in JSON with Spark SQL", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/64795032/
