
python - Storing a string column as nested JSON in a JSON file - Pyspark

Reposted. Author: 行者123. Updated: 2023-12-02 07:19:17

I have a PySpark DataFrame that looks like this:

+------------------------------------+-------------------+-------------+--------------------------------+---------+
|member_uuid |Timestamp |updated |member_id |easy_id |
+------------------------------------+-------------------+-------------+--------------------------------+---------+
|027130fe-584d-4d8e-9fb0-b87c984a0c20|2020-02-11 19:15:32|password_hash|ajuypjtnlzmk4na047cgav27jma6_STG|993269700|

I transformed the DataFrame above into this one,

 +---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
|attribute|operation|params |timestamp |
+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
|profile |UPDATE |{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":993269700,"field":"password_hash"}|2020-02-11 19:15:32|

using the following code,

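from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit, struct
ll = ['member_uuid', 'member_id', 'easy_id', 'field']
# col_name is not defined in this snippet; presumably it holds the operation name ('UPDATE' in the output above)
df = df.withColumn('timestamp', col('Timestamp')).withColumn('attribute', lit('profile')).withColumn('operation', lit(col_name)) \
    .withColumn('field', col('updated')).withColumn('params', F.to_json(struct([x for x in ll])))
df = df.select('attribute', 'operation', 'params', 'timestamp')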

I then save this DataFrame df to a text file after converting it to JSON. I tried the following code to do that,

df_final.toJSON().coalesce(1).saveAsTextFile('file')

The file contains,

{"attribute":"profile","operation":"UPDATE","params":"{\"member_uuid\":\"027130fe-584d-4d8e-9fb0-b87c984a0c20\",\"member_id\":\"ajuypjtnlzmk4na047cgav27jma6_STG\",\"easy_id\":993269700,\"field\":\"password_hash\"}","timestamp":"2020-02-11T19:15:32.000Z"}

I want it to be saved in this format,

{"attribute":"profile","operation":"UPDATE","params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":993269700,"field":"password_hash"},"timestamp":"2020-02-11T19:15:32.000Z"}

to_json stores the value in the params column as a string. Is there a way to keep the JSON structure here so that I can save it in the desired output format?

Best answer

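Do not use to_json to create the params column in the DataFrame.

  • The trick here is simply to create a struct and write it to a file (using .saveAsTextFile() or .write.json()); Spark will generate the JSON for the struct field.

  • If the JSON object has already been created as a string and is then written in JSON format, Spark adds \ to escape the quotes already present in the JSON string.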
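The escaping described in the second point is not specific to Spark; it is what any JSON serializer does when a value is already a JSON string. As a rough illustration, here is a minimal sketch using only Python's standard json module. The dictionary contents are copied from the question, and the variable names are hypothetical.

```python
import json

# Values taken from the question's sample row.
params = {
    "member_uuid": "027130fe-584d-4d8e-9fb0-b87c984a0c20",
    "member_id": "ajuypjtnlzmk4na047cgav27jma6_STG",
    "easy_id": 993269700,
    "field": "password_hash",
}

# Case 1: params is serialized first and then embedded as a plain string.
# This mirrors to_json() followed by toJSON(): the inner quotes get escaped.
double_encoded = json.dumps({"attribute": "profile", "params": json.dumps(params)})

# Case 2: params stays structured until the single, final serialization.
# This mirrors keeping a struct column and letting the writer produce the JSON.
nested = json.dumps({"attribute": "profile", "params": params})

print(double_encoded)
print(nested)
```

In the first case the reader gets a string back for params and has to parse it a second time; in the second case params comes back as an object directly.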

Example:

from pyspark.sql.functions import lit

# sample data (spark is an existing SparkSession)
df = spark.createDataFrame([("027130fe-584d-4d8e-9fb0-b87c984a0c20","2020-02-11 19:15:32","password_hash","ajuypjtnlzmk4na047cgav27jma6_STG","993269700")],["member_uuid","Timestamp","updated","member_id","easy_id"])

df1 = df.withColumn("attribute",lit("profile")).withColumn("operation",lit("UPDATE"))

# Option 1: the DataFrame JSON writer serializes the struct column as a nested object
df1.selectExpr("struct(member_uuid,member_id,easy_id) as params","attribute","operation","timestamp").write.format("json").mode("overwrite").save("<path>")

#{"params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":"993269700"},"attribute":"profile","operation":"UPDATE","timestamp":"2020-02-11 19:15:32"}

# Option 2: toJSON() yields one JSON string per row; saveAsTextFile fails if the target path already exists
df1.selectExpr("struct(member_uuid,member_id,easy_id) as params","attribute","operation","timestamp").toJSON().saveAsTextFile("<path>")

#{"params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":"993269700"},"attribute":"profile","operation":"UPDATE","timestamp":"2020-02-11 19:15:32"}

Regarding python - Storing a string column as nested JSON in a JSON file - Pyspark, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/60943862/
