
azure - Pyspark - Reading a complex JSON file


I have a JSON file in a blob container with the following structure:

{
    "tables": [
        {
            "name": "PrimaryResult",
            "columns": [
                {
                    "name": "TenantId",
                    "type": "string"
                },
                {
                    "name": "TimeGenerated",
                    "type": "datetime"
                },
                {
                    "name": "CorrelationId",
                    "type": "string"
                },
                {
                    "name": "UserName",
                    "type": "string"
                }
            ],
            "rows": [
                [
                    "1341234",
                    "5143534",
                    "14314123",
                    "test@test.cloud"
                ]
            ]
        }
    ]
}

When I try to read it, it returns an error (screenshot not reproduced here).

I want to read that file and have the table structure created for me dynamically.

Desired output:

+--------+-------------+-------------+---------------+
|TenantId|TimeGenerated|CorrelationId|       UserName|
+--------+-------------+-------------+---------------+
| 1341234|      5143534|     14314123|test@test.cloud|
+--------+-------------+-------------+---------------+

Can someone help me achieve this?

Thanks!

Best Answer

For multi-line JSON you need to set the multiline option to true (see this answer for details).

Here is the code that reads the JSON and flattens it into multiple columns:

from pyspark.sql.functions import col, explode

df = spark.read.option("inferSchema", "true").option("multiline", "true").json("file.json")
# Explode tables, then the nested rows (an array of arrays, hence the double
# explode) and columns; this cross-joins every column definition with every row value
df.withColumn("tables", explode(col("tables"))).select("tables.*") \
    .withColumn("rows", explode(col("rows"))).withColumn("rows", explode(col("rows"))) \
    .withColumn("columns", explode(col("columns"))) \
    .select("columns.*", "name", "rows").show()

Result:

+-------------+--------+-------------+---------------+
|         name|    type|         name|           rows|
+-------------+--------+-------------+---------------+
|     TenantId|  string|PrimaryResult|        1341234|
|TimeGenerated|datetime|PrimaryResult|        1341234|
|CorrelationId|  string|PrimaryResult|        1341234|
|     UserName|  string|PrimaryResult|        1341234|
|     TenantId|  string|PrimaryResult|        5143534|
|TimeGenerated|datetime|PrimaryResult|        5143534|
|CorrelationId|  string|PrimaryResult|        5143534|
|     UserName|  string|PrimaryResult|        5143534|
|     TenantId|  string|PrimaryResult|       14314123|
|TimeGenerated|datetime|PrimaryResult|       14314123|
|CorrelationId|  string|PrimaryResult|       14314123|
|     UserName|  string|PrimaryResult|       14314123|
|     TenantId|  string|PrimaryResult|test@test.cloud|
|TimeGenerated|datetime|PrimaryResult|test@test.cloud|
|CorrelationId|  string|PrimaryResult|test@test.cloud|
|     UserName|  string|PrimaryResult|test@test.cloud|
+-------------+--------+-------------+---------------+

Update:

If you want to pivot your data, you can simply call .pivot(col). Note that I applied .first() when pivoting, which picks an arbitrary value; if you need a specific value, you have to adjust the aggregation function accordingly:

from pyspark.sql.functions import col, explode, first, lit

df = spark.read.option("inferSchema", "true").option("multiline", "true").json("file.json")
df = df.withColumn("tables", explode(col("tables"))).select("tables.*") \
    .withColumn("rows", explode(col("rows"))).withColumn("rows", explode(col("rows"))).drop("name") \
    .withColumn("columns", explode(col("columns"))).select("columns.*", "rows").orderBy("name")
# Pivot on the column name; first() picks one (arbitrary) value per pivoted column
df.groupBy(lit(1)).pivot("name").agg(first(col("rows"))).drop("1").show()

Result:

+-------------+--------+-------------+--------+
|CorrelationId|TenantId|TimeGenerated|UserName|
+-------------+--------+-------------+--------+
|      1341234| 1341234|      1341234| 1341234|
+-------------+--------+-------------+--------+
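
As a minimal sketch of adjusting that aggregation (my addition, not part of the original answer): collect_list keeps every value per pivoted column instead of the arbitrary first() pick, where df is the exploded frame built just above:

from pyspark.sql.functions import col, collect_list, lit

# Keep all values per pivoted column rather than one arbitrary value
df.groupBy(lit(1)).pivot("name").agg(collect_list(col("rows"))).drop("1").show(truncate=False)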

Update 2:

If you want to associate the values in one array column with the values at the same index in another array column, you can use the arrays_zip function:

from pyspark.sql.functions import arrays_zip, col, explode, first, lit

df = spark.read.option("inferSchema", "true").option("multiline", "true").json("file.json")
# arrays_zip pairs each column definition with the row value at the same index
df = df.withColumn("tables", explode(col("tables"))).select("tables.*").withColumn("rows", explode(col("rows"))) \
    .withColumn("tmp", explode(arrays_zip("columns", "rows"))).select("tmp.columns.name", "tmp.rows")
df.groupBy(lit(1)).pivot("name").agg(first(col("rows"))).drop("1").show()

Result:

+-------------+--------+-------------+---------------+
|CorrelationId|TenantId|TimeGenerated|       UserName|
+-------------+--------+-------------+---------------+
|     14314123| 1341234|      5143534|test@test.cloud|
+-------------+--------+-------------+---------------+
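
Note that the pivots above group on a constant, so if the rows array held more than one row, everything would collapse into a single output line. Here is a minimal sketch (my addition, assuming the same file layout) that keeps one output row per input row by tagging each row with its position via posexplode; row_id is a synthetic index introduced only for grouping:

from pyspark.sql.functions import arrays_zip, col, explode, first, posexplode

df = spark.read.option("inferSchema", "true").option("multiline", "true").json("file.json")
df = (df.withColumn("tables", explode(col("tables"))).select("tables.*")
        # posexplode emits (position, element) so each row keeps a distinct id
        .select(posexplode(col("rows")).alias("row_id", "row"), "columns")
        .withColumn("tmp", explode(arrays_zip(col("columns"), col("row"))))
        .select("row_id", "tmp.columns.name", col("tmp.row").alias("value")))
df.groupBy("row_id").pivot("name").agg(first(col("value"))).drop("row_id").show()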

Regarding azure - Pyspark - reading a complex JSON file, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/76073065/
