
python - Optimizing PySpark's collect_list function


I need to aggregate my data so that it produces the following output:

JSON output

{
  "people": [
    {
      "firstName": "Jimi",
      "lastName": "Hendrix",
      "age": "27"
    },
    {
      "firstName": "Jimmy",
      "lastName": "Page",
      "age": "75"
    }
  ]
}

However, when I run the aggregation function (below), I get this error:

Caused by: org.apache.spark.SparkSession: Job aborted due to stage failure: Total size of serialized results of <task_size> tasks (20 GB) is bigger than spark.driver.maxResultSize (20.0 GB)

This leads me to believe that the collect_list function is what's causing this problem. Instead of running the tasks in parallel, they run on a single node and run out of memory.

What is the best way to create the JSON output? Is there a way to optimize the collect_list function?

Sample code:

from pyspark.sql.functions import collect_list, struct

def aggregate(df):
    # Collapse all rows into a single array-of-structs column named 'people'
    return df.agg(collect_list(struct(
        df.firstName,
        df.lastName,
        df.age
    )).alias('people'))

Best answer

When you run collect_list, all of the data is gathered onto the driver as a single list. Instead, you can build a JSON string column inside the DataFrame and write it out (for example as CSV), so the work stays distributed across the executors.
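
A minimal sketch of that approach, assuming a DataFrame df with the firstName, lastName, and age columns from the question; the output path and the use of the plain-text writer are illustrative choices, not taken from the original answer:

from pyspark.sql.functions import to_json, struct

# Turn each row into one JSON object string instead of collecting everything
# into a single array on the driver.
people_json = df.select(
    to_json(struct(df.firstName, df.lastName, df.age)).alias("person_json")
)

# Each executor writes its own partitions, so nothing is funneled through
# the driver the way a collect_list(...) aggregation is.
people_json.write.mode("overwrite").text("/tmp/people_json")

The result is a directory of part files with one JSON object per line, which can be concatenated or re-read with spark.read.json if a single document is needed later.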

The alternative is to raise the driver's result-size limit:

conf.set("spark.driver.maxResultSize", "25g")
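
If you control how the SparkSession is created, the same setting can be applied when the session is built; a minimal sketch (the app name here is just an example):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("people-aggregation")  # illustrative app name
    .config("spark.driver.maxResultSize", "25g")
    .getOrCreate()
)

Keep in mind this only raises the ceiling; the collected result still has to fit in driver memory, so the JSON-column approach above scales better for large data.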

For python - Optimizing PySpark's collect_list function, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58436301/
