
python ,pyspark : get sum of a pyspark dataframe column values

Reposted. Author: 太空狗. Updated: 2023-10-30 01:43:24

Suppose I have a dataframe like this:

name age city
abc 20 A
def 30 B

I want to add a summary row at the end of the dataframe, so the result would look like:

name age city
abc 20 A
def 30 B
All 50 All

The string 'All' I can fill in easily, but how do I get sum(df['age'])? Calling Python's built-in sum on the column raises "Column is not iterable":

data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])
data.printSchema()
#root
#|-- name: string (nullable = true)
#|-- age: long (nullable = true)
#|-- city: string (nullable = true)
res = data.union(spark.createDataFrame([('All', sum(data['age']), 'All')], data.columns)) ## TypeError: Column is not iterable
# Also tried data['age'].sum() and got an error. Hard-coding [('All', 50, 'All')] works fine.

I usually work with Pandas dataframes and am new to Spark, so my understanding of Spark dataframes is probably still immature.

Please suggest how to get the sum of a dataframe column in pyspark, and whether there is a better way to add/append a row to the end of a dataframe. Thanks.

Best Answer

Spark SQL has a module dedicated to column functions, pyspark.sql.functions.
So it works like this:

from pyspark.sql import functions as F
data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])

res = data.union(  # unionAll is deprecated since Spark 2.0; union is equivalent
    data.select([
        F.lit('All').alias('name'),   # create a column named 'name' filled with 'All'
        F.sum(data.age).alias('age'), # get the sum of 'age'
        F.lit('All').alias('city')    # create a column named 'city' filled with 'All'
    ]))
res.show()
res.show()

This prints:

+----+---+----+
|name|age|city|
+----+---+----+
| abc| 20| A|
| def| 30| B|
| All| 50| All|
+----+---+----+

Regarding "python, pyspark: get sum of a pyspark dataframe column values", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39504950/
