
group-by - pyspark collect_set of a column that is not part of the groupby


I am trying to use collect_set to get a list of category_name strings that are NOT part of the groupby.
My code is:

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F

sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
("1", "cat1", "Dept1", "product1", 7),
("2", "cat2", "Dept1", "product1", 100),
("3", "cat2", "Dept1", "product2", 3),
("4", "cat1", "Dept2", "product3", 5),
], ["id", "category_name", "department_id", "product_id", "value"])

df.show()
df.groupby("department_id", "product_id")\
.agg({'value': 'sum'}) \
.show()

# .agg( F.collect_set("category_name"))\

The output is:
+---+-------------+-------------+----------+-----+
| id|category_name|department_id|product_id|value|
+---+-------------+-------------+----------+-----+
|  1|         cat1|        Dept1|  product1|    7|
|  2|         cat2|        Dept1|  product1|  100|
|  3|         cat2|        Dept1|  product2|    3|
|  4|         cat1|        Dept2|  product3|    5|
+---+-------------+-------------+----------+-----+

+-------------+----------+----------+
|department_id|product_id|sum(value)|
+-------------+----------+----------+
|        Dept1|  product2|         3|
|        Dept1|  product1|       107|
|        Dept2|  product3|         5|
+-------------+----------+----------+

I want this output:
+-------------+----------+----------+----------------------------+
|department_id|product_id|sum(value)| collect_list(category_name)|
+-------------+----------+----------+----------------------------+
|        Dept1|  product2|         3|                        cat2|
|        Dept1|  product1|       107|                  cat1, cat2|
|        Dept2|  product3|         5|                        cat1|
+-------------+----------+----------+----------------------------+

Attempt 1
df.groupby("department_id", "product_id")\
.agg({'value': 'sum'}) \
.agg(F.collect_set("category_name")) \
.show()

I got this error:

pyspark.sql.utils.AnalysisException: "cannot resolve 'category_name' given input columns: [department_id, product_id, sum(value)];;\n'Aggregate [collect_set('category_name, 0, 0) AS collect_set(category_name)#35]\n+- Aggregate [department_id#2, product_id#3], [department_id#2, product_id#3, sum(value#4L) AS sum(value)#24L]\n +- LogicalRDD [id#0, category_name#1, department_id#2, product_id#3, value#4L]\n"



Attempt 2: I made category_name part of the groupby
df.groupby("category_name", "department_id", "product_id")\
.agg({'value': 'sum'}) \
.agg(F.collect_set("category_name")) \
.show()

It works, but the output is not correct:
+--------------------------+
|collect_set(category_name)|
+--------------------------+
|              [cat1, cat2]|
+--------------------------+

Best answer

You can specify multiple aggregations within one agg(). The correct syntax for your case is:

df.groupby("department_id", "product_id")\
.agg(F.sum('value'), F.collect_set("category_name"))\
.show()
#+-------------+----------+----------+--------------------------+
#|department_id|product_id|sum(value)|collect_set(category_name)|
#+-------------+----------+----------+--------------------------+
#|        Dept1|  product2|         3|                    [cat2]|
#|        Dept1|  product1|       107|              [cat1, cat2]|
#|        Dept2|  product3|         5|                    [cat1]|
#+-------------+----------+----------+--------------------------+
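
If you prefer named result columns instead of the default sum(value) / collect_set(category_name) headers, the same pattern works with alias(). A minimal sketch; the names total_value and categories are just illustrative:

df.groupby("department_id", "product_id")\
    .agg(F.sum("value").alias("total_value"),
         F.collect_set("category_name").alias("categories"))\
    .show()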

Your approach did not work because the first .agg() operates on the pyspark.sql.group.GroupedData and returns a new DataFrame. The subsequent call to agg is actually pyspark.sql.DataFrame.agg, which is

shorthand for df.groupBy.agg()



So essentially, the second call to agg groups the entire DataFrame again, which is not what you want.
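
To see the equivalence, calling agg directly on a DataFrame aggregates over the whole frame, exactly as if you had grouped by no columns. A minimal sketch:

# DataFrame.agg is shorthand for groupBy() with no grouping columns:
df.agg(F.collect_set("category_name")).show()
df.groupBy().agg(F.collect_set("category_name")).show()
# Both return a single row containing [cat1, cat2], which is why
# Attempt 2 collapsed everything into one set.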

Regarding "group-by - pyspark collect_set of a column that is not part of the groupby", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58756259/
