
python - Iterating over PySpark GroupedData

Reposted. Author: 太空狗. Updated: 2023-10-30 00:01:07

Let's assume the original data looks like this:

Competitor  Region  ProductA  ProductB
Comp1       A       £10       £15
Comp1       B       £11       £16
Comp1       C       £11       £15
Comp2       A       £9        £16
Comp2       B       £12       £14
Comp2       C       £14       £17
Comp3       A       £11       £16
Comp3       B       £10       £15
Comp3       C       £12       £15

(Quoted from: Python - splitting dataframe into multiple dataframes based on column values and naming them with those values)

I would like to get a list of sub-DataFrames split on the values of a column, e.g. Region, such as:

df_A :

Competitor  Region  ProductA  ProductB
Comp1       A       £10       £15
Comp2       A       £9        £16
Comp3       A       £11       £16

In Python (pandas) I can do this:

for region, df_region in df.groupby('Region'):
    print(df_region)
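For reference, that pandas loop works as written; a minimal self-contained sketch (rebuilding the sample table above, with the £ prices kept as plain strings) looks like this:

```python
import pandas as pd

# Rebuild the sample table from the question (prices kept as plain strings)
df = pd.DataFrame({
    'Competitor': ['Comp1', 'Comp1', 'Comp1', 'Comp2', 'Comp2', 'Comp2',
                   'Comp3', 'Comp3', 'Comp3'],
    'Region':     ['A', 'B', 'C'] * 3,
    'ProductA':   ['£10', '£11', '£11', '£9', '£12', '£14', '£11', '£10', '£12'],
    'ProductB':   ['£15', '£16', '£15', '£16', '£14', '£17', '£16', '£15', '£15'],
})

# groupby iteration yields (group_value, sub_frame) pairs;
# collecting them into a dict gives named access to each sub-frame
sub_frames = {region: grp for region, grp in df.groupby('Region')}
```

`sub_frames['A']` is then exactly the df_A shown above, with one row per competitor.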

If df is a PySpark DataFrame, can I do the same kind of iteration?

In PySpark, once I call df.groupBy("Region") I get a GroupedData object. I don't need any aggregation such as count or mean. I just want a list of sub-DataFrames, each containing the rows sharing the same "Region" value. Is that possible?

Best Answer

Assuming the list of distinct values in the grouping column is small enough to fit in the driver's memory, the approach below should work for you. Hope this helps!

import pyspark.sql.functions as F
import pandas as pd

# Sample data
df = pd.DataFrame({'region': ['aa', 'aa', 'aa', 'bb', 'bb', 'cc'],
                   'x2': [6, 5, 4, 3, 2, 1],
                   'x3': [1, 2, 3, 4, 5, 6]})
df = spark.createDataFrame(df)

# Get the unique values in the grouping column
groups = [x[0] for x in df.select("region").distinct().collect()]

# Create a filtered DataFrame for each group in a list comprehension
groups_list = [df.filter(F.col('region') == x) for x in groups]

# Show the results
for x in groups_list:
    x.show()

Result:

+------+---+---+
|region| x2| x3|
+------+---+---+
| cc| 1| 6|
+------+---+---+

+------+---+---+
|region| x2| x3|
+------+---+---+
| bb| 3| 4|
| bb| 2| 5|
+------+---+---+

+------+---+---+
|region| x2| x3|
+------+---+---+
| aa| 6| 1|
| aa| 5| 2|
| aa| 4| 3|
+------+---+---+
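Note that each DataFrame in groups_list triggers its own scan of df when an action runs, so calling df.cache() first is worth considering. Alternatively, if the full dataset (not just the group keys) fits in driver memory, you can collect once with toPandas() and reuse the pandas iteration from the question. A sketch under that assumption (the Spark collect line is commented out and replaced by a pandas stand-in so the snippet is self-contained):

```python
import pandas as pd

# pdf = df.toPandas()  # one Spark action; assumes the whole DataFrame fits in memory
pdf = pd.DataFrame({'region': ['aa', 'aa', 'aa', 'bb', 'bb', 'cc'],
                    'x2': [6, 5, 4, 3, 2, 1],
                    'x3': [1, 2, 3, 4, 5, 6]})  # stand-in for the collected data

# Build a dict keyed by group value for direct access instead of a positional list
frames = {region: grp for region, grp in pdf.groupby('region')}
```

The resulting sub-frames are plain pandas DataFrames, so there is no further Spark cost per group.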

Regarding python - iterating over PySpark GroupedData, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51472144/
