
python - pyspark: count the occurrences of distinct elements in a list

Reposted. Author: 行者123. Updated: 2023-12-02 11:34:42

I have the following data:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

data = {'date': ['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04', '2014-01-05', '2014-01-06'],
        'flat': ['A;A;B', 'D;P;E;P;P', 'H;X', 'P;Q;G', 'S;T;U', 'G;C;G']}

# Build the pandas DataFrame first, then parse the date strings once.
data = pd.DataFrame(data)
data['date'] = pd.to_datetime(data['date'])

spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "500g") \
    .appName('my-pandasToSparkDF-app') \
    .getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.sparkContext.setLogLevel("OFF")

df = spark.createDataFrame(data)
# Split the ';'-delimited string into an array column.
new_frame = df.withColumn("list", F.split("flat", ";"))


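I want to add a new column that holds the number of occurrences of each distinct element (sorted in ascending order), and another column that holds the maximum of those counts: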

+-------------------+-----------+---------------------+-----------+----+
|               date|flat       |list                 |occurrences|max |
+-------------------+-----------+---------------------+-----------+----+
|2014-01-01 00:00:00|A;A;B      |['A','A','B']        |[1,2]      |2   |
|2014-01-02 00:00:00|D;P;E;P;P  |['D','P','E','P','P']|[1,1,3]    |3   |
|2014-01-03 00:00:00|H;X        |['H','X']            |[1,1]      |1   |
|2014-01-04 00:00:00|P;Q;G      |['P','Q','G']        |[1,1,1]    |1   |
|2014-01-05 00:00:00|S;T;U      |['S','T','U']        |[1,1,1]    |1   |
|2014-01-06 00:00:00|G;C;G      |['G','C','G']        |[1,2]      |2   |
+-------------------+-----------+---------------------+-----------+----+

Thank you very much!

Best answer

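For Spark 2.4+, this can be done without multiple groupBy and aggregation steps, which are expensive shuffle operations on large data. You can do it in a single expression with the higher-order functions transform and aggregate: for each distinct element of the list, aggregate walks the whole list and counts the entries equal to that element. This should be the canonical solution for Spark 2.4.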

from pyspark.sql import functions as F

df = spark.createDataFrame(data)
# For each distinct element x, count the list entries equal to x, then sort the counts ascending.
df.withColumn("list", F.split("flat", ";"))\
  .withColumn("occurrences", F.expr("""array_sort(transform(array_distinct(list), x -> aggregate(list, 0, (acc, t) -> acc + IF(t = x, 1, 0))))"""))\
  .withColumn("max", F.array_max("occurrences"))\
  .show()

+-------------------+---------+---------------+-----------+---+
|               date|     flat|           list|occurrences|max|
+-------------------+---------+---------------+-----------+---+
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|     [1, 2]|  2|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  [1, 1, 3]|  3|
|2014-01-03 00:00:00|      H;X|         [H, X]|     [1, 1]|  1|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]|  [1, 1, 1]|  1|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]|  [1, 1, 1]|  1|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|     [1, 2]|  2|
+-------------------+---------+---------------+-----------+---+
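
To see what the nested expression does for a single row, here is a minimal plain-Python sketch of the same steps. It is only an illustration and does not use Spark: the helper name count_occurrences is hypothetical, and functools.reduce stands in for Spark's aggregate, a list comprehension for transform, and sorted for array_sort.

```python
from functools import reduce


def count_occurrences(items):
    """Mirror the Spark SQL expression step by step on a plain Python list."""
    # array_distinct(list): unique elements, kept in first-seen order
    distinct = list(dict.fromkeys(items))
    # transform(distinct, x -> aggregate(list, 0, (acc, t) -> acc + IF(t = x, 1, 0)))
    counts = [
        reduce(lambda acc, t: acc + (1 if t == x else 0), items, 0)
        for x in distinct
    ]
    # array_sort(...): counts in ascending order
    return sorted(counts)


rows = ["A;A;B", "D;P;E;P;P", "H;X", "P;Q;G", "S;T;U", "G;C;G"]
for flat in rows:
    occurrences = count_occurrences(flat.split(";"))
    # max(...) plays the role of array_max on the new column
    print(flat, occurrences, max(occurrences))
```

Like the Spark expression, this rescans the list once per distinct element. That is quadratic in the list length in the worst case, but it stays inside one row, which is why no shuffle is needed.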

Regarding "python - pyspark: count the occurrences of distinct elements in a list", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/61171443/
