
python - How to get the name of the column with the maximum value in a PySpark DataFrame

Reposted. Author: 太空狗. Updated: 2023-10-29 20:21:02

How can we get the name of the column holding the maximum value in each row of a PySpark DataFrame?

   Alice  Eleonora  Mike  Helen       MAX
0      2         7     8      6      Mike
1     11         5     9      4     Alice
2      6        15    12      3  Eleonora
3      5         3     7      8     Helen

I need something like this: a MAX column containing the name of the column with the maximum value in each row. I can already compute the maximum value itself; what I need is the column name.

Best Answer

You can chain conditions to find which column equals the maximum value:

import pyspark.sql.functions as psf

# Build a string of chained when() calls, one per column, then eval it:
# psf.when(psf.col('Alice') == psf.col('max_value'), psf.lit('Alice')).when(...)
cond = "psf.when" + ".when".join(
    ["(psf.col('" + c + "') == psf.col('max_value'), psf.lit('" + c + "'))" for c in df.columns]
)
df.withColumn("max_value", psf.greatest(*df.columns))\
    .withColumn("MAX", eval(cond))\
    .show()

+-----+--------+----+-----+---------+--------+
|Alice|Eleonora|Mike|Helen|max_value|     MAX|
+-----+--------+----+-----+---------+--------+
|    2|       7|   8|    6|        8|    Mike|
|   11|       5|   9|    4|       11|   Alice|
|    6|      15|  12|    3|       15|Eleonora|
|    5|       3|   7|    8|        8|   Helen|
+-----+--------+----+-----+---------+--------+

Or: explode and filter

from itertools import chain

# Explode each row into (column_name, value) pairs, then keep the pair whose value equals the max
df.withColumn("max_value", psf.greatest(*df.columns))\
    .select("*", psf.posexplode(psf.create_map(list(chain(*[(psf.lit(c), psf.col(c)) for c in df.columns])))))\
    .filter("max_value = value")\
    .select(df.columns + [psf.col("key").alias("MAX")])\
    .show()

Or: use a UDF on a map:

from pyspark.sql.types import StringType

# The UDF receives each row as a map {column_name: value} and returns the key of the largest value
argmax_udf = psf.udf(lambda m: max(m, key=m.get), StringType())
df.withColumn("map", psf.create_map(list(chain(*[(psf.lit(c), psf.col(c)) for c in df.columns]))))\
    .withColumn("MAX", argmax_udf("map"))\
    .drop("map")\
    .show()
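The core of that UDF is plain Python, so it can be checked without Spark: max over a dict with key=m.get compares the keys by their mapped values and returns the winning key.

```python
# One row of the example data as a dict; max picks the key with the largest value
row = {"Alice": 2, "Eleonora": 7, "Mike": 8, "Helen": 6}
print(max(row, key=row.get))  # Mike
```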

Or: use a UDF with arguments:

from pyspark.sql.types import StringType

def argmax(cols, *args):
    # Pair column names with their values and return the name of the (first) maximum
    return [c for c, v in zip(cols, args) if v == max(args)][0]

argmax_udf = lambda cols: psf.udf(lambda *args: argmax(cols, *args), StringType())
df.withColumn("MAX", argmax_udf(df.columns)(*df.columns))\
    .show()
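The argmax helper itself can also be exercised locally; note that on ties it returns the first matching column name, since the list comprehension preserves column order.

```python
def argmax(cols, *args):
    # Return the column name paired with the (first) maximum value
    return [c for c, v in zip(cols, args) if v == max(args)][0]

print(argmax(["Alice", "Eleonora", "Mike", "Helen"], 5, 3, 7, 8))  # Helen
print(argmax(["Alice", "Eleonora", "Mike", "Helen"], 9, 9, 1, 2))  # Alice (tie -> first)
```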

Regarding "python - How to get the name of the column with the maximum value in a PySpark DataFrame", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/46819405/
