
python - Check whether a value is found within a group in a PySpark DataFrame

Reposted. Author: 行者123. Last updated: 2023-12-05 09:04:04

Suppose I have the following df:

df = spark.createDataFrame([
    ("a", "apple"),
    ("a", "pear"),
    ("b", "pear"),
    ("c", "carrot"),
    ("c", "apple"),
], ["id", "fruit"])

+---+------+
| id| fruit|
+---+------+
|  a| apple|
|  a|  pear|
|  b|  pear|
|  c|carrot|
|  c| apple|
+---+------+

I now want to create a boolean flag that is TRUE for every id that has at least one row with "pear" in the fruit column.

The desired output looks like this:

+---+------+-----+
| id| fruit| flag|
+---+------+-----+
|  a| apple| True|
|  a|  pear| True|
|  b|  pear| True|
|  c|carrot|False|
|  c| apple|False|
+---+------+-----+

For pandas I found a solution that uses groupby().transform(), but I don't understand how to translate it to PySpark.

Best answer

Use max as a window function:

df.selectExpr("*", "max(fruit = 'pear') over (partition by id) as flag").show()

+---+------+-----+
| id| fruit| flag|
+---+------+-----+
|  c|carrot|false|
|  c| apple|false|
|  b|  pear| true|
|  a| apple| true|
|  a|  pear| true|
+---+------+-----+
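
The reason max works here is that booleans are ordered with false below true, so the maximum of the comparison fruit = 'pear' within an id partition is true exactly when at least one row in that partition matches. As a rough illustration of that logic only, here is the same computation in plain Python rather than Spark; the names rows, matches_by_id, flag_by_id and flagged are made up for this sketch and are not part of the original answer.

```python
from collections import defaultdict

# Same rows as the example DataFrame, as plain (id, fruit) tuples.
rows = [
    ("a", "apple"),
    ("a", "pear"),
    ("b", "pear"),
    ("c", "carrot"),
    ("c", "apple"),
]

# Collect the per-row comparison results for each id ("partition by id").
matches_by_id = defaultdict(list)
for id_, fruit in rows:
    matches_by_id[id_].append(fruit == "pear")

# False < True, so max() over a group is True iff at least one row matched.
flag_by_id = {id_: max(matches) for id_, matches in matches_by_id.items()}

# Attach the group-level flag back to every row, as the window function does.
flagged = [(id_, fruit, flag_by_id[id_]) for id_, fruit in rows]
for row in flagged:
    print(row)
```

Unlike a groupBy aggregation, the window version keeps every input row and only adds the group-level result as a new column, which is why no join back to the original data is needed.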

If you need to check for several fruits, you can use the in operator. For example, to check for carrot or apple:

df.selectExpr("*", "max(fruit in ('carrot', 'apple')) over (partition by id) as flag").show()
+---+------+-----+
| id| fruit| flag|
+---+------+-----+
|  c|carrot| true|
|  c| apple| true|
|  b|  pear|false|
|  a| apple| true|
|  a|  pear| true|
+---+------+-----+

If you prefer Python syntax:

from pyspark.sql.window import Window
import pyspark.sql.functions as f

df.select(
    "*",
    f.max(
        f.col('fruit').isin(['carrot', 'apple'])
    ).over(Window.partitionBy('id')).alias('flag')
).show()
+---+------+-----+
| id| fruit| flag|
+---+------+-----+
|  c|carrot| true|
|  c| apple| true|
|  b|  pear|false|
|  a| apple| true|
|  a|  pear| true|
+---+------+-----+

Regarding "python - Check whether a value is found within a group in a PySpark DataFrame", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/69091252/
