gpt4 book ai didi

python - 我们如何为 Pyspark Dataframe 列中最后一次出现的值设置标志

转载 作者:行者123 更新时间:2023-12-03 16:35:18 25 4
gpt4 key购买 nike

要求:这里当最后一次出现值为 1 的忠诚时,则将标志设置为 1,否则为 0

输入:

+-----------+----------+----------+-------+-----+---------+-------+---+
|consumer_id|product_id| TRX_ID|pattern|loyal| trx_date|row_num| mx|
+-----------+----------+----------+-------+-----+---------+-------+---+
| 11| 1|1152397078| VVVVM| 1| 3/5/2020| 1| 5|
| 11| 1|1152944770| VVVVV| 1| 3/6/2020| 2| 5|
| 11| 1|1153856408| VVVVV| 1|3/15/2020| 3| 5|
| 11| 2|1155884040| MVVVV| 1| 4/2/2020| 4| 5|
| 11| 2|1156854301| MMVVV| 0|4/17/2020| 5| 5|
| 12| 1|1156854302| VVVVM| 1| 3/6/2020| 1| 3|
| 12| 1|1156854303| VVVVV| 1| 3/7/2020| 2| 3|
| 12| 2|1156854304| MVVVV| 1|3/16/2020| 3| 3|
+-----------+----------+----------+-------+-----+---------+-------+---+

df = spark.createDataFrame(
[('11','1','1152397078','VVVVM',1,'3/5/2020',1,5),
('11','1','1152944770','VVVVV',1,'3/6/2020',2,5),
('11','1','1153856408','VVVVV',1,'3/15/2020',3,5),
('11','2','1155884040','MVVVV',1,'4/2/2020',4,5),
('11','2','1156854301','MMVVV',0,'4/17/2020',5,5),
('12','1','1156854302','VVVVM',1,'3/6/2020',1,3),
('12','1','1156854303','VVVVV',1,'3/7/2020',2,3),
('12','2','1156854304','MVVVV',1,'3/16/2020',3,3)
]
,["consumer_id","product_id","TRX_ID","pattern","loyal","trx_date","row_num","mx"])
df.show()

输出要求:

注意:这里的 Flag 只检查最后一个忠诚值是否包含 1 并设置标志。
+-----------+----------+----------+-------+-----+---------+-------+---+----+
|consumer_id|product_id| TRX_ID|pattern|loyal| trx_date|row_num| mx|Flag|
+-----------+----------+----------+-------+-----+---------+-------+---+----+
| 11| 1|1152397078| VVVVM| 1| 3/5/2020| 1| 5| 0|
| 11| 1|1152944770| VVVVV| 1| 3/6/2020| 2| 5| 0|
| 11| 1|1153856408| VVVVV| 1|3/15/2020| 3| 5| 0|
| 11| 2|1155884040| MVVVV| 1| 4/2/2020| 4| 5| 1|
| 11| 2|1156854301| MMVVV| 0|4/17/2020| 5| 5| 0|
| 12| 1|1156854302| VVVVM| 1| 3/6/2020| 1| 3| 0|
| 12| 1|1156854303| VVVVV| 1| 3/7/2020| 2| 3| 0|
| 12| 2|1156854304| MVVVV| 1|3/16/2020| 3| 3| 1|
+-----------+----------+----------+-------+-----+---------+-------+---+----+

我试过的:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w2 = Window().partitionBy("consumer_id").orderBy('row_num')
df = spark.sql("""select * from inter_table""")
df = df.withColumn("Flag",F.when(F.last(F.col('loyal') == 1).over(w),1).otherwise(0))

这里有两种情况:

1. 值 1 前加 0(供您引用 row_num 4 for consumer_id 11)

2. 没有前面的值 1(供您引用 row_num 3 for consumer_id 12)

最佳答案

添加Murtaza的答案

我们可以添加一个新列,它将检查您的第二个场景是否为前面的 null

window = Window.partitionBy('Consumer_id').orderBy('row_num')
df.withColumn('Flag',f.when((f.col('loyal')==1)
& ((f.lead(f.col('loyal')).over(window)==0)
| (f.lead(f.col('loyal')).over(window).isNull())), f.lit('1')).otherwise(f.lit('0'))).show()

+-----------+----------+----------+-------+-----+---------+-------+---+----+
|consumer_id|product_id| TRX_ID|pattern|loyal| trx_date|row_num| mx|Flag|
+-----------+----------+----------+-------+-----+---------+-------+---+----+
| 11| 1|1152397078| VVVVM| 1| 3/5/2020| 1| 5| 0|
| 11| 1|1152944770| VVVVV| 1| 3/6/2020| 2| 5| 0|
| 11| 1|1153856408| VVVVV| 1|3/15/2020| 3| 5| 0|
| 11| 2|1155884040| MVVVV| 1| 4/2/2020| 4| 5| 1|
| 11| 2|1156854300| MMVVV| 0|4/17/2020| 5| 5| 0|
| 12| 1|1156854300| VVVVM| 1| 3/6/2020| 1| 4| 0|
| 12| 1|1156854300| VVVVV| 1| 3/7/2020| 2| 4| 0|
| 12| 2|1156854300| MVVVV| 1|3/16/2020| 3| 4| 0|
| 12| 1|1156854300| MVVVV| 1| 4/3/2020| 4| 4| 1|
+-----------+----------+----------+-------+-----+---------+-------+---+----+

关于python - 我们如何为 Pyspark Dataframe 列中最后一次出现的值设置标志,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/61717683/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com