
python - pyspark : How to apply to a dataframe value another value depending on date in another dataframe


My first dataframe, df, contains a start_date and a value; my second dataframe, df_v, contains only dates.

My df:

+-------------------+-----+
|         start_date|value|
+-------------------+-----+
|2019-03-17 00:00:00|   35|
|2019-05-20 00:00:00|   40|
|2019-06-03 00:00:00|   10|
|2019-07-01 00:00:00|   12|
+-------------------+-----+

My df_v:

+-------------------+
|               date|
+-------------------+
|2019-02-01 00:00:00|
|2019-04-10 00:00:00|
|2019-06-14 00:00:00|
+-------------------+

What I want is a new df_v, where each date carries the sum of the values of all intervals that started on or before it:

+-------------------+-------------+
|               date|      v_value|
+-------------------+-------------+
|2019-02-01 00:00:00|            0|
|2019-04-10 00:00:00|    (0+35) 35|
|2019-06-14 00:00:00|(35+40+10) 85|
+-------------------+-------------+
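For reference, here is a minimal sketch that builds the two example frames above, assuming an active SparkSession bound to a variable named spark (the variable name and the to_timestamp parsing are my additions, not part of the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.getOrCreate()

# df: one row per interval start, with its value
df = spark.createDataFrame(
    [("2019-03-17 00:00:00", 35),
     ("2019-05-20 00:00:00", 40),
     ("2019-06-03 00:00:00", 10),
     ("2019-07-01 00:00:00", 12)],
    ["start_date", "value"],
).withColumn("start_date", to_timestamp("start_date"))

# df_v: the dates that need a cumulative v_value
df_v = spark.createDataFrame(
    [("2019-02-01 00:00:00",),
     ("2019-04-10 00:00:00",),
     ("2019-06-14 00:00:00",)],
    ["date"],
).withColumn("date", to_timestamp("date"))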

I tried to make something like this work:

from pyspark.sql import functions as F
from pyspark.sql.functions import lead
from pyspark.sql.window import Window

# Attach each row's next start_date so every row describes an interval
df = df.withColumn("lead", lead(F.col("start_date"), 1).over(Window.orderBy("start_date")))

# Collect both frames to the driver and compare row by row
for r_v in df_v.rdd.collect():
    for r in df.rdd.collect():
        if (r_v.date >= r.start_date) and (r_v.date < r.lead):
            df_v = df_v.withColumn('v_value',
            ...

Best answer

This can be done with a join followed by an aggregation.

from pyspark.sql.functions import sum, when

# Join: attach every interval that started on or before each date
joined_df = df_v.join(df, df.start_date <= df_v.date, 'left')
joined_df.show()  # View the joined result

# Aggregation: sum the matched values per date, treating nulls as 0
joined_df \
    .groupBy(joined_df.date) \
    .agg(sum(when(joined_df.value.isNull(), 0).otherwise(joined_df.value)).alias('val')) \
    .show()
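On the sample data this yields val = 0, 35 and 85 for the three dates, matching the desired v_value column above. As a style variation (my own sketch, not the answerer's code), the when/otherwise null handling can be swapped for coalesce, and sum can be aliased to avoid shadowing Python's builtin:

from pyspark.sql.functions import coalesce, lit, sum as sum_

df_v.join(df, df.start_date <= df_v.date, 'left') \
    .groupBy(df_v.date) \
    .agg(sum_(coalesce(df.value, lit(0))).alias('v_value')) \
    .orderBy(df_v.date) \
    .show()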

Regarding "python - pyspark: How to apply to a dataframe value another value depending on date in another dataframe", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/58248774/
