python - How to get the second-highest value from a column in PySpark?

Reposted by: 行者123 Updated: 2023-12-04 12:14:53

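I have a PySpark DataFrame, and I want to get the second-highest value of ORDERED_TIME (a datetime field in yyyy-mm-dd format) after grouping by two columns, CUSTOMER_ID and ADDRESS_ID.
A customer can have multiple orders associated with one address, and I want the second most recent order for each (customer, address) pair.
My approach was to build a window partitioned by CUSTOMER_ID and ADDRESS_ID and ordered by ORDERED_TIME: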

sorted_order_times = Window.partitionBy("CUSTOMER_ID", "ADDRESS_ID").orderBy(col('ORDERED_TIME').desc())

df2 = df2.withColumn("second_recent_order", (df2.select("ORDERED_TIME").collect()[1]).over(sorted_order_times))

However, I get the error ValueError: 'over' is not in list. Can anyone suggest the right way to solve this?
Let me know if any other information is needed.
Sample data

+-----------+----------+-------------------+
|USER_ID    |ADDRESS_ID|       ORDER DATE  | 
+-----------+----------+-------------------+
|        100| 1000     |2021-01-02         |
|        100| 1000     |2021-01-14         |
|        100| 1000     |2021-01-03         |
|        100| 1000     |2021-01-04         |
|        101| 2000     |2020-05-07         |
|        101| 2000     |2021-04-14         |
+-----------+----------+-------------------+

Expected output

+-----------+----------+-------------------+-------------------+
|USER_ID    |ADDRESS_ID|       ORDER DATE  |second_recent_order|
+-----------+----------+-------------------+-------------------+
|        100| 1000     |2021-01-02         |2021-01-04         |
|        100| 1000     |2021-01-14         |2021-01-04         |
|        100| 1000     |2021-01-03         |2021-01-04         |
|        100| 1000     |2021-01-04         |2021-01-04         |
|        101| 2000     |2020-05-07         |2020-05-07         |
|        101| 2000     |2021-04-14         |2020-05-07         |
+-----------+----------+-------------------+-------------------+
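
The expected values can be checked without Spark. The following is a minimal plain-Python sketch of the intended semantics only: group the sample rows by (user, address), sort each group's dates newest first, and take the element at index 1. The function name `second_most_recent` and the choice to return None for a pair with a single order are assumptions made for illustration; the question does not say what should happen in that case.

```python
from collections import defaultdict

# Sample rows from the question: (USER_ID, ADDRESS_ID, ORDER DATE).
rows = [
    (100, 1000, "2021-01-02"),
    (100, 1000, "2021-01-14"),
    (100, 1000, "2021-01-03"),
    (100, 1000, "2021-01-04"),
    (101, 2000, "2020-05-07"),
    (101, 2000, "2021-04-14"),
]


def second_most_recent(rows):
    """Map each (user, address) pair to its second most recent date, or None."""
    dates_by_pair = defaultdict(list)
    for user_id, address_id, order_date in rows:
        dates_by_pair[(user_id, address_id)].append(order_date)

    result = {}
    for pair, dates in dates_by_pair.items():
        # ISO yyyy-mm-dd strings sort chronologically, so no date parsing is needed.
        ordered = sorted(dates, reverse=True)
        # Assumption: a pair with only one order has no second most recent date.
        result[pair] = ordered[1] if len(ordered) > 1 else None
    return result


second_by_pair = second_most_recent(rows)

# Attach the per-pair value to every row, as in the expected output table.
for user_id, address_id, order_date in rows:
    print(user_id, address_id, order_date, second_by_pair[(user_id, address_id)])
```

Every row of a pair receives the same value, which is why a window function (rather than a plain aggregation that collapses rows) fits the expected output.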

Best answer

Here is another approach, using collect_list:

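import pyspark.sql.functions as F
from pyspark.sql import Window

# Window over each (CUSTOMER_ID, ADDRESS_ID) pair, newest order first.
# The unbounded frame lets every row see its whole partition.
sorted_order_times = (
    Window.partitionBy("CUSTOMER_ID", "ADDRESS_ID")
    .orderBy(F.col("ORDERED_TIME").desc())
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)

# collect_list gathers the ordered times into an array; index 1 is the second most recent.
df2 = df.withColumn(
    "second_recent_order",
    F.collect_list(F.col("ORDERED_TIME")).over(sorted_order_times)[1],
)
df2.show()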

Regarding python - How to get the second-highest value from a column in PySpark?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/69279865/
