python - PySpark: iterating inside small groups in a DataFrame

Reposted by: 太空宇宙. Updated: 2023-11-03 15:58:51


I am trying to understand how to operate within small groups in a PySpark DataFrame. Suppose I have a DataFrame with the following schema:

root
|-- first_id: string (nullable = true)
|-- second_id_struct: struct (nullable = true)
|    |-- s_id: string (nullable = true)
|    |-- s_id_2: int (nullable = true)
|-- depth_from: float (nullable = true)
|-- depth_to: float (nullable = true)
|-- total_depth: float (nullable = true)

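So the data might look like this:

I would like to:

1. Group the data by first_id.
2. Within each group, sort by s_id_2 in ascending order.
3. Append an extra column, layer, to either the struct or the root DataFrame, indicating the position of that s_id_2 within its group.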

For example:

first_id | second_id | second_id_order 
---------| --------- | ---------------
      A1 |   [B, 10] | 1  
---------| --------- | ---------------
      A1 |   [B, 14] | 2
---------| --------- | ---------------
      A1 |   [B, 22] | 3
---------| --------- | ---------------
      A5 |    [A, 1] | 1
---------| --------- | ---------------
      A5 |    [A, 7] | 2
---------| --------- | ---------------
      A7 |      null | 1
---------| --------- | ---------------

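After grouping, each first_id will have at most 4 second_id_struct values. How should I approach this kind of problem?

I am particularly interested in how to perform iterative operations within small groups (1-40 rows) of a DataFrame, where the order of columns within a group matters.

Thanks!

Best answer

Create a DataFrame: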

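# Array elements should share one type, so the numbers are written as strings
# (this matches the array<string> schema printed further down).
d = [
    {'first_id': 'A1', 'second_id': ['B', '10']},
    {'first_id': 'A1', 'second_id': ['B', '14']},
    {'first_id': 'A1', 'second_id': ['B', '22']},
    {'first_id': 'A5', 'second_id': ['A', '1']},
    {'first_id': 'A5', 'second_id': ['A', '7']},
]

# sqlContext is the SQLContext provided by the PySpark shell
df = sqlContext.createDataFrame(d)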

You can inspect the schema:

df.printSchema()

|-- first_id: string (nullable = true)
|-- second_id: array (nullable = true)
|    |-- element: string (containsNull = true)

df.show()
+--------+----------+
|first_id|second_id |
+--------+----------+
|      A1|   [B, 10]|
|      A1|   [B, 14]|
|      A1|   [B, 22]|
|      A5|    [A, 1]|
|      A5|    [A, 7]|
+--------+----------+

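You can then use dense_rank together with a Window specification to compute the order within each subgroup. This is the same as OVER (PARTITION BY ...) in SQL.

Introduction to window functions: Introducing Window Functions in Spark SQL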
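To make the ranking semantics concrete, here is a minimal plain-Python sketch of what dense_rank over a window partitioned by first_id computes. It is an illustration only: the helper name dense_rank_by_group and the sample rows are hypothetical and not part of PySpark, and the sketch works on a small in-memory list rather than a DataFrame.

```python
from collections import defaultdict


def dense_rank_by_group(rows, key, order):
    """Hypothetical helper: imitate dense_rank() OVER (PARTITION BY key ORDER BY order)."""
    # Partition the rows by the grouping key.
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    ranked = []
    for group_key in sorted(groups):
        members = groups[group_key]
        # Distinct ordering values in ascending order: equal values share a rank,
        # and ranks increase by one with no gaps.
        distinct = sorted({order(r) for r in members})
        rank_of = {value: i + 1 for i, value in enumerate(distinct)}
        for row in sorted(members, key=order):
            ranked.append((*row, rank_of[order(row)]))
    return ranked


# Made-up sample rows: (first_id, s_id, s_id_2). The last A5 row is a deliberate tie.
rows = [
    ("A1", "B", 14), ("A1", "B", 10), ("A1", "B", 22),
    ("A5", "A", 7), ("A5", "A", 1), ("A5", "A", 7),
]
result = dense_rank_by_group(rows, key=lambda r: r[0], order=lambda r: r[2])
for row in result:
    print(row)
```

Note that the two tied A5 rows receive the same rank. If every row in a group needs its own position even when the ordering values tie, Spark's row_number window function assigns distinct consecutive numbers instead.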

Here is the code:

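from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

# Window spec: partition by first_id, order by the second array element.
# The element is a string, so cast it to int to get numeric ordering.
windowSpec = Window.partitionBy('first_id').orderBy(df.second_id[1].cast('int'))
# Apply dense_rank over the window spec
df.select(df.first_id, df.second_id, dense_rank().over(windowSpec).alias("second_id_order")).show()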

Result:

+--------+---------+---------------+
|first_id|second_id|second_id_order|
+--------+---------+---------------+
|      A1|  [B, 10]|              1|
|      A1|  [B, 14]|              2|
|      A1|  [B, 22]|              3|
|      A5|   [A, 1]|              1|
|      A5|   [A, 7]|              2|
+--------+---------+---------------+

Regarding python - PySpark: iterating inside small groups in a DataFrame, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40521218/
