gpt4 book ai didi

python - Pandas :前 N 行,每组前 N 行,相当于 ROW_NUMBER OVER(PARTITION BY ... ORDER BY ...)

转载 作者:太空狗 更新时间:2023-10-30 01:57:43 26 4
gpt4 key购买 nike

Python中T-SQL中TOP函数的对应是什么?我希望将我的数据框过滤到前 50K 行。我在网上看过,找不到简单的例子。

最佳答案

更新: - 显示不同的 pandas 方法,包括:

每组前 N 行

带偏移量的前 N ​​行

等效于 SQL 聚合函数:

ROW_NUMBER()/RANK() OVER(PARTITION BY ... ORDER BY ...)

示例 DF:

df = pd.DataFrame({
'dep': np.random.choice(list('ABC'), 20),
'manager_id': np.random.randint(0, 10, 20),
'salary': np.random.randint(5000, 5006, 20)
})

------------------------ 原始DF -------------------- --

In [2]: df
Out[2]:
dep manager_id salary
0 B 5 5005
1 A 6 5001
2 C 8 5000
3 A 7 5000
4 B 0 5002
5 A 3 5003
6 A 2 5004
7 A 2 5004
8 C 3 5002
9 C 4 5001
10 A 9 5002
11 C 9 5000
12 B 8 5004
13 A 1 5003
14 C 7 5005
15 B 0 5002
16 B 2 5003
17 A 4 5000
18 B 2 5003
19 B 7 5003

---------------- 前 5 行(按原始索引排序)----------------

In [3]: df.head(5)
Out[3]:
dep manager_id salary
0 B 5 5005
1 A 6 5001
2 C 8 5000
3 A 7 5000
4 B 0 5002

--- 前 5 行(按 manager_id DESC,dep ASC 排序)----

In [4]: df.sort_values(by=['manager_id', 'dep'], ascending=[False,True]).head(5)
Out[4]:
dep manager_id salary
10 A 9 5002
11 C 9 5000
12 B 8 5004
2 C 8 5000
3 A 7 5000

--- 相当于 SELECT * FROM tab ORDER BY salary DESC LIMIT 5 OFFSET 3 ---

In [19]: df.nlargest(5+3, columns=['salary']).tail(5)
Out[19]:
dep manager_id salary
7 A 2 5004
12 B 8 5004
5 A 3 5003
13 A 1 5003
16 B 2 5003

----各部门薪水前2名(无重复)-----

--- SQL 的等价物:row_number() over(partition by DEP order by SALARY desc) ---

In [7]: (df.assign(rn=df.sort_values(['salary'], ascending=False)
...: .groupby(['dep'])
...: .cumcount() + 1)
...: .query('rn < 3')
...: .sort_values(['dep','rn'])
...: )
Out[7]:
dep manager_id salary rn
6 A 2 5004 1
7 A 2 5004 2
0 B 5 5005 1
12 B 8 5004 2
14 C 7 5005 1
8 C 3 5002 2

---每个部门的前2名薪水(使用“nlargest”)----

In [15]: df.loc[df.groupby('dep')['salary'].nlargest(2).reset_index()['level_1']]
Out[15]:
dep manager_id salary
6 A 2 5004
7 A 2 5004
0 B 5 5005
12 B 8 5004
14 C 7 5005
8 C 3 5002

---每个部门第二和第三高的工资---

In [16]: (df.assign(rn=df.sort_values(['salary'], ascending=False)
....: .groupby(['dep'])
....: .cumcount() + 1)
....: .query('rn >= 2 and rn <= 3')
....: .sort_values(['dep','rn'])
....: )
Out[16]:
dep manager_id salary rn
7 A 2 5004 2
13 A 1 5003 3
12 B 8 5004 2
18 B 2 5003 3
8 C 3 5002 2
9 C 4 5001 3

---各部门薪资前2名(重复)----

--- SQL 的等价物:rank() over(partition by DEP order by SALARY desc) ---

In [18]: (df.assign(rnk=df.groupby(['dep'])['salary']
....: .rank(method='min', ascending=False))
....: .query('rnk < 3')
....: .sort_values(['dep','rnk'])
....: )
Out[18]:
dep manager_id salary rnk
6 A 2 5004 1.0
7 A 2 5004 1.0
0 B 5 5005 1.0
12 B 8 5004 2.0
14 C 7 5005 1.0
8 C 3 5002 2.0

关于python - Pandas :前 N 行,每组前 N 行,相当于 ROW_NUMBER OVER(PARTITION BY ... ORDER BY ...),我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/36702904/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com