
python - Pandas: a faster way to do a complex column transpose with the pivot function

Reposted. Author: 太空宇宙. Updated: 2023-11-04 04:10:25

In short, I need to convert the input DataFrame below into the output DataFrame shown after it.

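After several hours of piecing together answers from earlier Stack Overflow questions, I managed to transform the DataFrame, but because I used the pivot and apply methods, the conversion takes a long time on large DataFrames.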

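import numpy as np
import pandas as pd

# Sample data matching the input table below: each row holds comma-separated
# subjects and the corresponding scores in the same order.
df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                   "day": pd.Timestamp('20190529'),
                   "subject": ["math,english,economics", "math,economics",
                               "philosophy,english,business", "physics,sociology", "Math"],
                   "score": pd.Categorical(["68,62,49", "58,87", "28,32,46", "72,66", "93"]),
                   "Department": pd.Categorical(["Economics", "Computer Science", "Sociology", "Business", "Math"])})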


---Input DataFrame---

   id        day                      subject     score        Department
0   1 2019-05-29       math,english,economics  68,62,49         Economics
1   2 2019-05-29               math,economics     58,87  Computer Science
2   3 2019-05-29  philosophy,english,business  28,32,46         Sociology
3   4 2019-05-29            physics,sociology     72,66          Business
4   5 2019-05-29                         Math        93              Math

The desired output is shown below.

---Output DataFrame---

id         day        Department  Math  business  economics  english  math  philosophy  physics  sociology
 1  2019-05-29         Economics   NaN       NaN         49       62    68         NaN      NaN        NaN
 2  2019-05-29  Computer Science   NaN       NaN         87      NaN    58         NaN      NaN        NaN
 3  2019-05-29         Sociology   NaN        46        NaN       32   NaN          28      NaN        NaN
 4  2019-05-29          Business   NaN       NaN        NaN      NaN   NaN         NaN       72         66
 5  2019-05-29              Math    93       NaN        NaN      NaN   NaN         NaN      NaN        NaN

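My approach is:

  1. Split the subject and score columns on ",".
  2. Expand the resulting lists in the subject and score columns into pandas.Series so that each element gets its own row.
  3. Join the pandas.Series to make a new DataFrame.
  4. Pivot the new DataFrame created in step 3.
  5. Drop the subject and score columns from the original DataFrame.
  6. Join the DataFrames made in steps 4 and 5.

My code is as follows: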

# Step 1: split the comma-separated strings into lists.
df["subject"] = df["subject"].str.split(",")
df["score"] = df["score"].str.split(",")

# Step 2: expand each list into a Series per row, then stack into long format.
subject = df.apply(lambda x: pd.Series(x['subject']),axis=1).stack().reset_index(level=1, drop=True)
score = df.apply(lambda x: pd.Series(x['score']),axis=1).stack().reset_index(level=1, drop=True)
subject.name = 'subject'
score.name = 'score'

# Step 3: combine the two long Series and attach them to the remaining columns.
subject_score = pd.concat([subject, score],join='outer', axis=1)
pdf = df.drop('subject', axis=1).drop("score", axis=1).join(subject_score)

# Steps 4-6: pivot subjects into columns, then join back onto the original rows.
pivot = pdf.pivot(columns="subject",values="score")
concate_table = df.drop("subject",axis = 1).drop("score", axis=1)
output = concate_table.join(pivot)

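I only recently started learning pandas, and I'm sure this is not the best way to transpose the columns.

I would appreciate any advice on how to optimize this code.

Thank you in advance.

Best answer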

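I would define a custom function stack_str that uses expand=True to unpack a string column into a DataFrame, then calls stack and reset_index to turn it back into a Series.

Apply stack_str to the 2 string columns to make df1 with 2 columns.

Next, pivot df1 so that the subject values become the columns and the scores become the values. Finally, join the result back to df with the 2 string columns dropped.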

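def stack_str(x):
    # Split on commas into one column per piece, then stack into a long
    # Series indexed only by the original row label.
    s = x.str.split(',', expand=True).stack().reset_index(level=-1, drop=True)
    return s

# Apply to the original string columns (before any other splitting).
df1 = df[['subject', 'score']].apply(stack_str)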

Out[984]:
      subject score
0        math    68
0     english    62
0   economics    49
1        math    58
1   economics    87
2  philosophy    28
2     english    32
2    business    46
3     physics    72
3   sociology    66
4        Math    93
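
To see what each step of the chain inside stack_str does, here is a minimal sketch on a toy Series. The data and the variable names (s, wide, long_form, flat) are made up for illustration and are not from the question. The explicit dropna() is my addition: how stack() treats the padding values can differ between pandas versions, so the sketch removes them itself.

```python
import pandas as pd

# Toy data (hypothetical, not from the question).
s = pd.Series(["a,b,c", "d"], name="letters")

# One column per piece; shorter rows are padded with missing values.
wide = s.str.split(",", expand=True)

# Long format with a two-level index: (original row, piece position).
long_form = wide.stack()

# Drop the padding explicitly so the result does not depend on stack() defaults.
long_form = long_form.dropna()

# Discard the piece position and keep only the original row label.
flat = long_form.reset_index(level=-1, drop=True)

print(flat)
```

Because the original row label is repeated for every piece, the subject and score Series produced this way line up row by row, which is what lets the pivot work.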

df2 = df.drop(['subject', 'score'], axis=1).join(df1.pivot(columns='subject', values='score'))

Out[986]:
   id        day        Department Math business economics english math  \
0   1 2019-05-29         Economics  NaN      NaN        49      62   68
1   2 2019-05-29  Computer Science  NaN      NaN        87     NaN   58
2   3 2019-05-29         Sociology  NaN       46       NaN      32  NaN
3   4 2019-05-29          Business  NaN      NaN       NaN     NaN  NaN
4   5 2019-05-29              Math   93      NaN       NaN     NaN  NaN

  philosophy physics sociology
0        NaN     NaN       NaN
1        NaN     NaN       NaN
2         28     NaN       NaN
3        NaN      72        66
4        NaN     NaN       NaN
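
As a possible alternative, and not part of the answer above, the same reshaping can be sketched with Series.explode (available since pandas 0.25), which turns a column of lists into one row per element. The small DataFrame and the names long_df, wide and out are assumptions for illustration. Converting the scores to integers is also my choice; the original code keeps them as strings.

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's data.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "subject": ["math,english", "math", "physics,english"],
    "score": ["68,62", "58", "72,66"],
})

# Split each string into a list, then give every list element its own row.
# explode() repeats the original row label, so both Series stay aligned.
subject = df["subject"].str.split(",").explode()
score = df["score"].str.split(",").explode().astype(int)

# Build the long table from raw arrays to avoid aligning on duplicate labels.
long_df = pd.DataFrame(
    {"subject": subject.to_numpy(), "score": score.to_numpy()},
    index=subject.index,
)

# One column per subject, one row per original row label.
wide = long_df.pivot(columns="subject", values="score")

# Attach the subject columns to the remaining original columns.
out = df.drop(columns=["subject", "score"]).join(wide)

print(out)
```

Whether this is faster than the stack-based version depends on the data, so it is worth timing both on a realistic sample before switching.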

Regarding python - Pandas: a faster way to do a complex column transpose with the pivot function, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56370383/
