python - Creating new columns by joining a table to itself multiple times

Reposted. Author: 行者123. Updated: 2023-12-01 06:25:19
I have a pandas DataFrame that lists the members of an extended family.

import pandas as pd

data = {'child':   ['Joe','Anna','Anna','Steffani','Bob','Rea','Dani','Dani','Selma','John','Kevin'],
        'parents': ['Steffani','Bob','Steffani','Dani','Selma','Anna','Selma','John','Kevin','-','Robert'],
       }
df = pd.DataFrame(data)

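From this DataFrame I need to build a new table by adding columns on the right that show how the people are related. Each added column holds the elders of the column to its left, so each column represents one generation further up. As a diagram, it would look something like this: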

child --> parents --> grandparents --> parents of grandparents --> grandparents of grandparents --> etc.

So the expected output DataFrame would look like this:

    child     parents   A         B       C       D (etc)
---------------------------------------------------------------------------------
0   Joe       Steffani  Dani      Selma   Kevin   <If still possible>
1   Joe       Steffani  Dani      John    -
2   Anna      Bob       Selma     Kevin   Robert
3   Anna      Steffani  Dani      Selma   Kevin
4   Anna      Steffani  Dani      John    -
5   Steffani  Dani      Selma     Kevin   Robert
6   Steffani  Dani      John      -       -
7   Bob       Selma     Kevin     Robert  -
8   Rea       Anna      Bob       Selma   Kevin
9   Rea       Anna      Steffani  Dani    Selma
10  Rea       Anna      Steffani  Dani    John
11  Dani      Selma     Kevin     Robert  -
12  Dani      John      -         -       -
13  Selma     Kevin     Robert    -       -
14  John      -         -         -       -
15  Kevin     Robert    -         -       -
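
As a cross-check on the table above, the same chains can be enumerated without pandas by walking a child-to-parents mapping recursively. This is only a sketch and not part of the original post: the names `pairs`, `parents_of` and `lineages` are hypothetical, and it assumes the data contains no cycles.

```python
from collections import defaultdict

# The same (child, parents) pairs as in the DataFrame above.
pairs = [('Joe', 'Steffani'), ('Anna', 'Bob'), ('Anna', 'Steffani'),
         ('Steffani', 'Dani'), ('Bob', 'Selma'), ('Rea', 'Anna'),
         ('Dani', 'Selma'), ('Dani', 'John'), ('Selma', 'Kevin'),
         ('John', '-'), ('Kevin', 'Robert')]

# Map each child to the list of its known parents; '-' means "unknown".
parents_of = defaultdict(list)
for child, parent in pairs:
    if parent != '-':
        parents_of[child].append(parent)

def lineages(person):
    """Return every chain person -> parent -> grandparent -> ... as a list."""
    elders = parents_of.get(person, [])
    if not elders:
        # No recorded parent: the chain ends here.
        return [[person]]
    return [[person] + rest for elder in elders for rest in lineages(elder)]

print(lineages('Joe'))
# [['Joe', 'Steffani', 'Dani', 'Selma', 'Kevin', 'Robert'],
#  ['Joe', 'Steffani', 'Dani', 'John']]
```

Each returned list corresponds to one row of the expected table, with missing trailing positions shown there as `-`.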

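At the moment I build the new table manually with pandas.merge. But I have to repeat the merge many times, until the last column has no elders left to add. For example: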

Step 1

df2 = pd.merge(df, df, left_on='parents', right_on='child', how='left').fillna('-')
df2 = df2[['child_x','parents_x','parents_y']]
df2.columns = ['child','parents','A']

Step 2

df3 = pd.merge(df2, df, left_on='A', right_on='child', how='left').fillna('-')
df3 = df3[['child_x','parents_x','A','parents_y']]
df3.columns = ['child','parents','A','B']

Step 3

df4 = pd.merge(df3, df, left_on='B', right_on='child', how='left').fillna('-')
df4 = df4[['child_x','parents_x','A','B','parents_y']]
df4.columns = ['child','parents','A','B','C']

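Step 4

If any value in column C still has an elder, write similar code again to add a sixth column, D.

Question:

My DataFrame is large (more than 10K data points), so how can I solve this without writing the code out step by step? I don't know how many steps it will take to build the final table.

Thanks in advance for your help.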

Best answer

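Consider chaining the merges with reduce, using merge's suffixes argument to handle the repeated column names, and then dropping the intermediate columns: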

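from functools import reduce

def proc_build(x, y):
    # Join x's newest 'parents' column to y's 'child' column.
    # The left suffix includes x's column count so that repeated merges
    # never produce duplicate column names (pandas 2.0+ raises MergeError).
    temp = (pd.merge(x, y, left_on='parents', right_on='child',
                     how='left', suffixes=(f'_{x.shape[1]}', ''))
              .fillna('-'))

    return temp

final_df = (reduce(proc_build, [df, df, df, df])
              .set_axis(['child', 'parents',
                         'child1', 'A',
                         'child2', 'B',
                         'child3', 'C'], axis='columns')
              .reindex(['child', 'parents'] + list('ABC'), axis='columns')
           )

print(final_df)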

#        child   parents         A       B       C
# 0        Joe  Steffani      Dani   Selma   Kevin
# 1        Joe  Steffani      Dani    John       -
# 2       Anna       Bob     Selma   Kevin  Robert
# 3       Anna  Steffani      Dani   Selma   Kevin
# 4       Anna  Steffani      Dani    John       -
# 5   Steffani      Dani     Selma   Kevin  Robert
# 6   Steffani      Dani      John       -       -
# 7        Bob     Selma     Kevin  Robert       -
# 8        Rea      Anna       Bob   Selma   Kevin
# 9        Rea      Anna  Steffani    Dani   Selma
# 10       Rea      Anna  Steffani    Dani    John
# 11      Dani     Selma     Kevin  Robert       -
# 12      Dani      John         -       -       -
# 13     Selma     Kevin    Robert       -       -
# 14      John         -         -       -       -
# 15     Kevin    Robert         -       -       -
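To extend the result by another column, such as D, add another df to reduce's iterable argument and append the matching items to the lists passed to set_axis and reindex, namely ['child4', 'D'] and list('ABCD'). There are ways to generate these items dynamically, but reduce can get expensive, so the number of levels is better stated explicitly.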

final_df = (reduce(proc_build, [df] * 5)
              .set_axis(['child', 'parents',
                         'child1', 'A',
                         'child2', 'B',
                         'child3', 'C',
                         'child4', 'D'], axis='columns')
              .reindex(['child', 'parents'] + list('ABCD'), axis='columns')
           )

print(final_df)

#        child   parents         A       B       C       D
# 0        Joe  Steffani      Dani   Selma   Kevin  Robert
# 1        Joe  Steffani      Dani    John       -       -
# 2       Anna       Bob     Selma   Kevin  Robert       -
# 3       Anna  Steffani      Dani   Selma   Kevin  Robert
# 4       Anna  Steffani      Dani    John       -       -
# 5   Steffani      Dani     Selma   Kevin  Robert       -
# 6   Steffani      Dani      John       -       -       -
# 7        Bob     Selma     Kevin  Robert       -       -
# 8        Rea      Anna       Bob   Selma   Kevin  Robert
# 9        Rea      Anna  Steffani    Dani   Selma   Kevin
# 10       Rea      Anna  Steffani    Dani    John       -
# 11      Dani     Selma     Kevin  Robert       -       -
# 12      Dani      John         -       -       -       -
# 13     Selma     Kevin    Robert       -       -       -
# 14      John         -         -       -       -       -
# 15     Kevin    Robert         -       -       -       -
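
One way to avoid hard-coding the number of levels is to keep merging until a newly added column would contain only the placeholder '-'. The sketch below is an illustration rather than part of the original answer. The function name `build_lineage`, the renamed lookup columns `key` and `ancestor`, and the cap of 26 single-letter column labels are all assumptions, and it swaps reduce for a plain loop so that it can stop early.

```python
import pandas as pd
from string import ascii_uppercase

data = {'child':   ['Joe', 'Anna', 'Anna', 'Steffani', 'Bob', 'Rea',
                    'Dani', 'Dani', 'Selma', 'John', 'Kevin'],
        'parents': ['Steffani', 'Bob', 'Steffani', 'Dani', 'Selma', 'Anna',
                    'Selma', 'John', 'Kevin', '-', 'Robert']}
df = pd.DataFrame(data)

def build_lineage(df, fill='-'):
    """Append ancestor columns A, B, C, ... until no row has a further elder."""
    # Lookup table with distinct names, so no merge suffixes are needed.
    lookup = df.rename(columns={'child': 'key', 'parents': 'ancestor'})
    out = df.copy()
    last = 'parents'
    for label in ascii_uppercase:  # assumption: at most 26 extra levels
        merged = out.merge(lookup, left_on=last, right_on='key', how='left')
        nxt = merged['ancestor'].fillna(fill)
        if (nxt == fill).all():
            break  # nobody in the last column has a recorded elder
        out = merged.drop(columns=['key', 'ancestor']).assign(**{label: nxt})
        last = label
    return out

result = build_lineage(df)
print(result)
```

On the sample data this should add columns A through E, because the longest chain (Rea up to Robert) spans seven generations. Each pass is still a full merge, so on a table with 10K+ rows the cost grows with the depth of the family tree.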

Regarding "python - Creating new columns by joining a table to itself multiple times", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60181545/
