
python - Pandas - Improving the performance of the apply method

Reposted. Author: 行者123. Updated: 2023-12-03 20:06:48

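I have a scenario in which I need to transform the values of a particular column based on the value in another column of the same row and on the values in another DataFrame.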

Example:

print(parent_df)
     school   location modifed_date
0  school_1  New Delhi   2020-04-06
1  school_2    Kolkata   2020-04-06
2  school_3  Bengaluru   2020-04-06
3  school_4     Mumbai   2020-04-06
4  school_5    Chennai   2020-04-06

print(location_df)
      school   location
0  school_10  New Delhi
1  school_20    Kolkata
2  school_30  Bengaluru
3  school_40     Mumbai
4  school_50    Chennai

For this use case, I need to transform the school names in parent_df based on the location column of the same DataFrame and the location values present in location_df.

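To implement this transformation, I wrote the following method.
def transform_school_name(row, location_df):
    # Rows of location_df whose location matches this row's location
    name_alias = location_df[location_df['location'] == row['location']]
    if len(name_alias) > 0:
        # Use the school name from the matched row
        return name_alias.school.iloc[0]
    else:
        # No match: keep the original school name
        return row['school']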

This is how I call the method:
parent_df['school'] = parent_df.apply(UtilityMethods.transform_school_name, args=(self.location_df,), axis=1)

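The problem is that for 46K records the whole transformation takes about 2 minutes, which is too slow. How can I improve the performance of this solution?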

Edited

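Below is the actual scenario I am dealing with, where a small transformation has to be applied before the values in the original column can be replaced. I am not sure whether this can be done inside the replace() method mentioned in one of the answers below.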
print(parent_df)
     school             location modifed_date     type
0  school_1  _pre_New Delhi_post   2020-04-06     Govt
1  school_2    _pre_Kolkata_post   2020-04-06  Private
2  school_3  _pre_Bengaluru_post   2020-04-06  Private
3  school_4     _pre_Mumbai_post   2020-04-06     Govt
4  school_5    _pre_Chennai_post   2020-04-06  Private

print(location_df)
      school   location     type
0  school_10  New Delhi     Govt
1  school_20    Kolkata  Private
2  school_30  Bengaluru  Private

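Custom method code:
def transform_school_name(row, location_df):
    # '_pre_New Delhi_post'.split('_') gives ['', 'pre', 'New Delhi', 'post'],
    # so the location itself sits at index 2
    location_values = row['location'].split('_')
    name_alias = location_df[location_df['location'] == location_values[2]]
    # Keep only aliases whose type matches this row's type
    name_alias = name_alias[name_alias['type'] == row['type']]
    if len(name_alias) > 0:
        # Use the school name from the matched row
        return name_alias.school.iloc[0]
    else:
        return row['school']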

This is the actual scenario I need to handle, so the replace() method does not help.
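
For this edited scenario, one vectorized option is to extract the bare location with string methods and then join on both location and type instead of calling apply row by row. The following is only a minimal sketch under stated assumptions: the sample frames are rebuilt from the printed data above, the location strings always have the form `_pre_<location>_post`, and each (location, type) pair appears at most once in location_df. The names `keys`, `matched` and `alias` are introduced here for illustration.

```python
import pandas as pd

# Sample data rebuilt from the printed frames in the question.
parent_df = pd.DataFrame({
    "school": ["school_1", "school_2", "school_3", "school_4", "school_5"],
    "location": ["_pre_New Delhi_post", "_pre_Kolkata_post", "_pre_Bengaluru_post",
                 "_pre_Mumbai_post", "_pre_Chennai_post"],
    "type": ["Govt", "Private", "Private", "Govt", "Private"],
})
location_df = pd.DataFrame({
    "school": ["school_10", "school_20", "school_30"],
    "location": ["New Delhi", "Kolkata", "Bengaluru"],
    "type": ["Govt", "Private", "Private"],
})

# "_pre_New Delhi_post".split("_") -> ["", "pre", "New Delhi", "post"],
# so the bare location is element 2 (assumes this exact string layout).
keys = pd.DataFrame({
    "location": parent_df["location"].str.split("_").str[2],
    "type": parent_df["type"],
})

# A left merge keeps one output row per parent row, in the original order.
# validate="many_to_one" raises if (location, type) is not unique in location_df.
matched = keys.merge(location_df, on=["location", "type"],
                     how="left", validate="many_to_one")

# Re-attach the parent index, then fall back to the original name where no alias matched.
alias = pd.Series(matched["school"].to_numpy(), index=parent_df.index)
parent_df["school"] = alias.fillna(parent_df["school"])

print(parent_df)
```

The merge does the lookup for all rows at once, which is usually much faster than a Python-level function call per row; the actual speed-up on 46K records would need to be measured.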

Best answer

You can use map/replace:

parent_df['school'] = parent_df.location.replace(location_df.set_index('location')['school'])

Output:
      school   location modifed_date
0  school_10  New Delhi   2020-04-06
1  school_20    Kolkata   2020-04-06
2  school_30  Bengaluru   2020-04-06
3  school_40     Mumbai   2020-04-06
4  school_50    Chennai   2020-04-06
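
Since replace() leaves values without a match unchanged, a location that is missing from location_df would end up in the school column as the location string itself. If the original school name should be kept in that case, as in the question's apply version, the map variant with a fallback may be a better fit. The following is a minimal sketch, assuming locations are unique in location_df; the sample data is a reduced version of the frames above, and "Pune" is a made-up location added only to show the no-match case.

```python
import pandas as pd

parent_df = pd.DataFrame({
    "school": ["school_1", "school_2", "school_3"],
    # "Pune" is a hypothetical location with no entry in location_df.
    "location": ["New Delhi", "Kolkata", "Pune"],
})
location_df = pd.DataFrame({
    "school": ["school_10", "school_20"],
    "location": ["New Delhi", "Kolkata"],
})

# Build a location -> school lookup (assumes each location appears only once).
lookup = location_df.set_index("location")["school"]

# map() yields NaN where the location is not in the lookup;
# fillna() then restores the original school name for those rows.
parent_df["school"] = parent_df["location"].map(lookup).fillna(parent_df["school"])

print(parent_df)
```

Both map and replace perform the lookup on the whole column at once rather than calling a Python function per row, which is where the speed-up over apply with axis=1 comes from.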

Regarding python - Pandas - improving the performance of the apply method, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61890042/
