python - How to call a function in pandas

Reposted. Author: 太空宇宙. Updated: 2023-11-04 04:10:46

I have the following DataFrame in pandas:

code  job_descr          job_type
123   sales executive    nan
124   data scientist     nan
145   marketing manager  nan
132   finance            nan
144   data analyst       nan

I want to classify job_descr into job_type as follows:

sales : Sales
marketing : Marketing
finance : Finance
data science : Analytics
analyst : Analytics

This is what I am doing in pandas:

def job_type_redifine(column_name):
    if column_name.str.contains('sales'):
        return 'Sales'
    elif column_name.str.contains('marketing'):
        return 'Marketing'
    elif column_name.str.contains('data science|data scientist|analyst|machine learning'):
        return 'Analytics'
    else:
        return 'Others'


final_df['job_type'] = final_df.apply(lambda row:
                                      job_type_redifine(row['job_descr']), axis=1)

Desired DataFrame:

code  job_descr          job_type
123   sales executive    Sales
124   data scientist     Analytics
145   marketing manager  Marketing
132   finance            Finance
144   data analyst       Analytics

Best answer

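The first solution combines numpy.select with Series.str.contains. Its advantage is that it copes with missing values (pass na=False to str.contains so that NaN rows become False in the masks), but it is slower: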

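import numpy as np

# One boolean mask per category; the masks are checked in order by np.select
m1 = final_df['job_descr'].str.contains('sales')
m2 = final_df['job_descr'].str.contains('marketing')
m3 = final_df['job_descr'].str.contains('data science|data scientist|analyst|machine learning')

# Rows that match none of the masks get the default label
final_df['job_type'] = np.select([m1, m2, m3],
                                 ['Sales', 'Marketing', 'Analytics'],
                                 default='Others')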

print(final_df)
   code          job_descr   job_type
0   123    sales executive      Sales
1   124     data scientist  Analytics
2   145  marketing manager  Marketing
3   132            finance     Others
4   144       data analyst  Analytics
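
To make the missing-value behaviour concrete, here is a minimal sketch of the same approach on a small frame that contains a NaN description. The sample data and the name df are assumptions for illustration, not part of the original answer. Passing na=False to str.contains keeps every mask strictly boolean, which is what numpy.select expects.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing description
df = pd.DataFrame({
    "code": [123, 124, 145, 132],
    "job_descr": ["sales executive", "data scientist", np.nan, "finance"],
})

# na=False maps the NaN row to False instead of NaN, so each mask is boolean
m1 = df["job_descr"].str.contains("sales", na=False)
m2 = df["job_descr"].str.contains("marketing", na=False)
m3 = df["job_descr"].str.contains(
    "data science|data scientist|analyst|machine learning", na=False
)

# The first matching condition wins; unmatched rows get the default label
df["job_type"] = np.select(
    [m1, m2, m3], ["Sales", "Marketing", "Analytics"], default="Others"
)
print(df)
```

Here the NaN row and the "finance" row both end up as "Others", because neither matches any of the three masks.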

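The second solution uses Series.apply and tests for matches with the in operator. It loops over every value, but it is still faster because pandas text functions are slow. The drawback is that the last condition, with its many or clauses, is somewhat complicated: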

def job_type_redifine(column_name):
    if 'sales' in column_name:
        return 'Sales'
    elif 'marketing' in column_name:
        return 'Marketing'
    elif ('data science' in column_name or 'data scientist' in column_name
          or 'analyst' in column_name or 'machine learning' in column_name):
        return 'Analytics'
    else:
        return 'Others'


final_df['job_type'] = final_df['job_descr'].apply(job_type_redifine)
print(final_df)
   code          job_descr   job_type
0   123    sales executive      Sales
1   124     data scientist  Analytics
2   145  marketing manager  Marketing
3   132            finance     Others
4   144       data analyst  Analytics
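
One possible way to tame the long or chain is to move the keywords into a rule table and test them with any(). The sketch below is an illustration, not the original answer's code: RULES, classify_job and df are hypothetical names, the "finance" rule is taken from the mapping in the question, and the isinstance guard is an added assumption so that NaN values fall back to the default instead of raising a TypeError.

```python
import pandas as pd

# Hypothetical rule table: (keywords, label), checked in order; first match wins
RULES = [
    (("sales",), "Sales"),
    (("marketing",), "Marketing"),
    (("finance",), "Finance"),
    (("data science", "data scientist", "analyst", "machine learning"), "Analytics"),
]


def classify_job(description, default="Others"):
    # Non-string values such as NaN cannot be searched with `in`
    if not isinstance(description, str):
        return default
    for keywords, label in RULES:
        if any(keyword in description for keyword in keywords):
            return label
    return default


df = pd.DataFrame({
    "job_descr": ["sales executive", "data scientist", "marketing manager",
                  "finance", "data analyst"],
})
df["job_type"] = df["job_descr"].apply(classify_job)
print(df)
```

With the extra "finance" rule, this sketch reproduces the job_type column of the desired DataFrame from the question. Adding a category then means adding one tuple to RULES rather than another elif branch.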

Performance:

#[5000 rows x 3 columns]
final_df = pd.concat([final_df] * 1000, ignore_index=True)

In [13]: %%timeit
...: m1 = final_df['job_descr'].str.contains('sales')
...: m2 = final_df['job_descr'].str.contains('marketing')
...: m3 = final_df['job_descr'].str.contains('data science|data scientist|analyst|machine learning')
...:
...: final_df['job_type'] = np.select([m1, m2, m3], ['Sales','Marketing','Analytics'], default='Others')
...:
12.1 ms ± 611 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [14]: %%timeit
...: final_df['job_type1'] = final_df['job_descr'].apply(job_type_redifine)
...:
1.95 ms ± 57.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Regarding "python - How to call a function in pandas", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56307388/
