regex - Using regular expressions to identify patterns in a pandas column and clean the data

Reposted. Author: 行者123. Updated: 2023-12-01 23:09:31

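I have a dataset containing company innovation data, and I would like to use a regular expression to extract only the licence entries (those of the form licence-xx-yy) from the licences/patents column into a new column.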

company  licences/patents
1        UX226, licence-pp-zz, licence-zz-pp, licence-xx-tt
2        VV3346E, SS345
3        licence-dd-zz
4        UT223, licence, ss
5        XBTYU, licence-tt-kk, licence-ss-tt
6        xc, zz
7        licence-xb-xz

Desired output:

company  licences/patents                                    licence
1        UX226, licence-pp-zz, licence-zz-pp, licence-xx-tt  licence-pp-zz, licence-zz-pp, licence-xx-tt
2        VV3346E, SS345
3        licence-dd-zz                                       licence-dd-zz
4        UT223, licence, ss
5        XBTYU, licence-tt-kk, licence-ss-tt                 licence-tt-kk, licence-ss-tt
6        xc, zz
7        licence-xb-xz                                       licence-xb-xz

Best answer

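You can try the following. Rows with no match end up as NaN, because the assignment aligns on the index and extractall only returns rows that matched: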

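# Extract every 'licence-xx-yy' token, then join each row's matches with ', '
# (raw string so that \w is not treated as a Python string escape)
df['licence'] = df['licences/patents'].str.extractall(r'(licence-\w{2}-\w{2})')\
    .unstack().apply(lambda x: ', '.join(x.dropna()), axis=1)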

Output:

   company                                   licences/patents                                      licence
0        1  UX226, licence-pp-zz, licence-zz-pp, licence-x...  licence-pp-zz, licence-zz-pp, licence-xx-tt
1        2                                     VV3346E, SS345                                          NaN
2        3                                      licence-dd-zz                                licence-dd-zz
3        4                                 UT223, licence, ss                                          NaN
4        5                XBTYU, licence-tt-kk, licence-ss-tt                 licence-tt-kk, licence-ss-tt
5        6                                             xc, zz                                          NaN
6        7                                      licence-xb-xz                                licence-xb-xz
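
If empty cells are preferred over NaN, as in the desired output in the question, one possible variant is to combine str.findall with str.join. This is only a sketch and not part of the original answer: the DataFrame construction below is an assumption, re-typed from the sample data in the question, and the variable name pattern is introduced here for illustration.

```python
import pandas as pd

# Sample data re-typed from the question (this construction is an assumption).
df = pd.DataFrame({
    "company": [1, 2, 3, 4, 5, 6, 7],
    "licences/patents": [
        "UX226, licence-pp-zz, licence-zz-pp, licence-xx-tt",
        "VV3346E, SS345",
        "licence-dd-zz",
        "UT223, licence, ss",
        "XBTYU, licence-tt-kk, licence-ss-tt",
        "xc, zz",
        "licence-xb-xz",
    ],
})

# 'licence-' followed by two word characters, a hyphen, and two more word characters.
# A bare 'licence' (company 4) does not match.
pattern = r"licence-\w{2}-\w{2}"

# str.findall returns a list of matches per row (an empty list when nothing matches);
# str.join then concatenates each list, so rows without a match become ''.
df["licence"] = df["licences/patents"].str.findall(pattern).str.join(", ")

print(df)
```

Compared with extractall followed by unstack, this keeps one value per original row throughout, so no index alignment is involved. Whether an empty string or NaN is the better marker for "no licence" depends on how the column is used afterwards.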

Regarding "regex - Using regular expressions to identify patterns in a pandas column and clean the data", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55779266/
