
python - Pandas: check whether a complete string is contained in another Pandas DataFrame

Reposted. Author: 太空狗. Updated: 2023-10-30 02:53:20

I want to classify parts using a DataFrame.

Here is a simplified version that shows the problem:

import pandas as pd

# Part descriptions to classify
data = {'col1': ['engine', 'blue engine cover', 'spark plug',
                 'rear panel', 'black rear panel', 'blue engine']}
desc_df = pd.DataFrame(data=data)

# One column per category, listing the keywords that belong to it
catg = {'bodywork': ['engine cover', 'side panel', 'rear panel'],
        'underhood': ['engine', 'spark plug', 'oil filter'],
        'Glass': ['Windscreen', 'window', 'demister']}
catg_df = pd.DataFrame(data=catg)

catg_df


        Glass      bodywork   underhood
0  Windscreen  engine cover      engine
1      window    side panel  spark plug
2    demister    rear panel  oil filter

desc_df

                col1
0             engine
1  blue engine cover
2         spark plug
3         rear panel
4   black rear panel
5        blue engine

I would like to end up with:

                col1   Category
0             engine  underhood
1  blue engine cover   bodywork
2         spark plug  underhood
3         rear panel   bodywork
4   black rear panel   bodywork
5        blue engine  underhood

The closest I have come up with is:

d=catg_df.apply('|'.join).to_dict()

desc_df['Category'] = desc_df['col1'].apply(lambda x : ''.join([z if pd.Series(x).str.contains(y).values else '' for z,y in d.items()]))

But for "blue engine cover" this matches both "engine" and "engine cover", so the two categories get joined together:

desc_df

                col1           Category
0             engine          underhood
1  blue engine cover  bodyworkunderhood
2         spark plug          underhood
3         rear panel           bodywork
4   black rear panel           bodywork
5        blue engine          underhood

Is there an approach I can use so that, if it finds "engine cover" first, it classifies the row with that category and does not go on to match "engine"?
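To see why the attempt produces two categories for one row, it may help to look at the patterns it builds. The sketch below is not from the original post: it writes out by hand the `'|'.join` patterns for each category column and tests them against a single description with the standard `re` module instead of pandas.

```python
import re

# The alternation patterns the attempt builds, one per category column
# (written out by hand here for illustration).
patterns = {
    "bodywork": "engine cover|side panel|rear panel",
    "underhood": "engine|spark plug|oil filter",
    "Glass": "Windscreen|window|demister",
}

text = "blue engine cover"

# Each category is tested independently, so every category whose
# pattern matches anywhere in the text is reported.
hits = [name for name, pat in patterns.items() if re.search(pat, text)]
print(hits)  # ['bodywork', 'underhood']
```

Because "engine" is a substring of "engine cover", the underhood pattern also matches, and joining the hits gives "bodyworkunderhood".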

Best answer

One approach might be to use difflib to get the closest match, together with a lambda:

First, create a mapper:

from difflib import get_close_matches
mapper = {val:k for k, v in catg_df.to_dict('list').items() for val in v}
print(mapper)

So the mapper will be:

{'Windscreen': 'Glass',
'demister': 'Glass',
'engine': 'underhood',
'engine cover': 'bodywork',
'oil filter': 'underhood',
'rear panel': 'bodywork',
'side panel': 'bodywork',
'spark plug': 'underhood',
'window': 'Glass'}

Now use a lambda with difflib to find the closest value:

# avoid calling mapper.keys() in lambda 
keys = mapper.keys()
desc_df['Category'] = desc_df['col1'].apply(lambda row: mapper[get_close_matches(row, keys)[0]])
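One caveat about the one-liner above: `get_close_matches` returns an empty list when no key scores above its `cutoff` (0.6 by default), so indexing with `[0]` would raise an `IndexError` for a description that resembles none of the keywords. A minimal guarded sketch follows; the helper name `closest_category` and its `default` argument are assumptions made for illustration, and the mapper is written out by hand so the block is self-contained.

```python
from difflib import get_close_matches

# Same keyword -> category mapping as the mapper above, written out by hand.
mapper = {
    "engine cover": "bodywork", "side panel": "bodywork", "rear panel": "bodywork",
    "engine": "underhood", "spark plug": "underhood", "oil filter": "underhood",
    "Windscreen": "Glass", "window": "Glass", "demister": "Glass",
}
keys = list(mapper)

def closest_category(text, default=None, cutoff=0.6):
    """Hypothetical helper: category of the closest keyword, or `default` if none is close enough."""
    matches = get_close_matches(text, keys, n=1, cutoff=cutoff)
    return mapper[matches[0]] if matches else default

print(closest_category("blue engine cover"))  # bodywork
print(closest_category("xyz"))                # None
```

It could be applied with `desc_df['col1'].apply(closest_category)`; on the six sample rows it should give the same categories as the one-liner.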

Result:

                col1   Category
0             engine  underhood
1  blue engine cover   bodywork
2         spark plug  underhood
3         rear panel   bodywork
4   black rear panel   bodywork
5        blue engine  underhood
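difflib scores similarity rather than testing containment. If strict "the keyword appears in the description" matching is preferred, one alternative (not part of the original answer) is a single regular expression with the keywords sorted longest first, so that "engine cover" is tried before "engine". The sketch below uses only the standard library; the helper name `categorize` and its `default` argument are assumptions made for illustration.

```python
import re

# Same keyword -> category mapping as the mapper above, written out by hand.
mapper = {
    "engine cover": "bodywork", "side panel": "bodywork", "rear panel": "bodywork",
    "engine": "underhood", "spark plug": "underhood", "oil filter": "underhood",
    "Windscreen": "Glass", "window": "Glass", "demister": "Glass",
}

# Longest keywords first: at any given position the regex engine tries the
# alternatives in order, so 'engine cover' wins over 'engine'.
keywords = sorted(mapper, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keywords))

def categorize(text, default=None):
    """Hypothetical helper: category of the first keyword found in `text`."""
    match = pattern.search(text)
    return mapper[match.group(0)] if match else default

parts = ["engine", "blue engine cover", "spark plug",
         "rear panel", "black rear panel", "blue engine"]
for part in parts:
    print(part, "->", categorize(part))
```

With the DataFrame from the question this could be applied as `desc_df['col1'].apply(categorize)`. Note that the regex still returns the leftmost match, so a shorter keyword that starts earlier in the text would win over a longer one that starts later; matching is also case-sensitive as written.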

Regarding "python - Pandas: check whether a complete string is contained in another Pandas DataFrame", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49945187/
