
python - Fuzzy match columns and merge/join dataframes

Reposted. Author: 太空宇宙. Updated: 2023-11-03 19:41:50

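I am trying to merge two dataframes, each with multiple columns, based on matching values in one column of each. This code by @Erfan does a good job of fuzzy matching the target columns, but is there a way to carry over the remaining columns as well? https://stackoverflow.com/a/56315491/12802642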

Dataframes

import pandas as pd

df1 = pd.DataFrame({'Key': ['Apple Souce', 'Banana', 'Orange', 'Strawberry', 'John tabel']})
df2 = pd.DataFrame({'Key': ['Aple suce', 'Mango', 'Orag', 'Jon table', 'Straw', 'Bannanna', 'Berry'],
                    'Key23': ['1', '2', '3', '4', '5', '6', '7']})

The matching function from @Erfan, as given in the link above

from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match, based on Levenshtein distance
    limit is the amount of matches that will get returned, these are sorted high to low
    """
    # Candidate strings from the right table
    s = df_2[key2].tolist()

    # For each left key, collect the top `limit` (match, score) pairs
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m

    # Keep only matches scoring at or above the threshold, joined into one string
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1
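
To make the roles of limit and threshold concrete, here is a minimal sketch of the same two steps (rank the candidates, then filter by score) using only the standard library. It is not fuzzywuzzy: the helper name extract_sketch is hypothetical, and difflib.SequenceMatcher is swapped in as the scorer, so the scores will not necessarily equal those returned by process.extract.

```python
from difflib import SequenceMatcher

def extract_sketch(query, choices, limit=2):
    """Hypothetical stand-in for process.extract: returns up to `limit`
    (choice, score) pairs, best first, with scores on a 0-100 scale."""
    scored = [
        (c, round(100 * SequenceMatcher(None, query.lower(), c.lower()).ratio()))
        for c in choices
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

choices = ['Aple suce', 'Mango', 'Orag', 'Jon table', 'Straw', 'Bannanna', 'Berry']

# Step 1: rank the candidates and keep the top `limit`
candidates = extract_sketch('Apple Souce', choices, limit=2)

# Step 2: drop anything below the threshold (80 here, as in the call below)
kept = [name for name, score in candidates if score >= 80]
print(kept)
```

Note that limit is applied before threshold, so a low limit can hide a candidate that would otherwise have passed the threshold.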

Calling the function

df = fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80, limit=1)
df.sort_values(by='Key',ascending=True).reset_index()

Result

index  Key          matches
0      Apple Souce  Aple suce
1      Banana       Bannanna
2      John tabel
3      Orange
4      Strawberry   Straw

Desired result

index  Key          matches    Key23
0      Apple Souce  Aple suce  1
1      Banana       Bannanna   6
2      John tabel
3      Orange
4      Strawberry   Straw      5

Best answer

For anyone who needs this, here is the solution I came up with.
merge = pd.merge(df, df2, left_on=['matches'],right_on=['Key'],how='outer').fillna(0)
From there you can drop the unnecessary or duplicate columns to get a clean result, like this:
clean = merge.drop(['matches', 'Key_y'], axis=1)
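
With how='outer', the merge above also adds a row for every df2 key that matched nothing, and fillna(0) writes 0 into the string columns. If the goal is exactly the desired result shown in the question, a left merge may be closer. The following is a minimal sketch of that variant, not the answerer's method: the matches column is hard-coded from the result table above so that it runs without fuzzywuzzy, and the names right and result are illustrative.

```python
import pandas as pd

# Output of fuzzy_merge as shown in the question (hard-coded for this sketch)
df = pd.DataFrame({
    'Key': ['Apple Souce', 'Banana', 'John tabel', 'Orange', 'Strawberry'],
    'matches': ['Aple suce', 'Bannanna', '', '', 'Straw'],
})
df2 = pd.DataFrame({
    'Key': ['Aple suce', 'Mango', 'Orag', 'Jon table', 'Straw', 'Bannanna', 'Berry'],
    'Key23': ['1', '2', '3', '4', '5', '6', '7'],
})

# Rename the right-hand key so it lines up with 'matches' and does not
# collide with the left table's own 'Key' column
right = df2.rename(columns={'Key': 'matches'})

# A left merge keeps every row of df and only pulls in columns for matched rows
result = df.merge(right, on='matches', how='left').fillna('')
print(result)
```

Because the right-hand key is renamed before merging, no Key_x/Key_y suffixes appear and nothing has to be dropped afterwards. This assumes each row holds at most one match (limit=1); with several comma-joined matches per row the join key would need to be split first.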

Regarding python - Fuzzy match columns and merge/join dataframes, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60379947/
