
python - Similarity between 2 dataframe columns

Reposted. Author: 行者123. Updated: 2023-12-01 09:24:19

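I have two dataframes, each with a column named "Song". Sometimes, however, the same song is spelled differently in the two frames. How can I use difflib (or something similar) to get one dataframe's spelling of a song into a new column of the other dataframe?

For example: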

Dataframe1

Song            Artist
like a virgi    madonna


Dataframe2

Song             Rank
like a virgin    2


Result

Song             Artist     SongAlt
like a virgin    Madonna    like a virgi

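Best answer

Step 1: Merge everything that can be merged exactly

The session below assumes pandas has been imported as pd and that difflib has been imported.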

In [67]: df1
Out[67]:
           Song    Artist
0        mysong  myartist
1  like a virgi   madonna

In [68]: df2
Out[68]:
            Song  Rank
0         mysong     1
1  like a virgin     2

In [69]: merged = pd.merge(df1, df2, on='Song')

In [70]: merged
Out[70]:
     Song    Artist  Rank
0  mysong  myartist     1
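
The session above never shows how df1 and df2 were built. A minimal sketch that recreates them, assuming they are plain two-column frames holding the values printed above, could look like this:

```python
import pandas as pd

# Sample data copied from the printed frames; the construction itself is an assumption.
df1 = pd.DataFrame({"Song": ["mysong", "like a virgi"],
                    "Artist": ["myartist", "madonna"]})
df2 = pd.DataFrame({"Song": ["mysong", "like a virgin"],
                    "Rank": [1, 2]})

# pd.merge defaults to an inner join, so only rows whose 'Song'
# values are exactly equal in both frames survive.
merged = pd.merge(df1, df2, on="Song")
print(merged)
```

Because the join is exact, "like a virgi" and "like a virgin" do not pair up here, which is why the later steps are needed.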

Step 2: Find the rows that did not merge

In [71]: unmerged = df2[~df2.isin(merged)].dropna()

In [72]: unmerged
Out[72]:
            Song  Rank
1  like a virgin   2.0
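
One caveat on the expression above: when DataFrame.isin is given another DataFrame, it compares cell by cell with both index and column labels aligned. It works here because the matched row happens to carry index label 0 in both df2 and merged. Masking also fills the dropped cells with NaN, which is why Rank prints as 2.0. A possible alternative that does not depend on index alignment, sketched with the same sample data, is a left merge with indicator=True:

```python
import pandas as pd

# Sample data copied from the session above.
df1 = pd.DataFrame({"Song": ["mysong", "like a virgi"],
                    "Artist": ["myartist", "madonna"]})
df2 = pd.DataFrame({"Song": ["mysong", "like a virgin"],
                    "Rank": [1, 2]})

# Left-join df2 against df1's distinct song titles; the '_merge' column
# records whether each df2 row found an exact partner in df1.
flagged = df2.merge(df1[["Song"]].drop_duplicates(),
                    on="Song", how="left", indicator=True)

# Keep only the df2 rows that had no exact match.
unmerged = flagged.loc[flagged["_merge"] == "left_only", ["Song", "Rank"]]
print(unmerged)
```

With this variant Rank stays an integer column, so later printed output would show 2 rather than 2.0.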

Step 3: Use difflib's get_close_matches to find the closest match

In [73]: songs = list(df1['Song'].unique())

In [74]: def closest(a):
    ...:     try:
    ...:         return difflib.get_close_matches(a, songs)[0]
    ...:     except IndexError:
    ...:         return "Not Found"

In [75]: unmerged['closest_song'] = unmerged.apply(lambda row: closest(row['Song']), axis=1)

In [76]: unmerged
Out[76]:
            Song  Rank  closest_song
1  like a virgin   2.0  like a virgi
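
get_close_matches also accepts n (maximum number of results, default 3) and cutoff (minimum similarity between 0 and 1, default 0.6). The sketch below is a variant of closest that exposes the cutoff and returns None instead of the "Not Found" string; the candidates and cutoff parameters and the test title "yesterday" are illustrative assumptions, not part of the original answer.

```python
import difflib

# Candidate titles taken from df1 in the session above.
songs = ["mysong", "like a virgi"]


def closest(title, candidates, cutoff=0.6):
    """Return the single best match for title, or None if nothing reaches cutoff."""
    # n=1 asks for only the top-scoring candidate.
    matches = difflib.get_close_matches(title, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(closest("like a virgin", songs))  # near-identical spelling
print(closest("yesterday", songs))      # unrelated title
```

Raising cutoff makes the matching stricter, which may help avoid pairing unrelated titles on larger song lists.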

Step 4: Compute a similarity score if needed

In [77]: def similar(a, b):
    ...:     return difflib.SequenceMatcher(None, a, b).ratio()

In [78]: unmerged['Similarity'] = unmerged.apply(lambda row: similar(row['closest_song'], row['Song']), axis=1)

In [79]: unmerged
Out[79]:
            Song  Rank  closest_song  Similarity
1  like a virgin   2.0  like a virgi        0.96
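
The 0.96 above follows from how SequenceMatcher.ratio() is defined: 2 * M / T, where M is the number of matched characters and T is the combined length of both strings. Here M = 12 and T = 12 + 13 = 25, so the score is 24 / 25 = 0.96. A small sketch that reproduces the figure by hand (the variable names are illustrative):

```python
import difflib

a, b = "like a virgi", "like a virgin"
matcher = difflib.SequenceMatcher(None, a, b)

# Sum the sizes of all matching blocks; the final sentinel block has size 0.
matched = sum(block.size for block in matcher.get_matching_blocks())

# ratio() is defined as 2 * M / T.
manual = 2 * matched / (len(a) + len(b))
print(matched, manual, matcher.ratio())
```

The value is a ratio between 0 and 1 rather than a percentage; multiply by 100 if a percentage is wanted.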

Step 5: Merge again using the closest match

In [80]: unmerged.rename(columns={'Song': 'Old_Song', 'closest_song': 'Song'}, inplace=True)

In [81]: new = unmerged.merge(df1, on='Song')[['Song', 'Artist', 'Rank']]

In [82]: new
Out[82]:
           Song   Artist  Rank
0  like a virgi  madonna   2.0

In [83]: pd.concat([merged, new])
Out[83]:
           Song    Artist  Rank
0        mysong  myartist   1.0
0  like a virgi   madonna   2.0
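
The question asked for a slightly different shape: df1 with the other frame's spelling alongside its own, as in the Result table. The sketch below is one way to get there directly with the same difflib call. The function name add_alt_spelling, the helper best_match, and the cutoff parameter are hypothetical names introduced for illustration; they are not part of the original answer.

```python
import difflib

import pandas as pd


def add_alt_spelling(df1, df2, cutoff=0.6):
    """Return df1 with df2's spelling in 'Song' and df1's own spelling in 'SongAlt'."""
    candidates = df2["Song"].unique().tolist()

    def best_match(title):
        # Exact matches score 1.0, so they are returned unchanged.
        matches = difflib.get_close_matches(title, candidates, n=1, cutoff=cutoff)
        return matches[0] if matches else None

    # rename returns a new frame, so df1 itself is left untouched.
    out = df1.rename(columns={"Song": "SongAlt"})
    out["Song"] = out["SongAlt"].map(best_match)
    return out[["Song", "Artist", "SongAlt"]]


df1 = pd.DataFrame({"Song": ["mysong", "like a virgi"],
                    "Artist": ["myartist", "madonna"]})
df2 = pd.DataFrame({"Song": ["mysong", "like a virgin"],
                    "Rank": [1, 2]})

result = add_alt_spelling(df1, df2)
print(result)
```

Songs with no candidate above the cutoff get None in the Song column, so they can be reviewed by hand rather than silently paired with a poor match.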

Regarding "python - Similarity between 2 dataframe columns", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50560174/
