
python - Merge two pandas DataFrames on list-of-strings columns that match by more than 50%

Reposted. Author: 行者123. Last updated: 2023-12-01 00:27:23

I have two DataFrames:

df1 = pd.DataFrame({'name1': ['ID1', 'ID2','ID3','ID4','ID5','ID6','ID7'], 'col1': ["S3,S22,S44", "S133,S32,S334", "S13,S24,S45", "S1,S2,S4,S5", "S3,S4,S5", "S3,S2,S5", "S38,S42,S9"],'col2': ['ab', 'ac','ad','ae','af','an','ak'],})
df2 = pd.DataFrame({'name2': ['Ik3', 'Ik1','Ik2','Ik7','Ik5','Ik6','Ik5'], 'col1': ["S3, S44, S22,S54", "S133, S32,S334, S30", "S13, S24,S45", "S11, S21,S4, S5", "S3, S4,S5", "S3, S22,S5", "S3, S4,S9, S10, S13"],'col2': ['ab', 'ae','ac','ad','af','ab','ak']})

df1

name1  col1           col2
ID1    S3,S22,S44     ab
ID2    S133,S32,S334  ac
ID3    S13,S24,S45    ad
ID4    S1,S2,S4,S5    ae
ID5    S3,S4,S5       af
ID6    S3,S2,S5       an
ID7    S38,S42,S9     ak

df2

name2  col1               col2
Ik3    S3,S44,S22,S54     ab
Ik1    S133,S32,S334,S30  ae
Ik2    S13,S24,S45        ac
Ik7    S11,S21,S4,S5      ad
Ik5    S3,S4,S5           af
Ik6    S3,S22,S5          ab
Ik5    S3,S4,S9,S10,S13   ak

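I want to compare the col2 lists of the two DataFrames, merge the rows whose lists match by more than 50%, and leave the remaining rows empty.

Desired output: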

name1  col1           col2  M_name2  M_col1             M_col2  percentage
ID1    S3,S22,S44     ab    Ik3      S3,S44,S22,S54     ab      75
ID1    S3,S22,S44     ab    Ik6      S3, S22,S5         ab      75
ID2    S133,S32,S334  ac    Ik1      S133,S32,S334,S30  ae      50
ID3    S13,S24,S45    ad    Ik2      S13,S24,S45        ac      100
ID4    S1,S2,S4,S5    ae    Ik5      S3,S4,S5           af      50
ID5    S3,S4,S5       af    Ik5      S3,S4,S5           af      100
ID6    S3,S2,S5       an    Ik5      S3,S4,S5           af      75
ID7    S38,S42,S9     ak    NaN      NaN                NaN     NaN

I tried the isin function:

df1[df1.col2.isin(df2.col2)]

But that did not give the desired output. Any suggestions would be appreciated.

Best answer

Your description and your output do not match. Still, here is some code that will hopefully get you started.

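import pandas as pd

# Column names follow this answer's output below: col2 is taken to be the
# comma-separated list column (in the question's frames as posted, that is col1).
def get_ratios(df1, df2):
    # Split on commas and strip stray whitespace around each item.
    clean = lambda s: list(map(str.strip, s.split(',')))
    # Rows are paired by position: row i of df1 against row i of df2.
    for a, b in zip(df1.col2, df2.col2):
        vals1, vals2 = clean(a), clean(b)

        inter = set(vals1).intersection(vals2)
        ratio = len(inter) / max(len(vals1), len(vals2))

        yield ratio

s = pd.Series(get_ratios(df1, df2))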
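
To make the ratio concrete, here is a minimal standalone sketch of the same calculation for a single pair of strings. The helper name overlap_ratio is hypothetical and not part of the answer; the two inputs are taken from the question's frames.

```python
def overlap_ratio(a: str, b: str) -> float:
    """Shared items divided by the length of the longer list (as in get_ratios)."""
    vals_a = [item.strip() for item in a.split(",")]
    vals_b = [item.strip() for item in b.split(",")]
    shared = set(vals_a) & set(vals_b)
    return len(shared) / max(len(vals_a), len(vals_b))


# ID1 against Ik3: three shared items, longer list has four entries.
print(overlap_ratio("S3,S22,S44", "S3, S44, S22,S54"))  # 0.75

# ID7 against the last df2 row: only S9 is shared, longer list has five entries.
print(overlap_ratio("S38,S42,S9", "S3, S4,S9, S10, S13"))  # 0.2
```

Because the denominator is the longer list, a short list that is fully contained in a much longer one still scores low.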

Then merge:

(df1.merge(df2.rename(columns={'col1': 'nma1'}),
           on=['col3'])
    .assign(percentage=s))  # .where(s > 0.5)
Output:

   col1  col2_x          col3  nma1  col2_y               percentage
0  ID1   S3, S22,S44     ab    Ik3   S3, S44, S22,S54     0.750000
1  ID2   S133, S32,S334  ac    Ik1   S133, S32,S334, S30  0.750000
2  ID3   S13, S2 4,S45   ad    Ik2   S13, S24,S45         0.666667
3  ID4   S1, S2,S4 S5    ae    Ik7   S11, S21,S4, S5      0.000000
4  ID5   S3, S4,S5       af    Ik5   S3, S4,S5            1.000000
5  ID6   S3, S2,S5       as    Ik6   S3, S2,S5            1.000000
6  ID7   S3, S4,S9       ak    Ik5   S3, S4,S9, S10, S13  0.600000
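
The answer above pairs rows by position, so it cannot return one df1 row matched to several df2 rows, as the desired output does for ID1. One possible way to get closer to that layout is to compare every row of df1 with every row of df2. The following is only a sketch, not the answerer's method. It assumes pandas 1.2 or newer (for how="cross"), that the lists live in col1 as in the question's frames, and that "more than 50%" is meant to include exactly 50%, as the desired output suggests. The helper overlap_ratio and the variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'name1': ['ID1', 'ID2', 'ID3', 'ID4', 'ID5', 'ID6', 'ID7'],
    'col1': ["S3,S22,S44", "S133,S32,S334", "S13,S24,S45", "S1,S2,S4,S5",
             "S3,S4,S5", "S3,S2,S5", "S38,S42,S9"],
    'col2': ['ab', 'ac', 'ad', 'ae', 'af', 'an', 'ak'],
})
df2 = pd.DataFrame({
    'name2': ['Ik3', 'Ik1', 'Ik2', 'Ik7', 'Ik5', 'Ik6', 'Ik5'],
    'col1': ["S3, S44, S22,S54", "S133, S32,S334, S30", "S13, S24,S45",
             "S11, S21,S4, S5", "S3, S4,S5", "S3, S22,S5", "S3, S4,S9, S10, S13"],
    'col2': ['ab', 'ae', 'ac', 'ad', 'af', 'ab', 'ak'],
})


def overlap_ratio(a, b):
    """Shared items divided by the length of the longer list."""
    vals_a = [item.strip() for item in a.split(",")]
    vals_b = [item.strip() for item in b.split(",")]
    return len(set(vals_a) & set(vals_b)) / max(len(vals_a), len(vals_b))


# Cross join: every df1 row against every df2 row (requires pandas >= 1.2).
# The M_ prefix keeps the df2 columns apart from the df1 columns.
pairs = df1.merge(df2.add_prefix("M_"), how="cross")

# Score each candidate pair as a percentage.
pairs["percentage"] = [
    100 * overlap_ratio(a, b) for a, b in zip(pairs["col1"], pairs["M_col1"])
]

# Assumption: a 50% match counts, because the desired output lists 50% rows.
matches = pairs[pairs["percentage"] >= 50]

# Left merge back onto df1 so rows without any match stay, filled with NaN.
result = df1.merge(matches, on=list(df1.columns), how="left")
print(result)
```

With the frames as posted, ID7 has no partner at or above the threshold and is kept with NaN in the M_ columns. The result does not reproduce the desired output row for row, which is consistent with the answer's remark that the description and the output do not match. The cross join grows with the product of the two row counts, so this approach suits small frames.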

Regarding "python - Merge two pandas DataFrames on list-of-strings columns that match by more than 50%", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58456499/
