
python - Count the number of identical words between two columns in python pandas

Reposted. Author: 太空狗. Updated: 2023-10-30 00:47:39

Suppose I have the following table in python pandas:

friend_description  friend_definition
James is dumb       dumb dude
Jacob is smart      smart guy
Jane is pretty      she looks pretty
Susan is rich       she is rich

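Here, in the first row the word 'dumb' appears in both columns. In the second row, 'smart' appears in both columns. In the third row, 'pretty' appears in both columns, and in the last row, 'is' and 'rich' appear in both columns. I want to create the following columns: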

friend_description  friend_definition  word_overlap  overlap_count
James is dumb       dumb dude          dumb          1
Jacob is smart      smart guy          smart         1
Jane is pretty      she looks pretty   pretty        1
Susan is rich       she is rich        is rich       2

I could define such a new column manually with a for loop, but I am wondering whether pandas has a function that makes this kind of operation smoother.
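For reference, the table above and the loop-based baseline mentioned in the question could be sketched roughly as follows. The variable names (`df`, `overlaps`) and the choice to list shared words in the order they appear in `friend_description` are assumptions made for illustration; the question itself does not show its loop.

```python
import pandas as pd

# Rebuild the example table from the question
df = pd.DataFrame({
    "friend_description": ["James is dumb", "Jacob is smart",
                           "Jane is pretty", "Susan is rich"],
    "friend_definition": ["dumb dude", "smart guy",
                          "she looks pretty", "she is rich"],
})

overlaps = []
for desc, defn in zip(df["friend_description"], df["friend_definition"]):
    defn_words = set(defn.split())
    # dict.fromkeys drops repeated words while keeping first-seen order
    shared = [w for w in dict.fromkeys(desc.split()) if w in defn_words]
    overlaps.append(shared)

# Assumption: store the overlap as a space-joined string, as in the desired table
df["word_overlap"] = [" ".join(words) for words in overlaps]
df["overlap_count"] = [len(words) for words in overlaps]
```

Words are compared exactly as split on whitespace, so the match is case-sensitive and punctuation is not stripped.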

Best answer

A simple list comprehension seems to be the fastest way to handle strings like this:

In [112]: df['word_overlap'] = [set(x[0].split()) & set(x[1].split()) for x in df.values]

In [113]: df['overlap_count'] = df['word_overlap'].str.len()

In [114]: df
Out[114]:
   friend_description  friend_definition  word_overlap  overlap_count
0       James is dumb          dumb dude        {dumb}              1
1      Jacob is smart          smart guy       {smart}              1
2      Jane is pretty   she looks pretty      {pretty}              1
3       Susan is rich        she is rich    {rich, is}              2
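The comprehension above indexes each row of `df.values` by position, so it relies on the two text columns being the first two columns of the frame. A variant that selects the columns by name might look like the sketch below; the use of `zip` and `.map(len)` is a substitution for illustration, not what the answer itself uses.

```python
import pandas as pd

df = pd.DataFrame({
    "friend_description": ["James is dumb", "Jacob is smart",
                           "Jane is pretty", "Susan is rich"],
    "friend_definition": ["dumb dude", "smart guy",
                          "she looks pretty", "she is rich"],
})

# Pair the two columns by name rather than by position in df.values
df["word_overlap"] = [
    set(a.split()) & set(b.split())
    for a, b in zip(df["friend_description"], df["friend_definition"])
]

# Each cell holds a set, so its size is the number of shared words
df["overlap_count"] = df["word_overlap"].map(len)
```

Because the result is a set, the order of the shared words is not guaranteed, which is why the printed output can show `{rich, is}` in one run and `{is, rich}` in another.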

A single apply(..., axis=1):

In [85]: df['word_overlap'] = df.apply(lambda r: set(r['friend_description'].split()) &
    ...:                                         set(r['friend_definition'].split()),
    ...:                               axis=1)
    ...:

In [86]: df['overlap_count'] = df['word_overlap'].str.len()

In [87]: df
Out[87]:
   friend_description  friend_definition  word_overlap  overlap_count
0       James is dumb          dumb dude        {dumb}              1
1      Jacob is smart          smart guy       {smart}              1
2      Jane is pretty   she looks pretty      {pretty}              1
3       Susan is rich        she is rich    {rich, is}              2

The apply().apply(..., axis=1) approach:

In [23]: df['word_overlap'] = (df.apply(lambda x: x.str.split(expand=False))
    ...:                         .apply(lambda r: set(r['friend_description']) & set(r['friend_definition']),
    ...:                                axis=1))
    ...:

In [24]: df['overlap_count'] = df['word_overlap'].str.len()

In [25]: df
Out[25]:
   friend_description  friend_definition  word_overlap  overlap_count
0       James is dumb          dumb dude        {dumb}              1
1      Jacob is smart          smart guy       {smart}              1
2      Jane is pretty   she looks pretty      {pretty}              1
3       Susan is rich        she is rich    {is, rich}              2

Timing against a 40,000-row DF:

In [104]: df = pd.concat([df] * 10**4, ignore_index=True)

In [105]: df.shape
Out[105]: (40000, 2)

In [106]: %timeit [set(x[0].split()) & set(x[1].split()) for x in df.values]
223 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [107]: %timeit df.apply(lambda r: set(r['friend_description'].split()) & set(r['friend_definition'].split()), axis=1)
3.65 s ± 46.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [108]: %timeit df.apply(lambda x: x.str.split(expand=False)).apply(lambda r: set(r['friend_description']) & set(r['friend_definition']),
     ...:                                                            axis=1)
4.63 s ± 84.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
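Outside IPython, a comparison like the one above could be approximated with the standard `timeit` module. The sketch below is only an illustration: the helper names `with_comprehension` and `with_apply` are hypothetical, and the frame is repeated 10**3 times (4,000 rows) instead of 10**4 to keep the run short. Absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

small = pd.DataFrame({
    "friend_description": ["James is dumb", "Jacob is smart",
                           "Jane is pretty", "Susan is rich"],
    "friend_definition": ["dumb dude", "smart guy",
                          "she looks pretty", "she is rich"],
})

# Assumption: 4,000 rows rather than the 40,000 used in the session above
big = pd.concat([small] * 10**3, ignore_index=True)


def with_comprehension(frame):
    """List comprehension over the raw row values."""
    return [set(x[0].split()) & set(x[1].split()) for x in frame.values]


def with_apply(frame):
    """Row-wise apply with axis=1."""
    return frame.apply(
        lambda r: set(r["friend_description"].split())
        & set(r["friend_definition"].split()),
        axis=1,
    )


# Check that both approaches agree before comparing their speed
same_result = with_comprehension(big) == with_apply(big).tolist()

runs = 3
t_comp = timeit.timeit(lambda: with_comprehension(big), number=runs) / runs
t_apply = timeit.timeit(lambda: with_apply(big), number=runs) / runs
print(f"comprehension: {t_comp:.4f} s per run, apply: {t_apply:.4f} s per run")
```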

Regarding python - Count the number of identical words between two columns in python pandas, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47744109/
