
python - Removing multiple recurring texts from Pandas rows

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:21:33

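I have a pandas DataFrame made up of articles scraped from a website, one article per row. There are 100,000 articles of a similar nature.

Here is a glimpse of my dataset.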

text
0 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
1 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
2 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
3 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
4 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
5 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
6 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
7 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
8 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
for those who werent as productive as they would have liked during the first half of 2018
28 for those who werent as productive as they would have liked during the first half of 2018
29 for those who werent as productive as they would have liked during the first half of 2018
30 for those who werent as productive as they would have liked during the first half of 2018
31 for those who werent as productive as they would have liked during the first half of 2018
32 for those who werent as productive as they would have liked during the first half of 2018

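These are the opening passages of each text, and they repeat across rows. The main text comes after them.

Is there a method or function that can identify these passages and strip them out in a few lines of code?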

Best answer

I think you could use difflib in some way, for example:

>>> import difflib
>>> a = "my mother always told me to mind my business"
>>> b = "my mother always told me to be polite"
>>> s = difflib.SequenceMatcher(None,a,b)
>>> s.find_longest_match(0,len(a),0,len(b))

Output:

Match(a=0, b=0, size=28)

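Here a=0 means the matching sequence starts at character 0 of string a, b=0 means it starts at character 0 of string b, and size=28 is the length of the match.

Now if you do: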

>>> b.replace(a[:28],"")

the output will be:

'be polite'

If you assign c = s.find_longest_match(0,len(a),0,len(b)), then c[0] = 0, c[1] = 0, and c[2] = 28.
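Since the result is a named tuple, the same values can also be read by field name. The sketch below reuses the two example strings from above and removes the matched span by position rather than with str.replace, which would remove every occurrence of the substring and not just the matched one. The variable names shared and remainder are illustrative only.

```python
import difflib

a = "my mother always told me to mind my business"
b = "my mother always told me to be polite"

matcher = difflib.SequenceMatcher(None, a, b)
m = matcher.find_longest_match(0, len(a), 0, len(b))

# Match is a named tuple: m.a == m[0], m.b == m[1], m.size == m[2]
shared = a[m.a:m.a + m.size]

# Cut the matched span out of b by position (only this one occurrence)
remainder = b[:m.b] + b[m.b + m.size:]

print(repr(shared))
print(repr(remainder))
```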

You can read more about it here: https://docs.python.org/2/library/difflib.html
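To carry this over to many rows, one possible approach is to compare each text with its neighbours and drop whatever leading text they share. The following is only a sketch under assumptions that do not come from the question: the helper names shared_prefix_len and strip_shared_prefixes and the min_len threshold are hypothetical, rows with the same opening passage are assumed to sit next to each other, and a shared prefix is only detected when it is also the longest match between the two texts. SequenceMatcher may be slow on 100,000 long articles.

```python
from difflib import SequenceMatcher


def shared_prefix_len(a, b):
    """Size of the longest match, if it starts at position 0 in both strings."""
    # autojunk=False disables the "popular element" heuristic that
    # SequenceMatcher applies to sequences of 200 or more items.
    sm = SequenceMatcher(None, a, b, autojunk=False)
    m = sm.find_longest_match(0, len(a), 0, len(b))
    # Assumption: if the longest match lies elsewhere, report no prefix.
    return m.size if m.a == 0 and m.b == 0 else 0


def strip_shared_prefixes(texts, min_len=20):
    """Remove leading text that a row shares with its previous or next row."""
    cleaned = []
    for i, text in enumerate(texts):
        best = 0
        for j in (i - 1, i + 1):
            if 0 <= j < len(texts):
                best = max(best, shared_prefix_len(text, texts[j]))
        # min_len (hypothetical threshold) guards against short accidental matches
        cleaned.append(text[best:] if best >= min_len else text)
    return cleaned


texts = [
    "my mother always told me to mind my business",
    "my mother always told me to be polite",
    "something unrelated",
]
result = strip_shared_prefixes(texts)
print(result)
```

With a DataFrame, the column could be passed in as df["text"].tolist() and the returned list assigned back to a column. Rows that are entirely identical to a neighbour come back as empty strings.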

Regarding "python - Removing multiple recurring texts from Pandas rows", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51485289/
