
python - Text pattern recognition in a pandas DataFrame

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:17:51

I am trying to get Python to match text patterns in a pandas DataFrame.

Here is what I am doing:

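import pandas as pd

# Avoid naming the variable `list`, which would shadow the built-in
words = ['sarcasm', 'irony', 'humor']
pattern = '|'.join(words)
# One capture group; no padding spaces inside the parentheses,
# since they would become part of the first and last alternatives
pattern2 = '(' + pattern + ')'

# docs_list is the list containing the snippets
frame = pd.DataFrame(docs_list, columns=['words'])

# Skipping the in-between steps for simplicity of viewing
# expand=False returns a Series, so .to_frame() is available
cp2 = frame.words.str.extract(pattern2, expand=False)
c2 = cp2.to_frame().fillna("No Matching Word Found")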

This gives output like this:

Snips                                  pattern_found    matching_Word
A different type of humor              True             humor
A different type of sarcasm            True             sarcasm
A different type of humor and irony    True             humor
A different type of reason             False            NA
A type of humor and sarcasm            True             humor
A type of comedy                       False            NA

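So Python checks the pattern and gives the corresponding output.

Now, here is my problem. As I understand it, as long as Python has not come across a word from the pattern in a snippet, it keeps checking the whole pattern. As soon as it comes across one part of the pattern, it takes that part and skips the remaining words.

How can I make Python look for every word, not just the first one that matches, so that the output looks like this?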

Snips                                  pattern_found    matching_Word
A different type of humor              True             humor
A different type of sarcasm            True             sarcasm
A different type of humor and irony    True             humor
A different type of humor and irony    True             irony
A different type of reason             False            NA
A type of humor and sarcasm            True             humor
A type of humor and sarcasm            True             sarcasm
A type of comedy                       False            NA

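A simple solution is obviously to put the pattern words in a list and iterate with a for loop, checking each word in each snippet. But time is a constraint, especially since the dataset I am working with is large and the snippets are fairly long.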
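For comparison, the loop-based approach described above could be sketched roughly as follows. This is only an illustration: the sample `docs_list` is reconstructed from the example output in the question, the name `loop_result` is hypothetical, and plain substring checks (`in`) are assumed as the matching rule.

```python
import pandas as pd

# Sample snippets reconstructed from the question's example output (assumption)
docs_list = [
    "A different type of humor",
    "A different type of sarcasm",
    "A different type of humor and irony",
    "A different type of reason",
    "A type of humor and sarcasm",
    "A type of comedy",
]
keywords = ["sarcasm", "irony", "humor"]

rows = []
for snippet in docs_list:
    # Check every keyword against the snippet (plain substring test)
    found = [word for word in keywords if word in snippet]
    # Order the hits by where they first appear in the snippet
    found.sort(key=snippet.find)
    if found:
        rows.extend((snippet, True, word) for word in found)
    else:
        rows.append((snippet, False, None))

loop_result = pd.DataFrame(rows, columns=["Snips", "pattern_found", "matching_Word"])
print(loop_result)
```

Each snippet is scanned once per keyword in Python-level code, which is the cost the question is trying to avoid on a large dataset.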

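Best answer

What works for me is extractall, with reset_index to remove the extra MultiIndex level and, finally, join to attach the matches back to the original frame.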

L = ['sarcasm','irony','humo', 'humor', 'hum']
#sorting by http://stackoverflow.com/a/4659539/2901002
L.sort()
L.sort(key = len, reverse=True)
print (L)
['sarcasm', 'humor', 'irony', 'humo', 'hum']

pattern2 = r'(?P<COL>{})'.format('|'.join(L))
print (pattern2)
(?P<COL>sarcasm|humor|irony|humo|hum)

cp2 = frame.words.str.extractall(pattern2).reset_index(level=1, drop=True)
print (cp2)
       COL
0    humor
1  sarcasm
2    humor
2    irony
4    humor
4  sarcasm

frame = frame.join(cp2['COL']).reset_index(drop=True)
print (frame)
   words                                pattern_found  matching_Word  COL
0  A different type of humor            True           humor          humor
1  A different type of sarcasm          True           sarcasm        sarcasm
2  A different type of humor and irony  True           humor          humor
3  A different type of humor and irony  True           humor          irony
4  A different type of reason           False          NaN            NaN
5  A type of humor and sarcasm          True           humor          humor
6  A type of humor and sarcasm          True           humor          sarcasm
7  A type of comedy                     False          NaN            NaN
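The steps above can be put together into one self-contained sketch. The sample `docs_list` is reconstructed from the question's example output, and two details are assumptions that are not part of the original answer: `re.escape` is applied to each keyword in case one contains regex metacharacters, and `pattern_found` is recomputed from the `COL` column.

```python
import re
import pandas as pd

# Sample snippets reconstructed from the question's example output (assumption)
docs_list = [
    "A different type of humor",
    "A different type of sarcasm",
    "A different type of humor and irony",
    "A different type of reason",
    "A type of humor and sarcasm",
    "A type of comedy",
]
frame = pd.DataFrame(docs_list, columns=["words"])

keywords = ["sarcasm", "irony", "humo", "humor", "hum"]
# Alphabetical sort, then longest first, so 'humor' is tried before 'humo' and 'hum'
keywords = sorted(sorted(keywords), key=len, reverse=True)

# re.escape is an addition here (assumption), not part of the original answer
pattern = r"(?P<COL>{})".format("|".join(re.escape(k) for k in keywords))

# extractall returns one row per match, indexed by (original row, match number);
# dropping the match level leaves the original row label on every match
matches = frame["words"].str.extractall(pattern).reset_index(level=1, drop=True)

# A left join on the index repeats a snippet once per match and keeps unmatched rows
result = frame.join(matches["COL"]).reset_index(drop=True)
result["pattern_found"] = result["COL"].notna()
print(result)
```

The longest-first ordering matters because regex alternation tries the alternatives from left to right: with hum|humo|humor, the word humor would be reported as hum.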

Regarding "python - Text pattern recognition in a pandas DataFrame", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/43732106/
