
python - Extracting sentences that contain specific words using pandas

Reposted. Author: 太空宇宙. Updated: 2023-11-04 00:43:22

I have an Excel file with a text column. All I need to do is extract, from the text column of each row, the sentences that contain specific words.

I tried defining a function for this.

import pandas as pd
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

################# Reading in excel file #####################

# Raw strings keep the Windows backslashes literal.
str_df = pd.read_excel(r"C:\Users\HP\Desktop\context.xlsx")

################# Defining a function #####################

def sentence_finder(text, word):
    # Split the text into sentences, then keep those whose tokens include `word`.
    sentences = sent_tokenize(text)
    return [sent for sent in sentences if word in word_tokenize(sent)]

################# Finding Context ##########################
str_df['context'] = str_df['text'].apply(sentence_finder, args=('snakes',))

################# Output file #################################
str_df.to_excel(r"C:\Users\HP\Desktop\context_result.xlsx")

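However, could someone help me if I need to find sentences that contain any of several specific words, such as snakes, venomous, anaconda? A sentence should contain at least one of the words. I could not get nltk.tokenize to work with multiple words.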

Words to search for: words = ['snakes','venomous','anaconda']

Input Excel file:

                    text
1. Snakes are venomous. Anaconda is venomous.
2. Anaconda lives in Amazon.Amazon is a big forest. It is venomous.
3. Snakes,snakes,snakes everywhere! Mummyyyyyyy!!!The least I expect is an anaconda.Because it is venomous.
4. Python is dangerous too.

Desired output:

A column named context is appended next to the text column above. The context column should look like this:

1. [Snakes are venomous.] [Anaconda is venomous.]
2. [Anaconda lives in Amazon.] [It is venomous.]
3. [Snakes,snakes,snakes everywhere!] [The least I expect is an anaconda.Because it is venomous.]
4. NULL

Thanks in advance.

Best answer

Here is how:

In [1]: df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                       if any(True for w in word_tokenize(sent)
                                              if w.lower() in searched_words)])

0 [Snakes are venomous., Anaconda is venomous.]
1 [Anaconda lives in Amazon.Amazon is a big forest., It is venomous.]
2 [Snakes,snakes,snakes everywhere!, !The least I expect is an anaconda.Because it is venomous.]
3 []
Name: text, dtype: object

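As you can see, there are a few problems: sent_tokenize does not split some sentences correctly, because in the input the punctuation is not always followed by a space.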
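One possible workaround, sketched here as an assumption rather than as part of the original answer, is to insert the missing space after sentence-ending punctuation before tokenizing. The helper name add_missing_spaces is hypothetical, and the pattern is deliberately simple: it would also break up things like domain names or file names with extensions, so it should be checked against the real data first.

```python
import re

# Hypothetical helper (not from the original answer): insert a space after
# '.', '!' or '?' whenever a letter follows immediately.
_MISSING_SPACE = re.compile(r'([.!?])(?=[A-Za-z])')

def add_missing_spaces(text):
    """Return `text` with a space added after unspaced sentence punctuation."""
    return _MISSING_SPACE.sub(r'\1 ', text)

print(add_missing_spaces("Anaconda lives in Amazon.Amazon is a big forest. It is venomous."))
# Anaconda lives in Amazon. Amazon is a big forest. It is venomous.
```

The cleaned column could then be produced with df['text'].apply(add_missing_spaces) before the sentence tokenizer is applied.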


Update: handling plurals.

Here is an updated df:

text
Snakes are venomous. Anaconda is venomous.
Anaconda lives in Amazon. Amazon is a big forest. It is venomous.
Snakes,snakes,snakes everywhere! Mummyyyyyyy!!! The least I expect is an anaconda. Because it is venomous.
Python is dangerous too.
I have snakes


df = pd.read_clipboard(sep='0')

We can use a stemmer (Wikipedia), such as PorterStemmer.

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

First, let's stem and lowercase the searched words:

searched_words = ['snakes','Venomous','anacondas']
searched_words = [stemmer.stem(w.lower()) for w in searched_words]
searched_words

> ['snake', 'venom', 'anaconda']

Now we can modify the code above to include stemming:

print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                     if any(True for w in word_tokenize(sent)
                                            if stemmer.stem(w.lower()) in searched_words)]))

0 [Snakes are venomous., Anaconda is venomous.]
1 [Anaconda lives in Amazon., It is venomous.]
2 [Snakes,snakes,snakes everywhere!, The least I expect is an anaconda., Because it is venomous.]
3 []
4 [I have snakes]
Name: text, dtype: object

If you only want substring matching, make sure the entries in searched_words are singular, not plural.

print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                     if any(w2.lower() in w.lower()
                                            for w in word_tokenize(sent)
                                            for w2 in searched_words)]))
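To illustrate why the singular form matters for substring matching, here is a small sketch. The helper name matches is hypothetical and only wraps Python's in operator on strings, which tests whether the left string occurs inside the right one.

```python
# Hypothetical helper for illustration: does any searched word occur
# inside the token as a substring (case-insensitive)?
def matches(token, words):
    token = token.lower()
    return any(w.lower() in token for w in words)

print(matches('Snakes', ['snake']))   # True: 'snake' occurs inside 'snakes'
print(matches('snake', ['snakes']))   # False: 'snakes' does not occur inside 'snake'
```

Note that substring matching can also over-match, since a short search term may occur inside a longer, unrelated word.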

By the way, this is the point where I would probably write a function with a regular for loop; this lambda with a list comprehension is getting out of hand.
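A minimal sketch of what such a function might look like follows. The name find_sentences and its parameters are assumptions, not code from the original answer. The tokenizers and the stemmer are passed in as arguments, and the regex-based stand-ins below exist only so the sketch runs without NLTK; they are much cruder than the real tokenizers and stemmer.

```python
import re

def find_sentences(text, searched_stems, sent_tokenize, word_tokenize, stem):
    """Return the sentences of `text` that contain at least one searched stem.

    `searched_stems` must already be lowercased and stemmed.
    """
    searched = set(searched_stems)
    found = []
    for sent in sent_tokenize(text):
        for token in word_tokenize(sent):
            if stem(token.lower()) in searched:
                found.append(sent)
                break  # one hit is enough; move on to the next sentence
    return found


# Stand-ins (assumptions) so that the sketch is runnable without NLTK.
def simple_sentences(text):
    # Split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

def simple_words(sentence):
    return re.findall(r"[A-Za-z']+", sentence)

def simple_stem(word):
    # Strips a trailing 's' only; a real stemmer handles far more cases.
    return word[:-1] if word.endswith('s') else word

text = "Snakes are venomous. Python is dangerous too. I saw an anaconda."
print(find_sentences(text, ['snake', 'anaconda'],
                     simple_sentences, simple_words, simple_stem))
# ['Snakes are venomous.', 'I saw an anaconda.']
```

With NLTK available, the same function could presumably be applied to the column as df['text'].apply(find_sentences, args=(searched_words, sent_tokenize, word_tokenize, stemmer.stem)), where searched_words holds the stemmed search terms.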

Regarding "python - Extracting sentences that contain specific words using pandas", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40861341/
