
python - Removing punctuation from a list of sentences in a pandas dataframe


I have emails in a pandas dataframe. Before applying sent_tokenize, I can remove the punctuation like this:

def removePunctuation(fullCorpus):
    # regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching
    punctuationRemoved = fullCorpus['text'].str.replace(r'[^\w\s]+', '', regex=True)
    return punctuationRemoved
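
For a quick sanity check, here is a minimal usage sketch; the toy dataframe and its contents are assumptions for illustration, not from the original question:

import pandas as pd

df = pd.DataFrame({'text': ["Hello, world!", "No punctuation here."]})
print(removePunctuation(df))
# 0            Hello world
# 1    No punctuation here
# Name: text, dtype: object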

After applying sent_tokenize, the dataframe looks like the sample below. How can I remove the punctuation while keeping the sentences tokenized in lists?


def tokenizeSentences(fullCorpus):
    sent_tokenized = fullCorpus['body_text'].apply(sent_tokenize)
    return sent_tokenized

Sample of the data frame after tokenizing into sentences:

[Nah I don't think he goes to usf, he lives around here though]                                                                                                                                                                                                                          

[Even my brother is not like to speak with me., They treat me like aids patent.]

[I HAVE A DATE ON SUNDAY WITH WILL!, !]

[As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers., Press *9 to copy your friends Callertune]

[WINNER!!, As a valued network customer you have been selected to receivea £900 prize reward!, To claim call 09061701461., Claim code KL341., Valid 12 hours only.]

Best Answer

You can try the following function: use apply to iterate over each word and character in the sentences, check whether each character is in string.punctuation, and rejoin the rest with ''.join. You also need map, because the function has to be applied to every sentence in each row's list:

import string
from nltk.tokenize import sent_tokenize

def tokenizeSentences(fullCorpus):
    sent_tokenized = fullCorpus['body_text'].apply(sent_tokenize)
    # drop every punctuation character from a tokenized sentence
    f = lambda sent: ''.join(ch for w in sent for ch in w
                             if ch not in string.punctuation)
    # map applies f to every sentence in each row's list
    sent_tokenized = sent_tokenized.apply(lambda row: list(map(f, row)))
    return sent_tokenized

Note that string.punctuation requires importing the string module (the import is included above).
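
A minimal usage sketch follows; the sample dataframe is an assumption for illustration, and NLTK's sentence tokenizer model must be downloaded once:

import nltk
import pandas as pd

nltk.download('punkt')  # one-time download; newer NLTK releases may ask for 'punkt_tab' instead

df = pd.DataFrame({'body_text': ["WINNER!! Claim code KL341. Valid 12 hours only."]})
print(tokenizeSentences(df)[0])
# ['WINNER', 'Claim code KL341', 'Valid 12 hours only']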

Regarding python - removing punctuation from a list of sentences in a pandas dataframe, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51687596/
