
python - Preprocessing script does not remove punctuation

Reprinted. Author: 太空宇宙. Updated: 2023-11-03 20:41:08

I have a script that is supposed to preprocess a list of text documents: given a list of documents, it returns a list in which each document has been preprocessed. For some reason, though, the punctuation removal does not work.

import string

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download("stopwords")
nltk.download('punkt')
nltk.download('wordnet')


def preprocess(docs):
    """
    Given a list of documents, return each document as a string of tokens,
    stripping out punctuation.
    """
    clean_docs = [clean_text(i) for i in docs]
    tokenized_docs = [tokenize(i) for i in clean_docs]
    return tokenized_docs

def tokenize(text):
    """
    Tokenizes text -- returning the tokens as a string.
    """
    stop_words = stopwords.words("english")
    nltk_tokenizer = nltk.WordPunctTokenizer().tokenize
    tokens = nltk_tokenizer(text)
    result = " ".join([i for i in tokens if i not in stop_words])
    return result


def clean_text(text):
    """
    Cleans text by removing case
    and stripping out punctuation.
    """
    new_text = make_lowercase(text)
    new_text = remove_punct(new_text)
    return new_text


def make_lowercase(text):
    new_text = text.lower()
    return new_text


def remove_punct(text):
    text = text.split()
    punct = string.punctuation
    new_text = " ".join(word for word in text if word not in string.punctuation)
    return new_text

# Get a list of titles
s1 = "[UPDATE] I am tired"
s2 = "I am cold."

clean_docs = preprocess([s1, s2])
print(clean_docs)

This prints:

['[ update ] tired', 'cold .']

In other words, it does not remove punctuation: "[", "]" and "." all survive into the final output.
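The root cause can be seen with a quick check in the REPL. `string.punctuation` is a single string of punctuation characters, so `word not in string.punctuation` only filters out words that are themselves a substring of that string (i.e. bare punctuation marks), never words that merely *contain* punctuation:

```python
import string

# string.punctuation is one string holding all ASCII punctuation,
# e.g. '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'.
print(string.punctuation)

# A whole word is never a substring of that string...
print("[update]" in string.punctuation)   # False
print("cold." in string.punctuation)      # False

# ...only a lone punctuation character is.
print("." in string.punctuation)          # True
```

So the `remove_punct` filter above only drops tokens that consist of nothing but punctuation, which is why "[update]" and "cold." pass through untouched.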

Best Answer

You are searching for whole words inside the punctuation set, and [UPDATE] is obviously not a punctuation character.

Instead, search for (and replace) the punctuation characters inside the text:

import string


def remove_punctuation(text: str) -> str:
    for p in string.punctuation:
        text = text.replace(p, '')
    return text


if __name__ == '__main__':
    text = '[UPDATE] I am tired'
    print(remove_punctuation(text))

# output:
# UPDATE I am tired
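As a side note, the same stripping can be done in a single pass with `str.translate` and a deletion table built by `str.maketrans`, which avoids looping over the string once per punctuation character. A minimal sketch using only the standard library:

```python
import string

# Build a translation table once: each punctuation character maps
# to None, i.e. it is deleted from the string.
_DELETE_PUNCT = str.maketrans('', '', string.punctuation)


def remove_punctuation(text: str) -> str:
    # One pass over the string, dropping every punctuation character.
    return text.translate(_DELETE_PUNCT)


print(remove_punctuation('[UPDATE] I am tired'))  # UPDATE I am tired
```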

Regarding "python - Preprocessing script does not remove punctuation", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56851886/
