
python - How to remove meaningless or incomplete words from a corpus?


I am doing some NLP analysis on a body of text. I have cleaned the text, taking steps to remove non-alphanumeric characters, whitespace, duplicate words, and stopwords, and to perform stemming and lemmatization:

from nltk.tokenize import word_tokenize
import nltk.corpus
import re
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import pandas as pd

data_df = pd.read_csv('path/to/file/data.csv')

stopwords = nltk.corpus.stopwords.words('english')

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

# Function to remove duplicate words from a sentence while preserving order
def unique_list(l):
    ulist = []
    for x in l:
        if x not in ulist:
            ulist.append(x)
    return ulist

for i in range(len(data_df)):

    # Convert to lower case, split into individual words using word_tokenize
    sentence = word_tokenize(data_df['O_Q1A'][i].lower())

    # Remove stopwords
    filtered_sentence = [w for w in sentence if w not in stopwords]

    # Remove duplicate words from the sentence
    filtered_sentence = unique_list(filtered_sentence)

    # Remove non-letter characters, but don't remove whitespace just yet
    junk_free_sentence = []
    for word in filtered_sentence:
        junk_free_sentence.append(re.sub(r"[^\w\s]", " ", word))

    # Stem the junk-free sentence
    stemmed_sentence = []
    for w in junk_free_sentence:
        stemmed_sentence.append(stemmer.stem(w))

    # Lemmatize the stemmed sentence
    lemmatized_sentence = []
    for w in stemmed_sentence:
        lemmatized_sentence.append(lemmatizer.lemmatize(w))

    # Use .loc to avoid pandas chained-assignment warnings
    data_df.loc[i, 'O_Q1A'] = ' '.join(lemmatized_sentence)

But when I display the top words (according to some criterion), I still get some junk, such as:

ask
much
thank
work
le
know
via
sdh
n
sy
t
n t
recommend
never

Of these most frequent words, only 5 make sense (ask, know, recommend, thank, work). What else do I need to do so that only meaningful words are kept?

Best Answer

The default NLTK stopword list is a minimal one, and it certainly does not contain words like "ask" and "much", because those are not generally meaningless. They just happen to be irrelevant to you, while they could be relevant to someone else. For your problem, you can always apply a custom stopword filter after using NLTK. A simple example:

from nltk.corpus import stopwords

def removeStopWords(text):
    # Select English stopwords
    cachedStopWords = set(stopwords.words("english"))
    # Add custom words
    cachedStopWords.update(('ask', 'much', 'thank', 'etc.'))
    # Remove stop words
    new_str = ' '.join([word for word in text.split() if word not in cachedStopWords])
    return new_str
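
Applied to a sample sentence, the filter drops both the default NLTK stopwords and the custom additions. A hypothetical call (the exact output depends on the stopword list shipped with your NLTK version):

print(removeStopWords("please ask them how much it costs"))
# 'ask' and 'much' are caught by the custom additions,
# 'them', 'how' and 'it' by the default list -> "please costs"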

Alternatively, you can edit the NLTK stopword list itself, which is essentially a plain text file stored in the NLTK data directory.
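
If you take that route, here is a minimal sketch for locating the file, assuming the stopwords corpus has already been downloaded with nltk.download('stopwords'):

import nltk.data

# Resolve the path of the English stopwords file in the NLTK data directory;
# it is a plain text file with one stopword per line that can be edited directly
print(nltk.data.find('corpora/stopwords/english'))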

For "python - How to remove meaningless or incomplete words from a corpus?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51237716/
