
python - Implementing n-grams in Python code for multi-class text classification

Reposted. Author: 太空宇宙. Updated: 2023-11-03 21:03:51

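I am new to Python and am working on multi-class text classification of contract documents from the construction industry. I am having trouble implementing n-grams in my code, which I put together with help from various online resources. I want to use unigrams, bigrams, and trigrams in my code. Any help with this would be highly appreciated.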

I tried bigrams and trigrams in the TF-IDF part of the code, but it is not working.

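import pandas as pd
from nltk import tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv('projectdataayes.csv')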
df = df[pd.notnull(df['types'])]
my_types = ['Requirement','Non-Requirement']

#converting to lower case
df['description'] = df.description.map(lambda x: x.lower())

#Removing the punctuation
df['description'] = df.description.str.replace(r'[^\w\s]', '', regex=True)

#splitting the word into tokens
df['description'] = df['description'].apply(tokenize.word_tokenize)

#stemming
stemmer = PorterStemmer()
df['description'] = df['description'].apply(lambda x: [stemmer.stem(y) for y in x])

print(df[:10])

## This converts the list of words into space-separated strings
df['description'] = df['description'].apply(lambda x: ' '.join(x))
count_vect = CountVectorizer()
counts = count_vect.fit_transform(df['description'])


X_train, X_test, y_train, y_test = train_test_split(counts, df['types'], test_size=0.3, random_state=39)

tfidf_vect_ngram = TfidfVectorizer(analyzer='word',
token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram.fit(df['description'])
X_train_Tfidf = tfidf_vect_ngram.transform(X_train)
X_test_Tfidf = tfidf_vect_ngram.transform(X_test)

model = MultinomialNB().fit(X_train, y_train)

File "C:\Users\fhassan\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 328, in tokenize(preprocess(self.decode(doc))), stop_words)

File "C:\Users\fhassan\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 256, in return lambda x: strip_accents(x.lower())

File "C:\Users\fhassan\anaconda3\lib\site-packages\scipy\sparse\base.py", line 686, in __getattr__ raise AttributeError(attr + " not found")

AttributeError: lower not found

Best answer

First, you fit the vectorizer on the text:

tfidf_vect_ngram.fit(df['description']) 

Then you try to apply it to the counts:

counts = count_vect.fit_transform(df['description'])
X_train, X_test, y_train, y_test = train_test_split(counts, df['types'], test_size=0.3, random_state=39)
tfidf_vect_ngram.transform(X_train)
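
The failure can be reproduced in isolation. The sketch below is only an illustration: the `docs` list is a made-up corpus, not the asker's data. It suggests that `TfidfVectorizer.transform` expects raw strings, so when it is handed the sparse matrix produced by `CountVectorizer` it ends up calling `.lower()` on a matrix row, which raises `AttributeError`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus standing in for df['description'].
docs = [
    "the contractor shall submit the schedule",
    "the owner may visit the site",
    "the contractor shall provide insurance",
]

# A sparse count matrix: numbers, not text.
counts = CountVectorizer().fit_transform(docs)

# The TF-IDF vectorizer is fitted on the text, as in the question.
tfidf = TfidfVectorizer(ngram_range=(2, 3)).fit(docs)

try:
    # Wrong input: each "document" is a row of the sparse matrix.
    tfidf.transform(counts)
    failed = False
except AttributeError as exc:
    failed = True
    print("AttributeError:", exc)

# Right input: an iterable of strings.
text_features = tfidf.transform(docs)
print(text_features.shape[0])
```

The exact wording of the error message depends on the installed scipy version, but the exception type should be the same as in the traceback from the question.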

You need to apply the vectorizer to the text, not to the counts:

X_train, X_test, y_train, y_test = train_test_split(df['description'], df['types'], test_size=0.3, random_state=39)
tfidf_vect_ngram.transform(X_train)
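
Putting the fix together, a minimal end-to-end sketch might look like the following. The `texts` and `labels` lists are hypothetical stand-ins for `df['description']` and `df['types']`. Two details go beyond the answer above and are assumptions about what the asker wanted: `ngram_range=(1, 3)` is used so that unigrams are included alongside bigrams and trigrams, and the classifier is trained on the TF-IDF features instead of the count matrix used in the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for df['description'] and df['types'].
texts = [
    "the contractor shall submit the schedule",
    "the contractor shall provide insurance",
    "the contractor shall repair all defects",
    "the contractor shall notify the engineer",
    "this section describes the project site",
    "the site is located near the river",
    "the project began in the spring",
    "this document has five sections",
]
labels = ["Requirement"] * 4 + ["Non-Requirement"] * 4

# Split the raw text first, so the vectorizer only ever sees strings.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=39, stratify=labels
)

# ngram_range=(1, 3) covers unigrams, bigrams and trigrams;
# the question's (2, 3) would leave unigrams out.
tfidf_vect_ngram = TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}",
                                   ngram_range=(1, 3), max_features=5000)

# Fit on the training text only, then transform both splits.
X_train_tfidf = tfidf_vect_ngram.fit_transform(X_train)
X_test_tfidf = tfidf_vect_ngram.transform(X_test)

# Train on the TF-IDF features rather than on the raw counts.
model = MultinomialNB().fit(X_train_tfidf, y_train)
predictions = model.predict(X_test_tfidf)
print(predictions)
```

Fitting the vectorizer on the training split only is a design choice in this sketch; the answer fits it on the whole `description` column, which also runs but lets vocabulary and IDF statistics from the test set influence the features.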

Regarding "python - Implementing n-grams in Python code for multi-class text classification", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55555159/
