
python - Problem loading text data with scikit-learn?

Reposted. Author: 行者123. Updated: 2023-11-30 08:55:23

I am classifying my own data into two classes, so I have:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load the text data
categories = [
'CLASS_1',
'CLASS_2',
]

text_train_subset = load_files('train',
categories=categories)

text_test_subset = load_files('test',
categories=categories)

# Turn the text documents into vectors of word frequencies
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(text_train_subset)
y_train = text_train_subset.target


classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
classifier.score(X_train, y_train) * 100))

# Evaluate the classifier on the testing set
X_test = vectorizer.transform(text_test_subset.data)
y_test = text_test_subset.target
print("Testing score: {0:.1f}%".format(
classifier.score(X_test, y_test) * 100))

Based on the code above and the documentation, I have the following directory structure:

data_folder/
    train_folder/
        CLASS_1.txt CLASS_2.txt
    test_folder/
        test.txt

Then I get this error:

    % (size, n_samples))
ValueError: Found array with dim 0. Expected 5

I also tried fit_transform, but the result is the same. How can I fix this dimension problem?

Best answer

The first problem is that your directory structure is wrong. It needs to look like this:

container_folder/
    CLASS_1_folder/
        file_1.txt, file_2.txt, ...
    CLASS_2_folder/
        file_1.txt, file_2.txt, ...
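
A quick way to check a folder against this layout before calling load_files is sketched below. It uses only the standard library. The helper name `summarize_layout` and the throwaway folder contents are made up for illustration and are not part of scikit-learn.

```python
import tempfile
from pathlib import Path


def summarize_layout(container):
    """Map each class subfolder of `container` to the number of files in it.

    Hypothetical helper: files placed directly in `container` are skipped,
    since the layout above puts every document inside a class folder.
    """
    counts = {}
    for entry in sorted(Path(container).iterdir()):
        if entry.is_dir():
            counts[entry.name] = sum(1 for f in entry.iterdir() if f.is_file())
    return counts


with tempfile.TemporaryDirectory() as tmp:
    # A throwaway folder in the expected layout (illustrative names only).
    good = Path(tmp) / "container_folder"
    for label in ("CLASS_1_folder", "CLASS_2_folder"):
        (good / label).mkdir(parents=True)
        for i in (1, 2):
            (good / label / f"file_{i}.txt").write_text(f"sample text {i}")
    good_layout = summarize_layout(good)

    # The layout from the question: class files sit directly in the folder.
    bad = Path(tmp) / "train_folder"
    bad.mkdir()
    (bad / "CLASS_1.txt").write_text("sample text")
    (bad / "CLASS_2.txt").write_text("sample text")
    bad_layout = summarize_layout(bad)

print(good_layout)
print(bad_layout)
```

An empty result for the question's layout fits the symptom: with no class subfolders there is nothing to load. As far as I know, load_files also takes the class labels from the subfolder names, so any names passed in categories have to match those folder names exactly.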

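Both the training set and the test set need to follow this directory structure. Alternatively, you can put all the data in a single directory and split it in two with train_test_split.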
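
The single-directory alternative could look roughly like the sketch below. The class names, the sample texts, the 25% test share, and the use of stratify are assumptions chosen for illustration; the corpus is written to a temporary folder so the snippet runs on its own.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split

# Throwaway corpus in the class-per-folder layout (names and text are made up).
samples = {
    "CLASS_1": ["red apple", "green apple", "sweet apple", "sour apple"],
    "CLASS_2": ["fast car", "slow car", "blue car", "old car"],
}
with tempfile.TemporaryDirectory() as tmp:
    for label, texts in samples.items():
        folder = Path(tmp) / label
        folder.mkdir()
        for i, text in enumerate(texts):
            (folder / f"file_{i}.txt").write_text(text, encoding="utf-8")
    # encoding="utf-8" makes load_files return str documents instead of bytes.
    dataset = load_files(tmp, encoding="utf-8")

# Hold out 25% of the documents; stratify keeps both classes in each part.
docs_train, docs_test, y_train, y_test = train_test_split(
    dataset.data,
    dataset.target,
    test_size=0.25,
    stratify=dataset.target,
    random_state=0,
)
print(len(docs_train), len(docs_test))
```

The two document lists can then be passed to vectorizer.fit_transform and vectorizer.transform in place of text_train_subset.data and text_test_subset.data.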

Second,

X_train = vectorizer.fit_transform(text_train_subset)

needs to be

X_train = vectorizer.fit_transform(text_train_subset.data) # added .data
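
Why .data matters can be seen with a small experiment. The sketch below uses a hand-built Bunch as a stand-in for the object load_files returns; its contents are made up, and a real result carries more fields than the two shown here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils import Bunch

# Hand-built stand-in for the object load_files returns (contents are made up).
bunch = Bunch(
    data=["the cat sat", "the dog ran", "a bird flew"],
    target=[0, 1, 0],
)

# A Bunch is a dict subclass, so iterating over it yields its keys.
print(list(bunch))

# Passing the Bunch itself vectorizes the key names, one "document" per key.
X_wrong = CountVectorizer().fit_transform(bunch)

# Passing .data vectorizes the actual documents.
X_right = CountVectorizer().fit_transform(bunch.data)

print(X_wrong.shape[0], X_right.shape[0])
```

With the whole Bunch, the number of rows follows the number of fields rather than the number of documents, which is likely where the sample-count mismatch in the question's error comes from.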

Here is a complete, working example:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

text_train_subset = load_files('sample-data/web')
text_test_subset = text_train_subset # load your actual test data here

# Turn the text documents into vectors of word frequencies
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(text_train_subset.data)
y_train = text_train_subset.target


classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
classifier.score(X_train, y_train) * 100))

# Evaluate the classifier on the testing set
X_test = vectorizer.transform(text_test_subset.data)
y_test = text_test_subset.target
print("Testing score: {0:.1f}%".format(
classifier.score(X_test, y_test) * 100))

The directory structure of sample-data/web is

sample-data/web
├── de
│   ├── apollo8.txt
│   ├── fiv.txt
│   ├── habichtsadler.txt
└── en
    ├── elizabeth_needham.txt
    ├── equipartition_theorem.txt
    ├── sunderland_echo.txt
    └── thespis.txt

Regarding "python - Problem loading text data with scikit-learn?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27761803/
