
python - How to properly load text data in scikit-learn?


I am following this example to build a multinomial Naive Bayes classifier for text data in scikit-learn. However, the confusion matrix and the classifier's F-1 score it outputs are incorrect. I believe these errors are related to the format of my input data. I have one csv file per training example. Each csv file contains a single row with features such as "blah, blahblah, and so on". Each file is labeled as either positive or negative. How can I read these files correctly?

Here is my code:

import csv
import os

import numpy
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import KFold
from sklearn.metrics import confusion_matrix, f1_score

NEWLINE = '\n'

NEGATIVE = 'negative'
POSITIVE = 'positive'

SOURCES = [
    ('negative\\', NEGATIVE),
    ('positive\\', POSITIVE)
]

SKIP_FILES = {'cmds'}


def build_data_frame(policies, path, classification):
    rows = []
    index = []

    for policy in policies:

        current_csv = path + policy + '.csv'

        # check if the file exists before trying to read it
        if os.path.isfile(current_csv):

            with open(current_csv, 'r') as csvfile:

                reader = csv.reader(csvfile, delimiter=',', quotechar='"')

                # get each row in the policy file
                for row in reader:
                    # rejoin the fields so commas inside the text are removed
                    clean_row = ' '.join(row)
                    rows.append({'text': clean_row, 'class': classification})
                    index.append(current_csv)

    data_frame = DataFrame(rows, index=index)
    return data_frame


def policy_analyzer_main(policies, write_pol_path):
    data = DataFrame({'text': [], 'class': []})
    for path, classification in SOURCES:
        data = data.append(build_data_frame(policies, write_pol_path + path, classification))
    classify(data)


pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])


def classify(data):

    k_fold = KFold(n=len(data), n_folds=10)
    scores = []
    confusion = numpy.array([[0, 0], [0, 0]])
    for train_indices, test_indices in k_fold:
        train_text = data.iloc[train_indices]['text'].values
        train_y = data.iloc[train_indices]['class'].values.astype(str)

        test_text = data.iloc[test_indices]['text'].values
        test_y = data.iloc[test_indices]['class'].values.astype(str)

        pipeline.fit(train_text, train_y)
        predictions = pipeline.predict(test_text)

        confusion += confusion_matrix(test_y, predictions)
        score = f1_score(test_y, predictions, pos_label=POSITIVE)
        scores.append(score)

    print('Total emails classified:', len(data))
    print('Score:', sum(scores)/len(scores))
    print('Confusion matrix:')
    print(confusion)

Here is an example of the warning message I receive:

UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
'precision', 'predicted', average, warn_for)
('Total emails classified:', 75)
('Score:', 0.025000000000000001)
Confusion matrix:
[[39 35]
[46 24]]

Best Answer

Look at your predictions on each iteration of the train-test split. The warning means that on some fold your algorithm labeled every test sample as negative while the test set contained positive samples (perhaps only one of them was positive, but the warning is raised regardless).
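For instance, with made-up labels for a single fold (a minimal sketch; inside the question's classify() loop the same two print lines could go right after pipeline.predict(), using its test_y and predictions variables):

from collections import Counter

# Hypothetical labels for one cross-validation fold
test_y = ['positive', 'negative', 'negative', 'negative']
predictions = ['negative', 'negative', 'negative', 'negative']

print('fold truth:      ', Counter(test_y))       # 1 positive, 3 negative
print('fold predictions:', Counter(predictions))  # no positives at all -> warning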

Also look at how your dataset is being split, since some test folds may contain only one positive sample, which your classifier then misclassifies.
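One common remedy, not part of the original answer but a sketch assuming the same deprecated sklearn.cross_validation module the question already imports from, is to swap KFold for StratifiedKFold, which keeps the positive/negative ratio roughly constant across folds:

from sklearn.cross_validation import StratifiedKFold

# StratifiedKFold takes the labels up front so it can balance them
# across folds; this would replace KFold(n=len(data), n_folds=10)
# at the top of classify():
labels = data['class'].values.astype(str)
k_fold = StratifiedKFold(labels, n_folds=10)

for train_indices, test_indices in k_fold:
    ...  # the rest of the loop body stays the same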

For example, the warning is raised in the following case (which should clarify what is happening in your code):

from sklearn.metrics import f1_score

# the true and predicted labels of only 4 samples; no sample is
# predicted positive
f1_score([0, 0, 1, 0], [0, 0, 0, 0])

/usr/local/lib/python3.4/dist-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
'precision', 'predicted', average, warn_for)
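For comparison, as soon as at least one positive label is predicted the score is well defined and no warning is raised:

# one true positive and one false positive: precision 0.5, recall 1.0
f1_score([0, 0, 1, 0], [0, 1, 1, 0])  # ~0.67, no warning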

Regarding python - How to properly load text data in scikit-learn?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34231201/
