
matlab - Multiclass Naive Bayes Classifier: Getting Same Error Rate

Reposted · Author: 行者123 · Updated: 2023-11-30 09:38:41

I have implemented a multiclass naive Bayes classifier, but the problem is that my error rate stays the same even when I increase the training data set. I have been debugging this but cannot figure out why it happens, so I thought I would post it here and see whether I am doing something wrong.

% Naive Bayes Classifier
% This function splits the data 80:20 into train and test sets, then uses
% incremental slices (5, 10, 15, 20, 30 percent) of the training data to
% measure the error rate.
% Goal is to compare against the plots in the Stanford paper
% http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf

function [tPercent] = naivebayes(file, iter, percent)
dm = load(file);
for i = 1:iter

    % Shuffled index shared by data and labels
    idx = randperm(size(dm.data, 1));

    % Using the same idx for data and labels
    shuffledMatrix_data = dm.data(idx, :);
    shuffledMatrix_label = dm.labels(idx, :);

    percent_data_80 = round(0.8 * length(shuffledMatrix_data));

    % Doing the 80-20 split
    train = shuffledMatrix_data(1:percent_data_80, :);
    test = shuffledMatrix_data(percent_data_80+1:length(shuffledMatrix_data), :);

    % Getting the label data from the 80:20 split
    train_labels = shuffledMatrix_label(1:percent_data_80, :);
    test_labels = shuffledMatrix_label(percent_data_80+1:length(shuffledMatrix_data), :);

    % Tracker for the array of percents [5 10 15 ...]
    percent_tracker = zeros(length(percent), 2);

    for pRows = 1:length(percent)

        percentOfRows = round((percent(pRows)/100) * length(train));
        new_train = train(1:percentOfRows, :);
        new_train_label = train_labels(1:percentOfRows);

        % Get unique labels in the training slice
        numClasses = size(unique(new_train_label), 1);
        classMean = zeros(numClasses, size(new_train, 2));
        classStd = zeros(numClasses, size(new_train, 2));
        priorClass = zeros(numClasses, 1);

        % Per-class mean and std with prior
        for kclass = 1:numClasses
            classMean(kclass, :) = mean(new_train(new_train_label == kclass, :));
            classStd(kclass, :) = std(new_train(new_train_label == kclass, :));
            priorClass(kclass, :) = length(new_train(new_train_label == kclass)) / length(new_train);
        end

        error = 0;
        p = zeros(numClasses, 1);

        % Calculating the log-posterior for each test row for each class k
        for testRow = 1:length(test)
            c = 0; k = 0;
            for class = 1:numClasses
                temp_p = normpdf(test(testRow, :), classMean(class, :), classStd(class, :));
                p(class, 1) = sum(log(temp_p)) + log(priorClass(class));
            end
            % Take the max of the posterior
            [c, k] = max(p(1, :));
            if test_labels(testRow) ~= k
                error = error + 1;
            end
        end
        avgError = error / length(test);
        percent_tracker(pRows, :) = [avgError percent(pRows)];
        tPercent = percent_tracker;
        plot(percent_tracker)
    end
end
end

Here are the dimensions of my data:

x = 

data: [768x8 double]
labels: [768x1 double]

I am using the Pima dataset from UCI.

Best Answer

What results does your implementation get on the training data itself? Does it fit it perfectly?

It is hard to say for certain, but I notice the following:

  1. It is important that every class has training data. Without training examples for a class, you cannot really train the classifier to recognize it.
  2. If possible, the number of training examples should not be skewed toward some classes. For example, in 2-class classification, if the training and cross-validation examples of class 1 make up only 5% of the data, then a function that always returns class 2 will have an error of 5%. Have you tried checking precision and recall separately?
  3. You fit a normal distribution to each feature within a class and then use it for the posterior probability. I am not sure how well it behaves in terms of smoothing. Could you try re-implementing it with simple counting and see whether that gives different results?
  4. It could also be that the features are highly redundant, so the naive Bayes independence assumption over-counts the correlated evidence in the probabilities.
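The counting re-implementation suggested in point 3 can be sketched as follows. This is a minimal Python illustration rather than the asker's MATLAB code (the same logic ports directly to MATLAB), the function names are made up for this sketch, and it assumes continuous features have already been binned into discrete values beforehand:

```python
from collections import defaultdict
import math

def train_count_nb(X, y, alpha=1.0):
    """Count-based naive Bayes on discrete (pre-binned) features,
    with Laplace smoothing controlled by alpha."""
    classes = sorted(set(y))
    # Log-prior from class frequencies
    prior = {c: math.log(y.count(c) / len(y)) for c in classes}
    class_totals = {c: y.count(c) for c in classes}
    # counts[c][j][v] = number of class-c rows where feature j equals v
    counts = {c: defaultdict(lambda: defaultdict(int)) for c in classes}
    values = defaultdict(set)  # distinct values seen per feature
    for row, c in zip(X, y):
        for j, v in enumerate(row):
            counts[c][j][v] += 1
            values[j].add(v)
    return classes, prior, counts, class_totals, values, alpha

def predict_count_nb(model, row):
    """Return the class with the highest log-posterior for one row."""
    classes, prior, counts, class_totals, values, alpha = model
    best_c, best_lp = None, -math.inf
    for c in classes:
        lp = prior[c]
        for j, v in enumerate(row):
            # Smoothed P(feature j = v | class c)
            lp += math.log((counts[c][j][v] + alpha) /
                           (class_totals[c] + alpha * len(values[j])))
        if lp > best_lp:
            best_c, best_lp = c, lp
    return best_c

# Tiny usage example with pre-binned features
X = [(0, 1), (0, 1), (1, 0), (1, 0)]
y = [1, 1, 2, 2]
model = train_count_nb(X, y)
print(predict_count_nb(model, (0, 1)))  # 1
print(predict_count_nb(model, (1, 0)))  # 2
```

If a counting model like this gives a sensible learning curve while the Gaussian version does not, that points at the Gaussian fitting or posterior code rather than the data split.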

Regarding "matlab - Multiclass Naive Bayes Classifier: Getting Same Error Rate", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/12763829/
