
machine-learning - Log probabilities when implementing Naive Bayes for text classification

Reposted · Author: 行者123 · Updated: 2023-11-30 08:35:30

I am implementing the Naive Bayes algorithm for text classification. I have about 1000 documents for training and 400 documents for testing. I believe I have implemented the training part correctly, but I am confused about the testing part. Here is a brief description of what I have done:

In my training function:

vocabularySize = GetUniqueTermsInCollection(); // get all unique terms in the entire collection

spamModelArray[vocabularySize];
nonspamModelArray[vocabularySize];

for each training_file {
    class = GetClassLabel();    // 0 = spam, 1 = non-spam
    document = GetDocumentID();

    counterTotalTrainingDocs++;

    if (class == 0) {
        counterTotalSpamTrainingDocs++;
    }

    for each term in document {
        freq = GetTermFrequency(); // how many times this term appears in this document
        id = GetTermID();          // unique id of the term

        if (class == 0) { // SPAM
            spamModelArray[id] += freq;
            totalNumberofSpamWords++;    // total count of term occurrences in spam training docs
        } else {          // NON-SPAM
            nonspamModelArray[id] += freq;
            totalNumberofNonSpamWords++; // total count of term occurrences in non-spam training docs
        }
    } // for each term
} // for each training_file

for i in vocabularySize {
    spamModelArray[i] = spamModelArray[i] / totalNumberofSpamWords;
    nonspamModelArray[i] = nonspamModelArray[i] / totalNumberofNonSpamWords;
} // for

priorProb = counterTotalSpamTrainingDocs / counterTotalTrainingDocs; // prior probability of a spam document
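The training steps above can be sketched in Python roughly as follows. The function name `train` and the `(terms, label)` input format are illustrative, not from the original; the sketch also adds add-one (Laplace) smoothing, which the pseudocode omits but which keeps `log()` from ever seeing a zero probability at test time for a term missing from one class:

```python
from collections import Counter

def train(docs):
    """docs: list of (terms, label) pairs; label 0 = spam, 1 = non-spam.
    Returns per-class term probabilities and the spam prior probability.
    Add-one smoothing is an addition over the original pseudocode."""
    spam_counts, ham_counts = Counter(), Counter()
    n_docs = n_spam_docs = 0
    for terms, label in docs:
        n_docs += 1
        if label == 0:
            n_spam_docs += 1
            spam_counts.update(terms)   # accumulate term frequencies per class
        else:
            ham_counts.update(terms)
    vocab = set(spam_counts) | set(ham_counts)
    v = len(vocab)
    spam_total = sum(spam_counts.values())  # total term occurrences in spam docs
    ham_total = sum(ham_counts.values())
    # P(term | class) with add-one smoothing, so every vocab term gets mass > 0
    spam_model = {t: (spam_counts[t] + 1) / (spam_total + v) for t in vocab}
    ham_model = {t: (ham_counts[t] + 1) / (ham_total + v) for t in vocab}
    prior_spam = n_spam_docs / n_docs
    return spam_model, ham_model, prior_spam
```

With smoothing, each model is still a proper distribution over the vocabulary (its probabilities sum to 1), matching the normalization loop in the pseudocode.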

I think I understand and have implemented the training part correctly, but I am not sure I have implemented the testing part correctly. Here, I go through each test document and calculate logP(spam|d) and logP(non-spam|d) for it. Then I compare these two quantities to determine the class (spam / non-spam).

In my testing function:

vocabularySize = GetUniqueTermsInCollection(); // get all unique terms in the entire collection

for each testing_file {
    document = GetDocumentID();

    logProbabilityofSpam = 0;
    logProbabilityofNonSpam = 0;

    for each term in document {
        freq = GetTermFrequency(); // how many times this term appears in this document
        id = GetTermID();          // unique id of the term

        // log P(w1 w2 ... wn | c) = sum over j of C(wj) * log P(wj | c)
        logProbabilityofSpam += freq * log(spamModelArray[id]);
        logProbabilityofNonSpam += freq * log(nonspamModelArray[id]);
    } // for each term

    // Decide the class for this document: argmax over c of [log P(d|c) + log P(c)]
    if (logProbabilityofNonSpam + log(1 - priorProb) > logProbabilityofSpam + log(priorProb)) {
        newclass = 1; // non-spam
    } else {
        newclass = 0; // spam
    }
} // for each testing_file
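The per-document scoring in the loop above can be sketched as a small function. The name `log_score` and the dict-based `model` (term → P(term|c)) are illustrative, not from the original:

```python
import math
from collections import Counter

def log_score(terms, model, log_prior):
    """Unnormalized log posterior for one class:
    log P(c) + sum over distinct terms of freq * log P(term | c).
    Terms absent from `model` are skipped here; in practice they
    should receive a smoothed probability instead."""
    score = log_prior
    for term, freq in Counter(terms).items():
        if term in model:
            score += freq * math.log(model[term])
    return score
```

Calling this once with the spam model and once with the non-spam model reproduces the two quantities compared in the `if` above.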

My question is: instead of an exact 1 or 0 (spam / non-spam), I want to return the probability of each class. For example, I would like to see newclass = 0.8684212 so that I can apply a threshold later. But I am confused here. How can I calculate a probability for each document? Can I compute it from the log probabilities?

Best answer

According to the naive Bayes probability model, the probability that data described by a set of features {F1, F2, ..., Fn} belongs to class C is

P(C|F) = P(C) * (P(F1|C) * P(F2|C) * ... * P(Fn|C)) / P(F1, ..., Fn)

You have all of these terms (in log form) except the 1/P(F1, ..., Fn) term, since that term is not used in the naive Bayes classifier you are implementing. (Strictly speaking, the MAP classifier.)

You would also have to collect the frequencies of the features and, from them, calculate

P(F1, ..., Fn) = P(F1) * ... * P(Fn)
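In practice, for the questioner's goal there is a shortcut: since the denominator P(F1, ..., Fn) is the same for both classes, the two unnormalized log scores can be normalized against each other directly. A minimal sketch (the name `posterior_spam` is illustrative; subtracting the maximum first is the usual log-sum-exp trick to avoid exp underflow on long documents):

```python
import math

def posterior_spam(log_spam, log_nonspam):
    """Turn two unnormalized log scores into P(spam | d) in [0, 1].
    Shifting both scores by their max before exponentiating keeps
    exp() in a safe range even for very negative log scores."""
    m = max(log_spam, log_nonspam)
    p_spam = math.exp(log_spam - m)
    p_nonspam = math.exp(log_nonspam - m)
    return p_spam / (p_spam + p_nonspam)
```

This yields exactly the kind of value asked for (e.g. 0.8684212), to which a decision threshold can then be applied.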

Regarding this machine-learning question on log probabilities when implementing Naive Bayes for text classification, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/5451004/
