r - A bug in R e1071 naive Bayes?

Reposted by: 行者123  Updated: 2023-11-30 09:12:58

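I'm not experienced in the R community, so please point me elsewhere if this is not the appropriate forum...

Long story short, I'm afraid that e1071::naiveBayes favors giving labels in alphabetical order.

In a previous question (here) I noticed some strange behavior with numeric predictors in the e1071 implementation of naive Bayes. While I got a more reasonable answer, some of the probabilities seemed biased upward.

Can anyone explain why this simulation ends up the way it does? Right now I can only imagine that it is a bug...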

library(e1071)

# get a data frame with numObs rows, and numDistinctLabels possible labels
# each label is randomly drawn from letters a-z
# each label has its own distribution of a numeric variable
# this is normal(i*100, 10), i in 1:numDistinctLabels
# so, if labels are t, m, and q, t is normal(100, 10), m is normal(200, 10), etc
# the idea is that all labels should be predicted just as often
# but it seems that "a" will be predicted most, "b" second, etc

doExperiment = function(numObs, numDistinctLabels){
    possibleLabels = sample(letters, numDistinctLabels, replace=F)
    someFrame = data.frame(
        x=rep(NA, numObs),
        label=rep(NA, numObs)
    )
    numObsPerLabel = numObs / numDistinctLabels
    for(i in 1:length(possibleLabels)){
        label = possibleLabels[i]
        whichAreNA = which(is.na(someFrame$label))
        whichToSet = sample(whichAreNA, numObsPerLabel, replace=F)
        someFrame[whichToSet, "label"] = label
        someFrame[whichToSet, "x"] = rnorm(numObsPerLabel, 100*i, 10)
    }
    someFrame = as.data.frame(unclass(someFrame))
    fit = e1071::naiveBayes(label ~ x, someFrame)
    # The threshold argument doesn't seem to change the matter...
    someFrame$predictions = predict(fit, someFrame, threshold=0)
    someFrame
}

# given a labeled frame, return the label that was predicted most
getMostFrequentPrediction = function(labeledFrame){
    names(which.max(sort(table(labeledFrame$prediction))))
}

# run the experiment a few thousand times
mostPredictedClasses = sapply(1:2000, function(x) getMostFrequentPrediction(doExperiment(100, 5)))

# make a bar chart of the most frequently predicted labels
plot(table(mostPredictedClasses))

This gives a plot like the following:

[image]

Giving every label the same normal distribution (i.e. mean 100, standard deviation 10) gives:

[image]

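Regarding the confusion in the comments:

This may be straying from Stack Overflow territory, but anyway... While I'd expect the classification to be less clumpy, the standard deviations do a lot to flatten out the pdfs, and if you run this enough times you can see that one or two labels actually tend to dominate (red and black in this case).

[image]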

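Too bad we can't exploit the knowledge that the standard deviation is the same for all of them.

If you add just a little noise to the mean, the predictions become much more evenly distributed, even though there is still some misclassification.

[image]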
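The flattening effect can be reproduced without e1071. The following is a minimal sketch, assuming equal class priors (the classes are balanced) and a normal density per class fitted from the sample mean and standard deviation, which is how Gaussian naive Bayes is usually described. It is not e1071's own code, and the names `mu`, `sigma`, `dens`, and `pred` are illustrative only.

```r
set.seed(1)

# 5 labels with 20 observations each, all drawn from the same normal(100, 10)
label <- rep(c("a", "b", "c", "d", "e"), each = 20)
x <- rnorm(length(label), mean = 100, sd = 10)

# per-class estimates of mean and standard deviation (hand-rolled, not e1071)
mu    <- sapply(split(x, label), mean)
sigma <- sapply(split(x, label), sd)

# density of every observation under each class's fitted normal:
# one row per observation, one column per label
dens <- sapply(names(mu), function(k) dnorm(x, mean = mu[[k]], sd = sigma[[k]]))

# with equal priors, the predicted label is the one with the highest density
pred <- colnames(dens)[max.col(dens, ties.method = "first")]

print(round(sigma, 2))
print(table(pred))
```

Since every label shares the same true distribution, any imbalance in `table(pred)` comes from sampling noise in the estimates: a label with a larger estimated standard deviation tends to win in the tails, and one with a smaller estimate tends to win near its own mean.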

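Best answer

The problem is not naiveBayes, it's your getMostFrequentPrediction function. You return only one value even when there is a tie for first place. Because you are using table(), the counts are implicitly ordered alphabetically by label. So when you take the first maximum, it is also the alphabetically "smallest" of the tied labels. So if you call this a bunch of times: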

getMostFrequentPrediction(data.frame(predictions=sample(rep(letters[1:3], 5))))

you will always get "a", even though the letters "a", "b", and "c" all appear 5 times.
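The tie behavior can be checked in isolation with base R. This is a small sketch; `counts`, `first_max`, and `tied` are illustrative names, not part of the original code.

```r
# three labels, each appearing exactly 5 times (a three-way tie)
counts <- table(rep(letters[1:3], 5))

# table() lists the labels in alphabetical order
print(names(counts))

# same expression as in the question: sort() keeps tied entries in their
# original order and which.max() returns only the first maximum
first_max <- names(which.max(sort(counts)))
print(first_max)

# all labels that share the maximum count
tied <- names(counts)[counts == max(counts)]
print(tied)
```

Comparing `first_max` with `tied` shows that the tie is resolved by position in the table rather than by anything in the data.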

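If you want to randomly choose one of the most frequently predicted classes, here is another possible implementation

getMostFrequentPrediction = function(labeledFrame){
    tt <- table(labeledFrame$predictions)
    # names of all labels tied for the highest count
    winners <- names(tt)[tt == max(tt)]
    # pick by index: sample() on a single count n would draw from 1:n instead
    winners[sample.int(length(winners), 1)]
}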

This gives

[image]
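A quick way to confirm that ties are broken at random is to call a tie-breaking helper many times on tied input and tabulate the results. The sketch below uses a hypothetical helper, `pickRandomTopLabel`, which takes a plain vector of predictions. One caveat it works around: `sample()` treats a single number n as the range 1:n, so sampling directly from a filtered table with one entry would not return that entry; sampling an index with `sample.int()` avoids this.

```r
# hypothetical helper: return one of the most frequent labels, chosen at random
pickRandomTopLabel <- function(predictions) {
    tt <- table(predictions)
    winners <- names(tt)[tt == max(tt)]
    # sample an index so that a single winner is returned as-is
    winners[sample.int(length(winners), 1)]
}

set.seed(42)
tiedLabels <- rep(letters[1:3], 5)

# three-way tie: each label should be picked roughly a third of the time
picks <- replicate(3000, pickRandomTopLabel(sample(tiedLabels)))
print(table(picks))

# no tie: the single most frequent label is always returned
singleWinner <- pickRandomTopLabel(c("a", "b", "b"))
print(singleWinner)
```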

Regarding "r - A bug in R e1071 naive Bayes?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/25902670/
