r - Document term matrix for naive Bayes classifier: unexpected results R


Reposted by: 行者123. Updated: 2023-12-02 09:51:18

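I'm having some very annoying problems getting a naive Bayes classifier to work with a document-term matrix. I'm sure I'm making a very simple mistake, but I can't figure out what it is. My data comes from a spreadsheet of accounts. I've been asked to work out which categories (in text format: mostly department names or names of budgets) are more likely to spend money on charities and which spend mostly (or only) on private companies. It was suggested that I use a naive Bayes classifier for this. I have about a thousand rows of data to train the model and several hundred thousand rows to test it on. I have prepared the strings by replacing spaces with underscores and "and"/"&" with pluses, and then treating each category as one term, so "alcohol and drug addiction" becomes alcohol+drug_addiction.

Some example rows: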

"environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable" -> This row went to a charity
"west_north_west customer+tenancy premises h.r.a._special_maintenance" -> This row went to a private company.

Using this example as a template, I wrote the following function to build my document-term matrices (using tm) for both the training and the test data.

library(tm)
library(e1071) 

getMatrix <- function(chrVect){
    testsource <- VectorSource(chrVect)
    testcorpus <- Corpus(testsource)
    testcorpus <- tm_map(testcorpus,stripWhitespace)
    testcorpus <- tm_map(testcorpus, removeWords,stopwords("english"))
    testmatrix <- t(TermDocumentMatrix(testcorpus))
}

trainmatrix <- getMatrix(traindata$cats)
testmatrix <- getMatrix(testdata$cats)

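So far, so good. The problem comes when I try to a) fit the naive Bayes model and b) predict from that model. Using the klaR package I get a zero-probability error, because many terms have zero instances in one of the categories, and playing around with the Laplace term does not seem to fix this. Using e1071, the model fits, but when I test it with: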

model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$Code))
rs<- predict(model, as.matrix(testdata$cats))

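...every item is predicted as the same category, even though the two should be roughly equal. Something in the model is clearly not working. Looking at some of the terms in model$tables, I can see that many have a high value for private and zero for charity, and vice versa. I have used as.factor for the code.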

output:
rs   1  2
  1  0  0
  2 19 17

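Any ideas about what is going wrong? Are dtm matrices incompatible with naiveBayes? Have I missed a step in preparing the data? I'm completely out of ideas. I hope this is all clear; if not, I'm happy to clarify. Any suggestions would be much appreciated.

Best answer

I have run into this problem myself. You have done (as far as I can tell) everything right; the naive Bayes implementation in e1071 (and therefore in klaR) is buggy.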

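But there is a quick and easy workaround that gets the naive Bayes implementation in e1071 working again: convert your term columns to categorical variables, i.e. with as.factor. You have already done this for your target variable traindata$Code, but you also have to do it for your trainmatrix and then, of course, for your testdata.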
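The conversion described above can be sketched in base R as follows. This is only a minimal sketch under assumptions: `to_factor_df` is a hypothetical helper name, the counts are first reduced to presence/absence, and a tiny hand-made matrix stands in for `as.matrix(trainmatrix)`.

```r
## Hypothetical helper (not part of tm or e1071): turn a numeric
## document-term matrix into a data frame of 0/1 factors.
to_factor_df <- function(m) {
    ## assumption: presence/absence is enough, so counts above 1 collapse to 1
    present <- (m > 0) * 1
    df <- as.data.frame(present)
    ## fix the levels so training and test columns share the same coding
    df[] <- lapply(df, factor, levels = c(0, 1))
    df
}

## toy matrix standing in for as.matrix(trainmatrix)
toy <- matrix(c(2, 0, 0, 1), nrow = 2,
              dimnames = list(c("doc1", "doc2"), c("premises", "strategy")))
toy_df <- to_factor_df(toy)
str(toy_df)
```

The same conversion would then be applied to the test matrix before calling predict.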

I could not track the error down 100%, but it sits in this part of the naive Bayes implementation in e1071 (I might note that klaR is just a wrapper around e1071):

L <- log(object$apriori) + apply(log(sapply(seq_along(attribs),
            function(v) {
                nd <- ndata[attribs[v]]
                ## nd is now a cell, row i, column attribs[v]
                if (is.na(nd) || nd == 0) {
                    rep(1, length(object$apriori))
                } else {
                    prob <- if (isnumeric[attribs[v]]) {
                        ## we select table for attribute
                        msd <- object$tables[[v]]
                        ## if stddev is eqlt eps, assign threshold
                        msd[, 2][msd[, 2] <= eps] <- threshold
                        dnorm(nd, msd[, 1], msd[, 2])
                    } else {
                        object$tables[[v]][, nd]
                    }
                    prob[prob <= eps] <- threshold
                    prob
                }
            })), 1, sum)

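You can see there is an if-else condition: if the attribute is not numeric, naive Bayes is applied the way we would expect it to work. If the attribute is numeric, and here comes the bug, this naive Bayes automatically assumes a normal distribution. If you only have 0s and 1s in your text, dnorm is a very poor fit. I assume that, because dnorm produces very low values, the probabilities are always replaced by threshold, so the class with the higher a-priori factor always "wins".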
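To illustrate the effect, here is a small base-R sketch that mimics the numeric branch of the snippet above for a single term. The means and standard deviations are made-up values for a term that occurs in every charity row and in no private row, and `threshold = 0.001` and `eps = 0` are assumed here to be the default values.

```r
threshold <- 0.001   # assumed default replacement value
eps <- 0             # assumed default cut-off

## made-up per-class mean and sd of a 0/1 term column
class_mean <- c(charity = 1, private = 0)
class_sd   <- c(charity = 0, private = 0)

## as in the snippet: a standard deviation at or below eps becomes threshold
class_sd[class_sd <= eps] <- threshold

## density of observing the term once (nd = 1) under each class
dens <- setNames(dnorm(1, mean = class_mean, sd = class_sd), names(class_mean))

## densities at or below eps are replaced by threshold as well
dens[dens <= eps] <- threshold
print(dens)
```

In this sketch the charity value is a density far above 1 rather than a probability, while the private value underflows to zero and is clamped to threshold, which is the kind of distortion described above.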

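However, if I understand your problem correctly, you do not even need prediction; you want the a-priori factors that tell you which department gives money to whom. In that case all you have to do is look inside your model. For each term, the model holds a table of per-class probabilities (in model$tables), which is what I assume you are looking for. Let's do that with a slightly modified version of your example: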

## i have changed the vectors slightly
first <- "environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable"
second <- "west_north_west customer+tenancy premises h.r.a._special_maintenance"

categories <- c("charity", "private")

library(tm)
library(e1071)

getMatrix <- function(chrVect){
    testsource <- VectorSource(chrVect)
    testcorpus <- Corpus(testsource)
    testcorpus <- tm_map(testcorpus,stripWhitespace)
    testcorpus <- tm_map(testcorpus, removeWords,stopwords("english"))
    ## testmatrix <- t(TermDocumentMatrix(testcorpus))
    ## instead just use DocumentTermMatrix; the assignment is superfluous
    return(DocumentTermMatrix(testcorpus))
}

## since you did not supply some more data, I cannot do anything about these lines
## trainmatrix <- getMatrix(traindata$cats)
## testmatrix <- getMatrix(testdata$cats)
## instead only
trainmatrix <- getMatrix(c(first, second))

## I prefer a data frame over a bare matrix as I can add the categories more easily
## (as.matrix() replaces inspect(), which is meant for printing)
traindf <- data.frame(categories = factor(categories), as.data.frame(as.matrix(trainmatrix)))

## now convert every term column to a factor, since numeric columns go down the dnorm path
for (cols in names(traindf[-1])) traindf[[cols]] <- factor(traindf[[cols]])
## traindf <- apply(traindf, 2, as.factor) did not result in factors

## check if it's as we wished
str(traindf)

## it is
## let's create a model  (with formula syntax)
model <- naiveBayes(categories~., data=traindf)

## if you look at the output (doubled to see it more clearly)
predict(model, newdata=rbind(traindf[-1], traindf[-1]))

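But as I already said, you do not need prediction. A look at the model is enough: for example, model$tables$premises gives you the probability linking premises to money going to private companies, which is 100% here.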
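For categorical predictors, tables of this kind are essentially per-class relative frequencies. As a rough illustration of what such a table contains, the following base-R sketch computes one by hand; the rows and variable names are invented for this example, and no smoothing is applied.

```r
## made-up training rows: class label and whether the term "premises" occurs
toy <- data.frame(
    categories = factor(c("charity", "charity", "private", "private", "private")),
    premises   = factor(c(0, 0, 1, 1, 0), levels = c(0, 1))
)

## share of rows with and without the term inside each class (rows sum to 1)
cond <- prop.table(table(toy$categories, toy$premises), margin = 1)
print(cond)

## the other direction: among rows containing the term, the share per class
by_term <- prop.table(table(toy$categories, toy$premises), margin = 2)
print(by_term["private", "1"])
```

The first table reads as "given the class, how often does the term occur", the second as "given the term, which class does the row belong to". Which of the two answers the original question is worth checking against the real data.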

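If you are working with very large datasets, you should specify threshold and eps (both are arguments of predict for the e1071 model). eps defines the limit at or below which threshold is substituted. For example, eps = 0 and threshold = 0.000001 can be used.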

Furthermore, you should stick with term-frequency weighting. tf*idf, for example, will not work because of the dnorm in naive Bayes.

Hope I can finally get my 50 reputation :P

Regarding "r - Document term matrix for naive Bayes classifier: unexpected results R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21163207/
