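I'm having some very annoying problems getting a naive Bayes classifier to work with a document-term matrix. I'm sure I'm making a very simple mistake, but I can't figure out what it is. My data comes from an accounts spreadsheet. I've been asked to figure out which categories (in text format: mostly department names or budget names) are more likely to spend money on charities, and which mostly (or only) spend on private companies. It was suggested that I use a naive Bayes classifier for this. I have about a thousand rows of data to train the model and several hundred thousand rows to test it on. I have prepared the strings by replacing spaces with underscores and ands/&s with +, then treating each category as one term: so "alcohol and drug addiction" becomes alcohol+drug_addiction.
Some example rows: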
"environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable" -> This row went to a charity
"west_north_west customer+tenancy premises h.r.a._special_maintenance" -> This row went to a private company.
library(tm)
library(e1071)
getMatrix <- function(chrVect){
    testsource <- VectorSource(chrVect)
    testcorpus <- Corpus(testsource)
    testcorpus <- tm_map(testcorpus,stripWhitespace)
    testcorpus <- tm_map(testcorpus, removeWords,stopwords("english"))
    testmatrix <- t(TermDocumentMatrix(testcorpus))
}
trainmatrix <- getMatrix(traindata$cats)
testmatrix <- getMatrix(testdata$cats)
model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$Code))
rs<- predict(model, as.matrix(testdata$cats))
output:
rs   1  2
  1  0  0
  2 19 17
Best answer
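I have run into this problem myself. As far as I can see you have done everything right; the naive Bayes implementation in e1071 (and therefore in klaR) is buggy.
But there is a quick and easy fix that makes naive Bayes as implemented in e1071 work again: change your text vectors to categorical variables, i.e. use as.factor. You have already done this for your target variable traindata$Code, but you also have to do it for your trainmatrix and then, of course, for your testdata.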
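The conversion described above could look roughly like the following sketch in base R. The data frames, column names, and the helper as_factor_like are hypothetical and not part of the original question. The sketch assumes the test data contains every column of the training data.

```r
# Hypothetical term counts; the column names are illustrative only.
train_counts <- data.frame(premises = c(0, 1), strategy = c(1, 0))
test_counts  <- data.frame(premises = c(1, 1, 0), strategy = c(0, 0, 0))

# Hypothetical helper: convert every column to a factor, taking the level set
# from the reference (training) data so both sets are encoded identically.
as_factor_like <- function(df, reference) {
  for (col in names(reference)) {
    lv <- sort(unique(reference[[col]]))
    df[[col]] <- factor(df[[col]], levels = lv)
  }
  df
}

train_f <- as_factor_like(train_counts, train_counts)
test_f  <- as_factor_like(test_counts, train_counts)
str(test_f)
```

Taking the levels from the training data keeps the factor codes aligned between the two sets. A test value that never occurs in training becomes NA under this scheme.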
I could not track the bug down 100 %, but it lies in this part of the naive Bayes implementation in e1071 (I may note that klaR is only a wrapper around e1071):
L <- log(object$apriori) + apply(log(sapply(seq_along(attribs),
function(v) {
nd <- ndata[attribs[v]]
## nd is now a cell, row i, column attribs[v]
if (is.na(nd) || nd == 0) {
rep(1, length(object$apriori))
} else {
prob <- if (isnumeric[attribs[v]]) {
## we select table for attribute
msd <- object$tables[[v]]
## if stddev is eqlt eps, assign threshold
msd[, 2][msd[, 2] <= eps] <- threshold
dnorm(nd, msd[, 1], msd[, 2])
} else {
object$tables[[v]][, nd]
}
prob[prob <= eps] <- threshold
prob
}
})), 1, sum)
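In this code, threshold is substituted for every probability that is at or below eps. As a result, the class with the higher prior will always "win".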
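A toy calculation in base R can illustrate this effect. The priors and likelihoods below are made-up numbers, not values from the question's data, and the sketch assumes the extreme case in which every likelihood falls at or below eps.

```r
# Hypothetical class priors; not taken from the question's data.
apriori   <- c(charity = 0.4, private = 0.6)
threshold <- 0.001
eps       <- 0

# Rows are classes, columns are terms. Zeros stand for likelihoods that
# fall at or below eps (assumed here for every term and every class).
lik <- rbind(charity = c(0, 0, 0),
             private = c(0, 0, 0))

# Same replacement rule as in the excerpt: prob[prob <= eps] <- threshold
lik[lik <= eps] <- threshold

# Log prior plus summed log likelihoods, as in the excerpt.
log_post <- log(apriori) + rowSums(log(lik))
winner   <- names(which.max(log_post))
print(winner)
```

Once every likelihood has been replaced by the same threshold, the likelihood terms are identical for both classes and only the prior separates them.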
## i have changed the vectors slightly
first <- "environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable"
second <- "west_north_west customer+tenancy premises h.r.a._special_maintenance"
categories <- c("charity", "private")
library(tm)
library(e1071)
getMatrix <- function(chrVect){
    testsource <- VectorSource(chrVect)
    testcorpus <- Corpus(testsource)
    testcorpus <- tm_map(testcorpus,stripWhitespace)
    testcorpus <- tm_map(testcorpus, removeWords,stopwords("english"))
    ## testmatrix <- t(TermDocumentMatrix(testcorpus))
    ## instead just use DocumentTermMatrix, the assignment is superfluous
    return(DocumentTermMatrix(testcorpus))
}
## since you did not supply some more data, I cannot do anything about these lines
## trainmatrix <- getMatrix(traindata$cats)
## testmatrix <- getMatrix(testdata$cats)
## instead only
trainmatrix <- getMatrix(c(first, second))
## I prefer building a data frame because I can add the categories more easily
## as.matrix() is used here because inspect() in recent tm versions may return only a sample
## factor(categories) keeps the response categorical regardless of stringsAsFactors
traindf <- data.frame(categories = factor(categories), as.data.frame(as.matrix(trainmatrix)))
## now convert every term column to a factor, as described above
for (cols in names(traindf[-1])) traindf[[cols]] <- factor(traindf[[cols]])
## traindf <- apply(traindf, 2, as.factor) did not result in factors
## check if it's as we wished
str(traindf)
## it is
## let's create a model (with formula syntax)
model <- naiveBayes(categories~., data=traindf)
## if you look at the output (doubled to see it more clearly)
predict(model, newdata=rbind(traindf[-1], traindf[-1]))
model$tables$premises
will give you the likelihoods for the term premises: 100 % that the money went to a private company.
With predict(), eps = 0 and threshold = 0.000001 can be used.
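A minimal sketch of passing these two arguments to predict() for an e1071 naiveBayes model follows. The training and test frames are hypothetical toy data with factor-encoded term columns, not the question's data.

```r
library(e1071)

# Hypothetical training frame: term presence encoded as factors.
traindf <- data.frame(
  categories = factor(c("charity", "private", "charity", "private")),
  premises   = factor(c(0, 1, 0, 1), levels = c(0, 1)),
  strategy   = factor(c(1, 0, 1, 0), levels = c(0, 1))
)
model <- naiveBayes(categories ~ ., data = traindf)

# Hypothetical new rows, encoded with the same factor levels as the training data.
newdf <- data.frame(
  premises = factor(c(1, 0), levels = c(0, 1)),
  strategy = factor(c(0, 1), levels = c(0, 1))
)

# eps and threshold are arguments of predict() for naiveBayes objects.
pred <- predict(model, newdata = newdf, eps = 0, threshold = 0.000001)
print(pred)
```

Using type = "raw" in the same call would return class probabilities instead of labels, which can help when checking how much the threshold influences a prediction.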
Regarding "r - Document-term matrix for naive Bayes classifier: unexpected results R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21163207/