
R - How to tune a text classifier created with RTextTools

Reposted. Author: 行者123. Updated: 2023-11-30 09:08:56

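I am trying to create a text classifier with the RTextTools library in R. The training and test data frames have the same format: two columns, the first holding the text and the second the label.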

A minimal reproducible example of my program so far (with placeholder data):

# Packages
## Install (the package names must be passed as a single character vector)
install.packages(c('e1071', 'RTextTools'))
## Import
library(e1071)
library(RTextTools)

data.train <- data.frame("content" = c("Lorem Ipsum is simply dummy text of the printing and typesetting industry.", "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.", "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged."), "label" = c("yes", "yes", "no"))
data.test <- data.frame("content" = c("It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout.", "The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English.", "Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy."), "label" = c("no", "yes", "yes"))

# Process training dataset
data.train.dtm <- create_matrix(data.train$content, language = "english", weighting = tm::weightTfIdf, removePunctuation = TRUE, removeNumbers = TRUE, removeSparseTerms = 0, removeStopwords = TRUE, stemWords = TRUE, stripWhitespace = TRUE, toLower = TRUE)
data.train.container <- create_container(data.train.dtm, data.train$label, trainSize = 1:nrow(data.train), virgin = FALSE)

# Create linear SVM model
# Note: 1^-2 evaluates to 1 in R (10^-2 would give 0.01)
model.linear <- train_model(data.train.container, "SVM", kernel = "linear", cost = 10, gamma = 1^-2)

# Process testing dataset
data.test.dtm <- create_matrix(data.test$content, originalMatrix = data.train.dtm)
data.test.container <- create_container(data.test.dtm, labels = rep(0, nrow(data.test)), testSize = 1:nrow(data.test), virgin = FALSE)

# Classify testing dataset
model.linear.results <- classify_model(data.test.container, model.linear)
model.linear.results.table <- table(Predicted = model.linear.results$SVM_LABEL, Actual = data.test$label)
model.linear.results.table

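So far my code works and produces a table comparing the predicted values with the actual values. The results are very inaccurate, though, and it is clear to me that the model needs tuning.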

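I know that the e1071 library (which RTextTools builds on) has a tune.svm function that returns the cost and gamma values giving the best results. The problem is that the data argument of tune.svm expects a data frame, but since I am building a text classifier, what I feed to the SVM is not a simple data frame but a document-term matrix.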

I tried reading the DTM as a data frame, as shown below, to no avail:

model.tuned <- tune.svm(label~., data = as.data.frame(data.train.dtm), gamma = 10^(-6:-1), cost = 10^(-1:1))

I am completely lost; any insight would be much appreciated.

Best answer

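You can look at the code of train_model (press F2 in RStudio) to see how it calls svm() with the container (data.train.container in your case). By default, train_model passes the following arguments to svm():

  • cross=0 (no cross-validation is performed on the training data)
  • cost=100 (the cost of constraint violations)
  • probability=TRUE (the model should allow probability predictions)
  • kernel="radial" (a radial kernel is used for SVM training)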
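As a rough illustration of what those defaults mean, the sketch below calls e1071's svm() directly with the same four values. The small random matrix x and the factor y are made-up stand-ins for the container's training_matrix and training_codes slots, and the call is an approximation of what the answer describes, not a copy of the RTextTools source.

```r
library(e1071)

# Stand-in data (hypothetical): 20 "documents" with 2 numeric features each.
set.seed(1)
x <- matrix(rnorm(40), nrow = 20, ncol = 2)
y <- factor(rep(c("yes", "no"), each = 10))
x[y == "yes", ] <- x[y == "yes", ] + 2  # shift one class so the problem is learnable

# The defaults listed above, passed straight to svm().
fit <- svm(x = x, y = y,
           cross = 0,           # no cross-validation on the training data
           cost = 100,          # cost of constraint violations
           probability = TRUE,  # keep what is needed for probability predictions
           kernel = "radial")   # radial basis kernel

# probability = TRUE at training time is what makes this call possible.
pred <- predict(fit, x, probability = TRUE)
head(attr(pred, "probabilities"))
```

Any value set explicitly in the train_model call, such as kernel = "linear" and cost = 10 in the question, replaces the corresponding default.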

To actually answer your question: the container returned by create_container() has the slots training_matrix and training_codes, which you can use as follows:

model.tuned <- tune.svm(x = data.train.container@training_matrix,
                        y = data.train.container@training_codes,
                        gamma = 10^(-6:-1),
                        cost = 10^(-1:1)
                        # add any other SVM parameters here as needed
                        )
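Once tuning has run, the returned object holds the winning parameter combination and a model refitted with it. The sketch below shows how that result might be inspected. It uses a made-up dense matrix x and factor y in place of the container slots, and it sets the number of folds explicitly through tune.control(): tune.svm uses 10-fold cross-validation by default, which the three-document example in the question is too small for.

```r
library(e1071)

# Stand-in data (hypothetical): 60 rows, 5 numeric features, two classes.
set.seed(42)
x <- matrix(rnorm(60 * 5), nrow = 60, ncol = 5)
y <- factor(ifelse(x[, 1] + x[, 2] > 0, "yes", "no"))

# Same grid as in the answer: 6 gamma values x 3 cost values = 18 combinations.
tuned <- tune.svm(x = x, y = y,
                  gamma = 10^(-6:-1),
                  cost = 10^(-1:1),
                  tunecontrol = tune.control(sampling = "cross", cross = 5))

tuned$best.parameters    # one-row data frame with the selected gamma and cost
tuned$best.performance   # cross-validated error of that combination
head(tuned$performances) # error for each combination in the grid

# best.model is an svm object refitted with the selected parameters.
best <- tuned$best.model
table(Predicted = predict(best, x), Actual = y)
```

The confusion table here is computed on the training rows only to show the calls; an honest accuracy estimate would come from held-out data such as data.test in the question.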

Regarding "R - How to tune a text classifier created with RTextTools", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45381704/
