r - Machine learning with the RTextTools library

Reposted. Author: 行者123. Updated: 2023-11-30 08:46:32

I have the following training set:

Text,y
MRR 93345,1
MRR 93434,1
MRR 93554,1
MRR 938900,1
MRR 93970,1
MRR 937899,1
MRR 93868,1
MRR 938769,1
MRR 93930,1
MRR 92325,1
MRR 931932,1
MRR 933922,1
MRR 934390,1
MRR 93204,1
MRR 93023,1
MRR 930982,1
MRR 87678,-1
MRR 87956,-1
MRR 87890,-1
MRR 878770,-1
MRR 877886,-1
MRR 87678367,-1
MRR 8790,-1
MRR 87345,-1
MRR 87149,-1
MRR 873790,-1
MRR 873493,-1
MRR 874303,-1
MRR 874343,-1
MRR 874304,-1
MRR 879034,-1
MRR 879430,-1
MRR 87943,-1
MRR 879434,-1
MRR 871984,-1
MRR 873949,-1

My code is as follows:

# Create the document term matrix
dtMatrix <- create_matrix(data["Text"],language="english", removePunctuation=TRUE, stripWhitespace=TRUE,
toLower=TRUE,
removeStopwords=TRUE,
stemWords=TRUE, removeSparseTerms=.998)

# Configure the training data
container <- create_container(dtMatrix, data$y, trainSize=1:nrow(dtMatrix), virgin=FALSE)
# train a SVM Model
model <- train_model(container, "SVM", kernel="linear" ,cost=1)

# new data
predictionData <- list("MRR 93111")

# create a prediction document term matrix
predMatrix <- create_matrix(predictionData, originalMatrix=dtMatrix,language="english", removePunctuation=TRUE, stripWhitespace=TRUE,
toLower=TRUE,
removeStopwords=TRUE,
stemWords=TRUE, removeSparseTerms=.998)

# create the corresponding container
predSize = length(predictionData);
predictionContainer <- create_container(predMatrix, labels=rep(0,predSize), testSize=1:predSize, virgin=FALSE)

# predict
results <- classify_model(predictionContainer, model)

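Now, using the train_model function, I want to predict y=1 for MRR 93111. In other words, if the string starts with "MRR 93" the output should be 1, while the stem "MRR 87" should give -1. In practice it does not work, because I get MRR 93111 -1 0.5778781

In addition, if I order the training set differently, or run the script several times on the same dataset, the results seem to change, which sounds strange to me.

Update 1: dput(data)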

structure(list(Text = structure(c(26L, 28L, 30L, 34L, 36L, 31L, 
32L, 33L, 35L, 21L, 24L, 27L, 29L, 25L, 22L, 23L, 10L, 20L, 14L,
13L, 12L, 11L, 15L, 3L, 1L, 5L, 4L, 7L, 9L, 8L, 16L, 18L, 17L,
19L, 2L, 6L), .Label = c("MRR 87149", "MRR 871984", "MRR 87345",
"MRR 873493", "MRR 873790", "MRR 873949", "MRR 874303", "MRR 874304",
"MRR 874343", "MRR 87678", "MRR 87678367", "MRR 877886", "MRR 878770",
"MRR 87890", "MRR 8790", "MRR 879034", "MRR 87943", "MRR 879430",
"MRR 879434", "MRR 87956", "MRR 92325", "MRR 93023", "MRR 930982",
"MRR 931932", "MRR 93204", "MRR 93345", "MRR 933922", "MRR 93434",
"MRR 934390", "MRR 93554", "MRR 937899", "MRR 93868", "MRR 938769",
"MRR 938900", "MRR 93930", "MRR 93970"), class = "factor"), Y = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, -1L,
-1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L,
-1L, -1L, -1L, -1L, -1L, -1L)), .Names = c("Text", "Y"), class = "data.frame", row.names = c(NA,
-36L))

Best answer

Your problem is that your code treats both the training data and the text to classify at the word level:

> dtMatrix$dimnames$Terms
[1] "87149" "871984" "87345" "873493" "873790" "873949" "874303" "874304" "874343" "87678" "87678367"
[12] "877886" "878770" "87890" "8790" "879034" "87943" "879430" "879434" "87956" "92325" "93023"
[23] "930982" "93111" "931932" "93204" "93345" "933922" "93434" "934390" "93554" "937899" "93868"
[34] "938769" "938900" "93930" "93970" "mrr"
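To make the word-level view concrete, here is a minimal base-R sketch. It does not call RTextTools; `tokenize_words` is a hypothetical helper that only approximates word-level tokenization, and the training vector is a small subset of the question's data. It shows that, when the vocabulary comes from the training rows alone, the new document shares nothing with it except "mrr":

```r
# A few of the training strings from the question (subset, for brevity)
train <- c("MRR 93345", "MRR 93434", "MRR 87678", "MRR 87956")
new_doc <- "MRR 93111"

# Hypothetical helper: lower-case and split on whitespace,
# a rough stand-in for word-level tokenization
tokenize_words <- function(x) strsplit(tolower(x), "[[:space:]]+")

# Vocabulary built from the training rows only
vocab <- unique(unlist(tokenize_words(train)))
new_tokens <- tokenize_words(new_doc)[[1]]

# Tokens the new document has in common with the training vocabulary
shared <- intersect(new_tokens, vocab)
print(shared)
```

Because "mrr" occurs in every row of both classes, it carries no information about the label, which would explain a prediction that is close to a coin flip.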

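I am not entirely sure how the SVM handles these numeric strings, but it does not seem to care much about the 93 part of the string. Each ID is a single term, so the 93 prefix never appears as a feature of its own. Splitting the strings into characters gives more weight to the individual digits: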

df$Text <- sapply(1:length(df$Text), function(i) paste(unlist(strsplit(df$Text[i], split = "")), collapse = " "))
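As a small illustration of what this transformation produces, the following base-R sketch applies the same split-and-rejoin idea to two example strings. `split_chars` is a hypothetical helper name, and `as.character()` is added so that factor input (as in the question's `dput` output) would also work:

```r
# Hypothetical helper: split each string into single characters
# and rejoin them with spaces, so each character becomes a "word"
split_chars <- function(x) {
  vapply(strsplit(as.character(x), split = ""),
         function(chars) paste(chars, collapse = " "),
         character(1))
}

ids <- c("MRR 93111", "MRR 87678")
spaced <- split_chars(ids)
print(spaced)
```

Note that a document-term matrix built from this text records only how often each character occurs, not where in the string it occurs.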

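I used df instead of data because data is already an existing object when RTextTools is loaded (base R defines a data() function), and it caused me some problems when running the code. When creating the matrix, the minimum word length option has to be changed, otherwise the single-character terms are dropped.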

dtMatrix <- create_matrix(df$Text,language="english", minWordLength=1, #!
removePunctuation=TRUE, stripWhitespace=TRUE,
toLower=TRUE, removeStopwords=TRUE,
stemWords=TRUE, removeSparseTerms=.998)

Now we get:

> dtMatrix$dimnames$Terms
 [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "m" "r"

More importantly:

> results 
SVM_LABEL SVM_PROB
1 1 0.9144185
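Since the question's rule depends on how the ID starts, a different option (not the approach used in this answer) would be to expose the two-digit prefix as its own token. The base-R sketch below is only an assumption about how that could be prepared; `id_prefix` and the "p" marker are hypothetical, and it has not been run through create_matrix:

```r
ids <- c("MRR 93345", "MRR 87678", "MRR 93111")

# Hypothetical helper: extract the first two digits after "MRR "
id_prefix <- function(x) sub("^MRR ([0-9]{2}).*$", "\\1", x)

prefixes <- id_prefix(ids)

# Prepend a prefix token so that a word-level matrix could treat it as a term
with_prefix <- paste0("p", prefixes, " ", ids)
print(with_prefix)
```

With such a token, the training rows and the new row would share a term that reflects the start of the ID, which the character counts alone do not capture.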

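I recently attended a workshop on RTextTools and SVMs, where it was mentioned that with an SVM you get slightly different results each time you train the model. I am not entirely sure why, so I will not try to explain it, but a free book called "An Introduction to Statistical Learning with Applications in R" was recommended to us for reading up on support vector machines.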
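One plausible explanation, which the answer itself does not confirm: RTextTools trains its SVM through the e1071 package (an interface to libsvm), and libsvm's probability estimates rely on an internal cross-validation whose splits are chosen at random. If that is the source of the variation, fixing the random seed with `set.seed()` before calling train_model would be the usual way to try to get repeatable numbers. The sketch below uses base R only to show the general effect of a seed on random fold assignment; `assign_folds` is a hypothetical helper, not part of either package:

```r
# Hypothetical helper: randomly assign n rows to k cross-validation folds
assign_folds <- function(n, k) sample(rep_len(seq_len(k), n))

# Same seed, same fold assignment
set.seed(123)
run1 <- assign_folds(36, 5)

set.seed(123)
run2 <- assign_folds(36, 5)

print(identical(run1, run2))
```

Without the second `set.seed()` call, the next assignment would continue the random stream and would generally differ.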

The full code is as follows:

library(RTextTools)

df <- structure(list(Text = structure(c(26L, 28L, 30L, 34L, 36L, 31L, 
32L, 33L, 35L, 21L, 24L, 27L, 29L, 25L, 22L, 23L, 10L, 20L, 14L,
13L, 12L, 11L, 15L, 3L, 1L, 5L, 4L, 7L, 9L, 8L, 16L, 18L, 17L,
19L, 2L, 6L), .Label = c("MRR 87149", "MRR 871984", "MRR 87345",
"MRR 873493", "MRR 873790", "MRR 873949", "MRR 874303", "MRR 874304",
"MRR 874343", "MRR 87678", "MRR 87678367", "MRR 877886", "MRR 878770",
"MRR 87890", "MRR 8790", "MRR 879034", "MRR 87943", "MRR 879430",
"MRR 879434", "MRR 87956", "MRR 92325", "MRR 93023", "MRR 930982",
"MRR 931932", "MRR 93204", "MRR 93345", "MRR 933922", "MRR 93434",
"MRR 934390", "MRR 93554", "MRR 937899", "MRR 93868", "MRR 938769",
"MRR 938900", "MRR 93930", "MRR 93970"), class = "factor"), Y = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, -1L,
-1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L, -1L,
-1L, -1L, -1L, -1L, -1L, -1L)), .Names = c("Text", "Y"), class = "data.frame", row.names = c(NA,
-36L))

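# strsplit() needs character input, not a factor
df$Text <- as.character(df$Text)

# Append the new, unlabelled document as row 37
# (the empty label coerces Y to character; it is only a placeholder)
df[nrow(df)+1,] <- c("MRR 93111","")

# Split every string into space-separated single characters
df$Text <- sapply(1:length(df$Text), function(i) paste(unlist(strsplit(df$Text[i], split = "")), collapse = " "))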

# Create the document term matrix
dtMatrix <- create_matrix(df$Text,language="english", minWordLength=1,
removePunctuation=TRUE, stripWhitespace=TRUE,
toLower=TRUE, removeStopwords=TRUE,
stemWords=TRUE, removeSparseTerms=.998)


# Inspect the terms and documents in the matrix
dtMatrix$dimnames$Terms
dtMatrix$dimnames$Docs

# Configure the container: rows 1-36 are training data, row 37 is the new document
container <- create_container(dtMatrix, labels=df$Y, trainSize=1:36, testSize=37, virgin=TRUE)

# Train an SVM model
model <- train_model(container, "SVM", kernel="linear", cost=1)

# Classify the new document
results <- classify_model(container, model)

results

Regarding r - Machine learning with the RTextTools library, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42746226/
