
r - LIME explain in R: feature names stored in `object` and `newdata` are different

Reposted. Author: 行者123. Updated: 2023-12-04 11:10:15

Hi, I'm working on explaining a model with LIME in R. When I run this part, everything works fine.

# Library
library(tm)
library(SnowballC)
library(caTools)
library(RWeka)
library(caret)
library(text2vec)
library(lime)

# Importing the dataset
dataset_original = read.delim('Restaurant_Reviews.tsv', quote = '', stringsAsFactors = FALSE)
dataset_original$Liked = as.factor(dataset_original$Liked)

# Splitting the dataset into the Training set and Test set
set.seed(123)
split = sample.split(dataset_original$Liked, SplitRatio = 0.8)
training_set = subset(dataset_original, split == TRUE)
test_set = subset(dataset_original, split == FALSE)

# Create & clean corpus
# Clean corpus function
clean_text <- function(text) {
  corpus = VCorpus(VectorSource(text))
  corpus = tm_map(corpus, content_transformer(tolower))
  corpus = tm_map(corpus, removeNumbers)
  corpus = tm_map(corpus, removePunctuation)
  corpus = tm_map(corpus, removeWords, stopwords())
  corpus = tm_map(corpus, stemDocument)
  corpus = tm_map(corpus, stripWhitespace)
  return(corpus)
}

# N-gram function (unigrams and bigrams)
BigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 2))
}

# Create DTM
dtm <- function(text) {
  corpus = VCorpus(VectorSource(text))
  dtm = DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf, tokenize = BigramTokenizer))
  dataset = as.data.frame(as.matrix(dtm))
  dataset = dataset[, order(names(dataset))]
  return(dataset)
}

# Cleaning train & test text
for (i in seq(nrow(training_set))) {
  training_set$clean_text[i] = as.character(clean_text(training_set$Review)[[i]])
  print(i)
}

for (i in seq(nrow(test_set))) {
  test_set$clean_text[i] = as.character(clean_text(test_set$Review)[[i]])
  print(i)
}

#Create document term matrix
dataset_train <- dtm(training_set$clean_text)
dataset_test <- dtm(test_set$clean_text)

#Drop new words in test set & ensure same number of columns as train set
test_colname <- colnames(dataset_test)[colnames(dataset_test) %in% colnames(dataset_train)]
test_colname <- test_colname[!is.na(test_colname)] #Remove NA
new_test_colname <- colnames(dataset_train)[!(colnames(dataset_train) %in% test_colname)] #Columns in train not in test
dataset_test <- dataset_test[,test_colname]
dataset_test[new_test_colname] <- 0
dataset_test = dataset_test[,order(names(dataset_test))]

dataset_train = as.matrix(dataset_train)
dataset_test = as.matrix(dataset_test)

#xgboost caret model
set.seed(123)
model <- train(dataset_train, training_set$Liked, method="xgbTree")
predict(model, newdata=dataset_test)


However, when I run this part:

######
#LIME#
######
explainer <- lime(training_set$Review, model, preprocess = dtm)
explanation <- explain(training_set$Review[1], explainer, n_labels = 1, n_features = 5)
plot_features(explanation)


It says:

 Error in predict.xgb.Booster(modelFit, newdata) : 
Feature names stored in `object` and `newdata` are different!


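Before running this code, I made sure that my training and test data have the same column names and the same number of columns. I also looked around and found that my problem resembles the one in this post, but I still don't understand how that thread applies to my case: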
R: LIME returns error on different feature numbers when it's not the case
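
A quick way to test the "same column names and number" claim is to compare the two sets of names directly. The following is a minimal base R sketch; `compare_columns` is a hypothetical helper and the toy names stand in for the real `colnames(dataset_train)` and `colnames(dataset_test)`. It reports names that are missing, names that are extra, and whether the order matches.

```r
# Hypothetical helper (not from any package): report how two sets of
# column names differ.
compare_columns <- function(train_cols, new_cols) {
  list(
    missing_in_new = setdiff(train_cols, new_cols),   # in training, absent in new data
    extra_in_new   = setdiff(new_cols, train_cols),   # in new data, unseen in training
    same_order     = identical(train_cols, new_cols)  # TRUE only if names AND order match
  )
}

# Toy column names standing in for document-term-matrix terms
res <- compare_columns(c("bad", "good", "tasti"), c("tasti", "bad", "good"))
str(res)
```

In this toy case both sets contain the same names, so the two `setdiff` results are empty, yet `same_order` is `FALSE` because the columns are arranged differently.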

I've spent weeks working on this and searching online, to no avail, so any help or guidance on what I should do would be greatly appreciated!



My data:

Dataset: https://drive.google.com/file/d/1-pzY7IQVyB_GmT5dT0yRx3hYzOFGrZSr/view?usp=sharing

Best answer

I ran into the same problem when I updated the xgboost package from v0.6.xxx to v0.7.xxx.

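I solved it by making sure that the training and test sets not only have the same column names but also have their columns in the same order.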
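
One way to enforce "same names, same order" is sketched below in base R. `align_columns` is a hypothetical helper, not part of xgboost, caret, or lime, and the toy matrices stand in for real document-term matrices. It drops columns the training set never saw, adds training columns that are missing as zeros, and arranges everything in the training order.

```r
# Hypothetical helper (not part of xgboost, caret, or lime): make `newdata`
# carry exactly the columns in `train_cols`, in the same order.
align_columns <- function(newdata, train_cols) {
  newdata <- as.matrix(newdata)
  # Start from an all-zero matrix that already has the training layout
  aligned <- matrix(0, nrow = nrow(newdata), ncol = length(train_cols),
                    dimnames = list(rownames(newdata), train_cols))
  # Copy over only the columns that both matrices share
  shared <- intersect(train_cols, colnames(newdata))
  aligned[, shared] <- newdata[, shared, drop = FALSE]
  aligned
}

# Toy data standing in for document-term matrices
train <- matrix(1:6, nrow = 2,
                dimnames = list(NULL, c("bad", "good", "tasti")))
test  <- matrix(c(5, 6, 7, 8, 9, 10), nrow = 2,
                dimnames = list(NULL, c("tasti", "slow", "bad")))

test_aligned <- align_columns(test, colnames(train))
identical(colnames(test_aligned), colnames(train))
```

For the code in the question, a plausible (but unverified here) application is inside the function passed as `preprocess` to `lime()`. That function is called on perturbed versions of the text, so a freshly built DTM will generally not have the training vocabulary unless it is aligned to `colnames(dataset_train)` before the model predicts on it.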

Hope this works for you.

Regarding "r - LIME explain in R: feature names stored in `object` and `newdata` are different", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51296577/
