
r - How to use the stemCompletion function (tm package) to complete a stemmed corpus from a dictionary

Reposted. Author: 行者123. Updated: 2023-12-04 09:47:42

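I am having a problem with the tm package in R. I am using version 0.6.2. The following issues (2 different errors) have already been answered here and here, but I still get errors after applying the posted solutions. Please click here to download the dataset (only 93 rows). This is a reproducible example. The errors are as follows: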

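  1. (Solved) Error in UseMethod("meta", x): no applicable method for 'meta' applied to an object of class "character"

  2. Error: inherits(doc, "TextDocument") is not TRUE

  3. tm_map(ds.corpus, PlainTextDocument) does not create a plain text document in this case. inherits(ds.cleanCorpus, "TextDocument") # returns FALSE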

Please tell me what is wrong with my approach.

--

  # Data import
df.imp<- read.csv("Phone2_Sample100_NegPos.csv", header = TRUE, as.is = TRUE)

##### Data Pre-Processing

install.packages("tm")
require(tm)

ds.corpus<- Corpus(VectorSource(df.imp$Content))

ds.corpus<- tm_map(ds.corpus, content_transformer(tolower))
ds.corpus<- tm_map(ds.corpus, content_transformer(removePunctuation))
ds.corpus<- tm_map(ds.corpus, content_transformer(removeNumbers))
removeURL<- function(x) gsub("http[[:alnum:]]*", "", x)
ds.corpus<- tm_map(ds.corpus,removeURL)

stopwords.default<- stopwords("english")
stopWordsNotDeleted<- c("isn't" , "aren't" , "wasn't" , "weren't" , "hasn't" ,
"haven't" , "hadn't" , "doesn't" , "don't" ,"didn't" ,
"won't" , "wouldn't", "shan't" , "shouldn't", "can't" ,
"cannot" , "couldn't" , "mustn't", "but","no", "nor", "not", "too", "very")

stopWord.new<- stopwords.default[! stopwords.default %in% stopWordsNotDeleted] ## new Stopwords list
ds.corpus<- tm_map(ds.corpus, removeWords, stopWord.new )

copy<- ds.corpus ## creating a copy to be used as a dictionary

ds.corpus<- tm_map(ds.corpus, stemDocument)

## error Statement #1
ds.corpus<- stemCompletion(ds.corpus, dictionary = copy)
## Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character"




ds.cleanCorpus<- tm_map(ds.corpus, PlainTextDocument) ## creating plain text document

class(ds.cleanCorpus) ## output is "VCorpus" "Corpus". What should it be?

## error Statement #2
tdm<- TermDocumentMatrix(ds.corpus) ## creating term document matrix

inherits(ds.cleanCorpus, "TextDocument") ## returns FALSE

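Update: I figured out the first error. The x argument of stemCompletion should be a character vector, while the dictionary can be either a corpus or a character vector. However, when I tried it on the first document of ds.corpus (a character vector), as shown below, the stemmed words were not completed; the output was just the same stemmed character vector as before.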

stemCompletion(ds.corpus[[1]]$content, dictionary = copy) 

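So my main question now is: how do I complete a stemmed corpus from a dictionary (tm package)? stemCompletion does not seem to work (on a character vector). Second, how do I complete the stems for the whole corpus? Should I run a for loop over the content of each document?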

Best answer

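There are two things you need to change:

  1. When you use a custom function, you need to wrap it in content_transformer. Without the wrapper, tm_map replaces each document with a bare character string, which is what triggers the inherits(doc, "TextDocument") error later on.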

    removeURL<- function(x) gsub("http[[:alnum:]]*", "", x)

    ds.corpus<- tm_map(ds.corpus,content_transformer(removeURL))

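  2. The purpose of the stemCompletion function is to try to complete a stemmed word (https://en.wikipedia.org/wiki/Stemming) based on a dictionary. The stemmed words must be a character vector; the dictionary can be a corpus.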

    x <- c("compan", "entit", "suppl")
    stemCompletion(x, copy)

Output:

     compan       entit       suppl 
  "company"          ""    "supply" 
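The empty result for "entit" simply means that no dictionary word begins with that stem. As a rough illustration of the idea only (this is not tm's implementation), the base-R sketch below completes each stem with the most frequent dictionary word that starts with it, which mirrors the spirit of stemCompletion's default type = "prevalent". The helper name complete_stem and the sample dictionary are made up for this example.

```r
# Hypothetical helper (not part of tm): complete each stem with the most
# frequent dictionary word that starts with it, or "" when nothing matches.
complete_stem <- function(stems, dictionary) {
  vapply(stems, function(stem) {
    candidates <- dictionary[startsWith(dictionary, stem)]
    if (length(candidates) == 0) return("")
    # table() counts each candidate; sort() puts the most frequent first
    names(sort(table(candidates), decreasing = TRUE))[1]
  }, character(1))
}

# Made-up dictionary: note that no word starts with "entit"
dictionary <- c("company", "company", "companies", "supply", "phone")
result <- complete_stem(c("compan", "entit", "suppl"), dictionary)
print(result)
```

The result is a character vector named by the input stems, the same shape that stemCompletion returns.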

Code to create the term-document matrix

# Data import
df.imp<- read.csv("data/Phone2_Sample100_NegPos.csv", header = TRUE, as.is = TRUE)

##### Data Pre-Processing

#install.packages("tm")
require(tm)

ds.corpus<- Corpus(VectorSource(df.imp$Content))

ds.corpus<- tm_map(ds.corpus, content_transformer(tolower))
ds.corpus<- tm_map(ds.corpus, content_transformer(removePunctuation))
ds.corpus<- tm_map(ds.corpus, content_transformer(removeNumbers))
removeURL<- function(x) gsub("http[[:alnum:]]*", "", x)
ds.corpus<- tm_map(ds.corpus,content_transformer(removeURL))


stopwords.default<- stopwords("english")
stopWordsNotDeleted<- c("isn't" , "aren't" , "wasn't" , "weren't" , "hasn't" ,
"haven't" , "hadn't" , "doesn't" , "don't" ,"didn't" ,
"won't" , "wouldn't", "shan't" , "shouldn't", "can't" ,
"cannot" , "couldn't" , "mustn't", "but","no", "nor", "not", "too", "very")

stopWord.new<- stopwords.default[! stopwords.default %in% stopWordsNotDeleted] ## new Stopwords list
ds.corpus<- tm_map(ds.corpus, removeWords, stopWord.new )

tdm<- TermDocumentMatrix(ds.corpus)
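The URL-removal step in the pipeline above can be checked on plain strings before it is wrapped in content_transformer. The sketch below uses only base R, and the sample strings are invented for illustration. It shows that the pattern http[[:alnum:]]* only swallows letters and digits directly after "http", so it removes a whole URL only because removePunctuation has already run and collapsed the URL into a single alphanumeric token.

```r
# Same helper as in the pipeline above
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

# Made-up sample strings
texts <- c(
  "great phone see httptcoabc123",   # URL after punctuation removal
  "no link here",                    # nothing to remove
  "visit http://example.com now"     # raw URL, punctuation still present
)

cleaned <- removeURL(texts)
print(cleaned)
```

In the third string only the literal "http" is dropped, because ":" is not alphanumeric and ends the match. If URLs had to be stripped before punctuation removal, a broader pattern would be needed.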

Example of stem completion

copy<- ds.corpus ## creating a copy to be used as a dictionary
x <- c("compan", "entit", "suppl")
stemCompletion(x, copy)
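For the follow-up question about completing every document rather than a hand-written vector, one possible approach is sketched below. The call on ds.corpus[[1]]$content in the question passes a whole document as one string, whereas stemCompletion expects one stem per element, so the sketch splits each document into words first. The helper completeDocument, the two sample sentences, and the choice to fall back to the original stem when no completion is found are all assumptions made for this example; stemDocument additionally requires the SnowballC package to be installed.

```r
library(tm)

# Made-up sample documents
docs <- c("the company supplies phones", "companies supply a phone")
corpus <- VCorpus(VectorSource(docs))

# Dictionary of unstemmed words, here a character vector
# (per the answer, a corpus can also be used as the dictionary)
dictionary <- unique(unlist(strsplit(docs, "\\s+")))

stemmed <- tm_map(corpus, stemDocument)  # needs SnowballC

# Hypothetical helper: split one document into stems, complete each stem,
# keep the stem itself when no completion is found, and re-join the words.
completeDocument <- function(text, dictionary) {
  stems <- unlist(strsplit(text, "\\s+"))
  stems <- stems[nzchar(stems)]
  completed <- stemCompletion(stems, dictionary = dictionary)
  missing <- is.na(completed) | completed == ""
  completed[missing] <- stems[missing]
  paste(completed, collapse = " ")
}

# content_transformer keeps each element a TextDocument; extra arguments
# to tm_map are passed through to the helper.
completed <- tm_map(stemmed, content_transformer(completeDocument),
                    dictionary = dictionary)
print(content(completed[[1]]))
```

This avoids an explicit for loop: tm_map applies the helper to every document, and the result is still a corpus that can be passed to TermDocumentMatrix.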

Regarding "r - How to use the stemCompletion function (tm package) to complete a stemmed corpus from a dictionary", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35588882/
