
tm - stemCompletion not working

Reposted. Author: 行者123. Updated: 2023-12-04 05:12:23

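I am using the tm package for text analysis of repair data. I read the data into a data frame, converted it to a Corpus object, and applied various methods to clean it: lowercasing, stripWhitespace, removing stop words, and so on.

I kept a copy of the Corpus object to use for stemCompletion.

I then ran stemDocument through tm_map, and the words in my object were stemmed, with results as expected.

When I run stemCompletion through tm_map, however, it does not work and gives the error below: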

Error in UseMethod("words") : no applicable method for 'words' applied to an object of class "character"



Running traceback() shows the following steps:
> traceback()
9: FUN(X[[1L]], ...)
8: lapply(dictionary, words)
7: unlist(lapply(dictionary, words))
6: unique(unlist(lapply(dictionary, words)))
5: FUN(X[[1L]], ...)
4: lapply(X, FUN, ...)
3: mclapply(content(x), FUN, ...)
2: tm_map.VCorpus(c, stemCompletion, dictionary = c_orig)
1: tm_map(c, stemCompletion, dictionary = c_orig)

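How can I resolve this error?

Best answer

I received the same error when using tm v0.6. I suspect this happens because stemCompletion is not among the default transformations in this version of the tm package: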

>  getTransformations
function ()
c("removeNumbers", "removePunctuation", "removeWords", "stemDocument",
"stripWhitespace")
<environment: namespace:tm>

Now, the tolower function has the same problem, but it can be made to work by wrapping it in content_transformer. I tried a similar approach with stemCompletion but was not successful.
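As a rough illustration of the content_transformer wrapping mentioned above, the following sketch applies it to a single document. The sample text is made up for illustration, and the behaviour assumes a tm version (0.6 or later) that provides content_transformer.

```r
library(tm)

# A one-document example; the text is invented for illustration.
doc <- PlainTextDocument("Crude OIL Prices")

# content_transformer(tolower) returns a function that lowercases the
# document's content and hands back a document rather than a bare string.
lowered <- content_transformer(tolower)(doc)

print(class(lowered))
print(content(lowered))
```

The point of the wrapper is that the result stays a text document, so later tm_map steps still receive the class they expect.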

Note that even though stemCompletion is not a default transformation, it still works when stemmed words are passed to it manually:
> stemCompletion("compani",dictCorpus)
compani
"companies"

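To continue my work, I manually split each document in the corpus on spaces, fed the words through stemCompletion, and then pasted them back together with the following (clunky and not very elegant) function: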
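stemCompletion_mod <- function(x, dict = dictCorpus) {
  # Split the document on spaces, complete each stem, then rejoin the words
  stems <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
}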

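Here dictCorpus is just a copy of the cleaned corpus, taken before stemming. The extra stripWhitespace is specific to my corpus, but is probably harmless for a general corpus. You may want to change the type option from "shortest" to something else depending on your needs.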
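To give a feel for how the type option changes the result, here is a small sketch that calls stemCompletion directly with a plain character vector as the dictionary. The dictionary words and stems are made up for illustration and are not taken from the corpus in this answer.

```r
library(tm)

# Made-up dictionary and stems, for illustration only.
dict  <- c("company", "companies", "reduction", "reductions")
stems <- c("compan", "reduct")

# "shortest" picks the shortest dictionary word starting with each stem,
# "longest" picks the longest one.
shortest <- stemCompletion(stems, dictionary = dict, type = "shortest")
longest  <- stemCompletion(stems, dictionary = dict, type = "longest")

print(shortest)
print(longest)
```

Both calls return a character vector named by the input stems, so the choice of type only affects which matching dictionary word is selected.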

For a complete example, let's set up a dummy corpus using the crude data that ships with the tm package:
> data("crude")
> docs = Corpus(VectorSource(crude))
> docs <- tm_map(docs, content_transformer(tolower))
> docs <- tm_map(docs, removeNumbers)
> docs <- tm_map(docs, removeWords, stopwords("english"))
> docs <- tm_map(docs, removePunctuation)
> docs <- tm_map(docs, stripWhitespace)
> docs <- tm_map(docs, PlainTextDocument)
> dictCorpus <- docs
> docs <- tm_map(docs, stemDocument)

> # Define modified stemCompletion function
> stemCompletion_mod <- function(x,dict=dictCorpus) {
PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x)," ")),dictionary=dict, type="shortest"),sep="", collapse=" ")))
}

> # Original doc in crude data
> crude[[1]]
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
"The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
Reuter

> # Stemmed example in crude data
> docs[[1]]
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel
reduct bring post price west texa intermedi dlrs barrel copani said price reduct today
made light fall oil product price weak crude oil market compani spokeswoman said diamond
latest line us oil compani cut contract post price last two day cite weak oil market reuter

> # Stem completed example in crude data
> stemCompletion_mod(docs[[1]],dictCorpus)
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel
reduction brings posted price west texas intermediate dlrs barrel NA said price reduction today
made light fall oil product price weak crude oil market companies spokeswoman said diamond
latest line us oil companies cut contract posted price last two day cited weak oil market reuter

Note: this example is a little odd, because the misspelled word "copany" gets mapped along the way as "copany" -> "copani" -> "NA". I am not sure how to correct this...
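One possible workaround for the NA case, which is not part of the original answer, is to fall back to the stem whenever no completion is found. The sketch below uses only base R. The helper name fallback_to_stem is hypothetical, and the vectors are stand-ins that mirror the example above rather than values computed by tm here.

```r
# Hypothetical helper: keep the original stem wherever completion gave NA.
fallback_to_stem <- function(stems, completed) {
  completed <- unname(completed)
  ifelse(is.na(completed), stems, completed)
}

# Stand-in values mirroring the example above (not produced by tm here).
stems     <- c("compani", "copani", "reduct")
completed <- c(compani = "companies", copani = NA, reduct = "reduction")

result <- fallback_to_stem(stems, completed)
print(result)
```

With a helper like this, the misspelled word would come through as the stem "copani" instead of "NA". Whether that is preferable depends on how the text is used downstream.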

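To run stemCompletion_mod over the entire corpus, I just used sapply (or parSapply with the snow package).

Perhaps someone with more experience than me can suggest a simpler modification to get stemCompletion working in v0.6 of the tm package.

Regarding "tm - stemCompletion not working", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/25206049/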
