gpt4 book ai didi

r - 应用 tm 方法时一个变量的多个结果 "stemCompletion"

转载 作者:行者123 更新时间:2023-12-01 12:39:36 24 4
gpt4 key购买 nike

我有一个语料库,其中包含 3 个变量(ID、标题、摘要)的 15 个观察结果的日志数据。我使用 R Studio 从 .csv 文件中读取数据(每次观察一行)。在执行一些文本挖掘操作时,我在使用方法 stemCompletion 时遇到了一些麻烦。应用 stemCompletion 后,我观察到 .csv 的每个词干行都提供了 3 次结果。所有其他 tm 方法(例如 stemDocument)仅产生一个结果。我想知道为什么会发生这种情况以及如何解决这个问题

我使用了下面的代码:

data.corpus <- Corpus(DataframeSource(data))  
data.corpuscopy <- data.corpus
data.corpus <- tm_map(data.corpus, stemDocument)
data.corpus <- tm_map(data.corpus, stemCompletion, dictionary=data.corpuscopy)

应用stemDocument后的单个结果是例如

"> data.corpus[[1]]

physic environ sourc innov investig attribut innov space
investig physic space intersect innov innov relev attribut physic space innov reflect chang natur innov technolog advanc servic mean chang argu develop innov space similar embodi divers set valu collabor open sustain use literatur review interview benchmark examin relationship physic environ innov literatur review interview underlin innov communic human centr process result five attribut innov space present collabor enabl modifi smart attract reflect provid perspect challeng support innov creation develop physic space add conceptu develop innov space outlin physic space innov servic"

在使用 stemCompletion 之后,结果出现了 3 次:

"$`1`
physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result five attributes innovation space present collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge support innovation creation develop physical space addition conceptual develop innovation space outlines physical space innovation service
physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result five attributes innovation space present collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge support innovation creation develop physical space addition conceptual develop innovation space outlines physical space innovation service
physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result five attributes innovation space present collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge support innovation creation develop physical space addition conceptual develop innovation space outlines physical space innovation service"

下面是一个可重现的例子:

包含三个变量的三个观察值的 .csv 文件:

ID;Text A;Text B
1;Below is the first title;Innovation and Knowledge Management
2;And now the second Title;Organizational Performance and Learning are very important
3;The third title;Knowledge plays an important rule in organizations

下面是我用过的词干提取方法

data = read.csv2("Test.csv")
data[,2]=as.character(data[,2])
data[,3]=as.character(data[,3])

corpus <- Corpus(DataframeSource(data))
corpuscopy <- corpus
corpus <- tm_map(corpus, stemDocument)
corpus[[1]]

corpus <- tm_map(corpus, stemCompletion, dictionary=corpuscopy)
inspect(corpus[1:3])

在我看来,这取决于 .csv 中使用的变量数量,但我不知道为什么。

最佳答案

stemCompletion 函数似乎有些奇怪。在 tm 0.6 版本中如何使用 stemCompletion 并不明显。有一个很好的解决方法 here我已将其用于此答案。

首先,制作您拥有的 CSV 文件:

dat <- read.csv2( text = 
"ID;Text A;Text B
1;Below is the first title;Innovation and Knowledge Management
2;And now the second Title;Organizational Performance and Learning are very important
3;The third title;Knowledge plays an important rule in organizations")

write.csv2(dat, "Test.csv", row.names = FALSE)

读入,转换为语料库,并提取单词:

data = read.csv2("Test.csv")
data[,2]=as.character(data[,2])
data[,3]=as.character(data[,3])

corpus <- Corpus(DataframeSource(data))
corpuscopy <- corpus
library(SnowballC)
corpus <- tm_map(corpus, stemDocument)

看看它是否有效:

inspect(corpus)

<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
1
Below is the first titl
Innovat and Knowledg Manag

[[2]]
<<PlainTextDocument (metadata: 7)>>
2
And now the second Titl
Organiz Perform and Learn are veri import

[[3]]
<<PlainTextDocument (metadata: 7)>>
3
The third titl
Knowledg play an import rule in organ

这是让 stemCompletion 工作的好方法:

stemCompletion_mod <- function(x,dict=corpuscopy) {
PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x)," ")),dictionary=dict, type="shortest"),sep="", collapse=" ")))
}

检查输出以查看词干是否已完成:

lapply(corpus, stemCompletion_mod)

[[1]]
<<PlainTextDocument (metadata: 7)>>
1 Below is the first title Innovation and Knowledge Management

[[2]]
<<PlainTextDocument (metadata: 7)>>
2 And now the second Title Organizational Performance and Learning are NA important

[[3]]
<<PlainTextDocument (metadata: 7)>>
3 The third title Knowledge plays an important rule in organizations

成功!

关于r - 应用 tm 方法时一个变量的多个结果 "stemCompletion",我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/26204656/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com