
Removing stopwords from a user-defined corpus in R

Reposted. Author: 行者123. Updated: 2023-12-04 18:08:03

I have a set of documents:

documents = c("She had toast for breakfast",
"The coffee this morning was excellent",
"For lunch let's all have pancakes",
"Later in the day, there will be more talks",
"The talks on the first day were great",
"The second day should have good presentations too")

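I want to remove the stopwords from this set of documents. I have already removed punctuation and converted the text to lower case, using: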
documents = tolower(documents) #make it lower case
documents = gsub('[[:punct:]]', '', documents) #remove punctuation

First I convert it to a corpus object:
library(tm)
documents <- Corpus(VectorSource(documents))

Then I try to remove the stopwords:
documents = tm_map(documents, removeWords, stopwords('english')) #remove stopwords

But the last line produces the following error:

THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC() to debug.

This has already been asked here, but no answer was given. What does this error mean?

EDIT

Yes, I am using the tm package.

Here is the output of sessionInfo():

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

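Best answer

When I run into problems with tm, I usually end up just editing the raw text.

Removing words is a bit awkward, but you can paste together a regular expression from tm's stopword list.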
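As for the error itself, the message appears to come from macOS CoreFoundation complaining that it is being used inside a forked child process. One plausible cause, and this is an assumption about the installed tm version rather than something confirmed here, is that tm_map() hands its work to parallel::mclapply(), which forks. A minimal sketch of limiting mclapply() to a single core through the mc.cores option, so that it runs serially without forking:

```r
library(parallel)  # ships with base R

# Assumption: the installed tm version runs tm_map() through mclapply().
# mclapply() takes its default worker count from the "mc.cores" option,
# and with a single core it falls back to a plain lapply() with no fork.
options(mc.cores = 1L)

docs <- c("she had toast for breakfast",
          "the coffee this morning was excellent")

# Runs serially in the current R process
upper <- mclapply(docs, toupper)
```

With the option set, rerunning the original tm_map() call would be the next thing to try. Whether that clears the error depends on the tm version in use.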

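library(tm)  # provides stopwords()
# Join the stopwords into one alternation, each wrapped in word boundaries
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
# Apply to the plain character vector (before converting it to a Corpus)
documents = stringr::str_replace_all(documents, stopwords_regex, '')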

> documents
[1] " toast breakfast" " coffee morning excellent"
[3] " lunch lets pancakes" "later day will talks"
[5] " talks first day great" " second day good presentations "
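The output above still carries stray spaces where words were removed, and "lets" survives because the apostrophe in "let's" was stripped from the text before matching. Here is a minimal base R sketch of one way to tidy both. It assumes a short illustrative stopword vector: my_stopwords and remove_stopwords are hypothetical names, and my_stopwords stands in for stopwords('en').

```r
# Hypothetical stand-in for tm's stopwords('en'); swap in the real list if tm is available
my_stopwords <- c("she", "had", "for", "the", "this", "was", "let's", "all", "have")

# Hypothetical helper: drop stopwords from a lower-cased, punctuation-free character vector
remove_stopwords <- function(x, words) {
  # Strip punctuation from the stopwords too, so "let's" also matches "lets"
  words <- unique(gsub("[[:punct:]]", "", words))
  # One alternation wrapped in word boundaries
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  # Collapse runs of whitespace, then trim both ends
  trimws(gsub("\\s+", " ", x))
}

docs <- c("she had toast for breakfast",
          "for lunch lets all have pancakes")
cleaned <- remove_stopwords(docs, my_stopwords)
```

For these two inputs, cleaned holds "toast breakfast" and "lunch pancakes". An alternative is to remove stopwords before stripping punctuation, which keeps contractions such as "let's" intact for matching.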

On removing stopwords from a user-defined corpus in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37526550/
