
r - How to remove words from a word cloud?


I'm creating a word cloud with the wordcloud package in R, following the "Word Cloud in R" example.

I can do this easily enough, but I want to remove certain words from the cloud. I have the words in a file (an Excel file, actually, though I could change that), and I want to exclude all of them, several hundred in total. Any suggestions?

require(XML)
require(tm)
require(wordcloud)
require(RColorBrewer)

# build a corpus from the sixth column of the merged data frame
ap.corpus <- Corpus(DataframeSource(data.frame(as.character(data.merged2[, 6]))))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, tolower)  # newer tm versions need content_transformer(tolower)
ap.corpus <- tm_map(ap.corpus, function(x) removeWords(x, stopwords("english")))

# term-document matrix and word frequencies sorted in descending order
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m), decreasing = TRUE)
ap.d <- data.frame(word = names(ap.v), freq = ap.v)
table(ap.d$freq)
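
The snippet above stops at the frequency table. A minimal sketch of how the cloud itself might then be drawn from ap.d is shown below; the palette choice and the min.freq / max.words cutoffs are assumptions for illustration, not part of the original question.

# draw the cloud from the frequency table built above; the palette and the
# cutoffs are illustrative choices only
pal <- brewer.pal(8, "Dark2")
wordcloud(words = ap.d$word, freq = ap.d$freq, min.freq = 2,
          max.words = 100, random.order = FALSE, colors = pal)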

Best Answer

@Tyler Rinker has already given the answer: just add another removeWords() line. But here is a bit more detail.
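
In its simplest form, that extra line is just a second pass over the corpus with your own character vector of words; a minimal sketch (the vector here is made up purely for illustration) might look like this:

# one extra removeWords() pass with a custom vector; these words are placeholders
my.stopwords <- c("peanut", "cashew", "walnut")
ap.corpus <- tm_map(ap.corpus, removeWords, my.stopwords)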

Suppose your Excel file is called nuts.xls and has a column of words like this:

stopwords
peanut
cashew
walnut
almond
macadamia

In R you could then proceed as follows:

library(gdata) # package with an xls import function
library(tm)
# now load the excel file with the custom stoplist; note a few of the arguments here
# to clean the data by removing spaces that excel seems to insert and to prevent it from
# importing the characters as factors. You can use any args from read.table(), which is
# handy
nuts <- read.xls("nuts.xls", header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE)

# now make some words to build a corpus to test a two-step stopword removal process...
words1 <- c("peanut, cashew, walnut, macadamia, apple, pear, orange, lime, mandarin, and, or, but")
words2 <- c("peanut, cashew, walnut, almond, apple, pear, orange, lime, mandarin, if, then, on")
words3 <- c("peanut, walnut, almond, macadamia, apple, pear, orange, lime, mandarin, it, as, an")
words.all <- data.frame(rbind(words1, words2, words3))
words.corpus <- Corpus(DataframeSource(words.all))

# now remove the standard list of stopwords, like you've already worked out
words.corpus.nostopwords <- tm_map(words.corpus, removeWords, stopwords("english"))
# now remove the second set of stopwords, this time your custom set from the excel file,
# note that it has to be a reference to a character vector containing the custom stopwords
words.corpus.nostopwords <- tm_map(words.corpus.nostopwords, removeWords, nuts$stopwords)

# have a look to see if it worked
inspect(words.corpus.nostopwords)
A corpus with 3 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID

$words1
, , , , apple, pear, orange, lime, mandarin, , ,

$words2
, , , , apple, pear, orange, lime, mandarin, , ,

$words3
, , , , apple, pear, orange, lime, mandarin, , ,

Success! The standard stopwords are gone, and so are the words from the custom list in the Excel file. No doubt there are other ways to do this.
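
One such alternative, sketched below under the assumption that ap.d and nuts have been built as shown earlier in this thread, is to leave the corpus alone and instead drop the custom stopwords from the frequency data frame just before plotting.

# alternative: filter the frequency table rather than the corpus
ap.d.clean <- ap.d[!ap.d$word %in% nuts$stopwords, ]
wordcloud(ap.d.clean$word, ap.d.clean$freq, random.order = FALSE)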

Regarding "r - How to remove words from a word cloud?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/8619941/
