r - LDA with topicmodels: how can I see which topics different documents belong to?

Reposted. Author: 行者123. Updated: 2023-12-03 12:36:07

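I am using LDA from the topicmodels package. I have run it on about 30,000 documents, obtained 30 topics, and extracted the top 10 words for each topic, which look very good. But I would like to see which documents most likely belong to which topic. How can I do that?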

library(tm)  # provides Corpus(), tm_map(), DocumentTermMatrix()

myCorpus <- Corpus(VectorSource(userbios$bio))
docs <- userbios$twitter_id
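# base functions such as tolower must be wrapped in content_transformer() in recent tm versions
myCorpus <- tm_map(myCorpus, content_transformer(tolower))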
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
# strip URL remnants (punctuation has already been removed above)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
myStopwords <- c("twitter", "tweets", "tweet", "tweeting", "account")

# remove stopwords from corpus
myCorpus <- tm_map(myCorpus, removeWords, stopwords('english'))
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)


# stem words
# require(rJava) # needed for stemming function
# library(Snowball) # also needed for stemming function
# a <- tm_map(myCorpus, stemDocument, language = "english")

myDtm <- DocumentTermMatrix(myCorpus, control = list(wordLengths=c(2,Inf), weighting=weightTf))
myDtm2 <- removeSparseTerms(myDtm, sparse=0.85)
dtm <- myDtm2

library(topicmodels)

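# LDA() fails on documents with no terms, so drop empty rows
rowTotals <- apply(dtm, 1, sum)
dtm2 <- dtm[rowTotals > 0, ]  # the trailing comma selects rows, keeping all columns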
dim(dtm2)
dtm_LDA <- LDA(dtm2, 30)

Best answer

How about this, using a built-in dataset? It shows which documents most likely belong to which topic.

library(topicmodels)
data("AssociatedPress", package = "topicmodels")

k <- 5 # set number of topics
# generate model
lda <- LDA(AssociatedPress[1:20,], control = list(alpha = 0.1), k)
# now we have a topic model with 20 docs and five topics

# make a data frame with topics as cols, docs as rows and
# cell values as posterior topic distribution for each document
gammaDF <- as.data.frame(lda@gamma)
names(gammaDF) <- c(1:k)
# inspect...
gammaDF
              1            2            3            4            5
1  8.979807e-05 8.979807e-05 9.996408e-01 8.979807e-05 8.979807e-05
2  8.714836e-05 8.714836e-05 8.714836e-05 8.714836e-05 9.996514e-01
3  9.261396e-05 9.996295e-01 9.261396e-05 9.261396e-05 9.261396e-05
4  9.995437e-01 1.140774e-04 1.140774e-04 1.140774e-04 1.140774e-04
5  3.573528e-04 3.573528e-04 9.985706e-01 3.573528e-04 3.573528e-04
6  5.610659e-05 5.610659e-05 5.610659e-05 5.610659e-05 9.997756e-01
7  9.994345e-01 1.413820e-04 1.413820e-04 1.413820e-04 1.413820e-04
8  4.286702e-04 4.286702e-04 4.286702e-04 9.982853e-01 4.286702e-04
9  3.319338e-03 3.319338e-03 9.867226e-01 3.319338e-03 3.319338e-03
10 2.034781e-04 2.034781e-04 9.991861e-01 2.034781e-04 2.034781e-04
11 4.810342e-04 9.980759e-01 4.810342e-04 4.810342e-04 4.810342e-04
12 2.651256e-04 9.989395e-01 2.651256e-04 2.651256e-04 2.651256e-04
13 1.430945e-04 1.430945e-04 1.430945e-04 9.994276e-01 1.430945e-04
14 8.402940e-04 8.402940e-04 8.402940e-04 9.966388e-01 8.402940e-04
15 8.404830e-05 9.996638e-01 8.404830e-05 8.404830e-05 8.404830e-05
16 1.903630e-04 9.992385e-01 1.903630e-04 1.903630e-04 1.903630e-04
17 1.297372e-04 1.297372e-04 9.994811e-01 1.297372e-04 1.297372e-04
18 6.906241e-05 6.906241e-05 6.906241e-05 9.997238e-01 6.906241e-05
19 1.242780e-04 1.242780e-04 1.242780e-04 1.242780e-04 9.995029e-01
20 9.997361e-01 6.597684e-05 6.597684e-05 6.597684e-05 6.597684e-05


# Now for each doc, find just the top-ranked topic
# which.max() returns a single index even if two topics tie for the maximum
toptopics <- data.frame(document = row.names(gammaDF),
                        topic = apply(gammaDF, 1, function(x) names(gammaDF)[which.max(x)]))
# inspect...
toptopics
   document topic
1         1     2
2         2     5
3         3     1
4         4     4
5         5     4
6         6     5
7         7     2
8         8     4
9         9     1
10       10     2
11       11     3
12       12     1
13       13     1
14       14     2
15       15     1
16       16     4
17       17     4
18       18     3
19       19     4
20       20     3
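
As an alternative to building the table by hand, topicmodels also ships accessor functions for the same information. The sketch below assumes the same AssociatedPress subset as above; the `seed` entry in `control` is an addition not present in the answer, included only so that repeated runs give the same fit. The fit starts from a random initialisation, so separate runs can number the topics differently, which may be why the two printouts above do not line up with each other.

```r
library(topicmodels)
data("AssociatedPress", package = "topicmodels")

k <- 5
# seed is an assumption added here for reproducibility; it is not in the answer
lda <- LDA(AssociatedPress[1:20, ], k = k,
           control = list(alpha = 0.1, seed = 1))

# most likely topic for each document (one integer per document)
top_topic <- topics(lda, 1)

# document-by-topic posterior probabilities; these should match lda@gamma
doc_topic <- posterior(lda)$topics

# the five highest-probability terms of each topic, one column per topic
top_terms <- terms(lda, 5)
```

Here `top_topic` plays the role of the `topic` column in `toptopics`, and `doc_topic` the role of `gammaDF`.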


Is that what you want to do?
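
To tie this back to the setup in the question: once empty rows have been dropped from the document-term matrix, the vector of document IDs has to be filtered with the same condition, or the rows of gamma will no longer line up with the IDs. The following is a minimal base-R sketch of that bookkeeping. All values in it (`ids`, `keep`, `gamma`) are made-up stand-ins, not data from the question, and `top_docs()` is a hypothetical helper name.

```r
# Hypothetical stand-ins: six document IDs, a mask marking which rows
# survived the empty-row filter (e.g. rowTotals > 0), and a gamma matrix
# with one row per surviving document and one column per topic.
ids      <- c("u01", "u02", "u03", "u04", "u05", "u06")
keep     <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
kept_ids <- ids[keep]  # apply the same filter that was applied to the matrix

gamma <- matrix(c(0.90, 0.05, 0.05,
                  0.10, 0.80, 0.10,
                  0.20, 0.10, 0.70,
                  0.60, 0.30, 0.10,
                  0.05, 0.15, 0.80),
                ncol = 3, byrow = TRUE)
stopifnot(nrow(gamma) == length(kept_ids))

# most likely topic and its probability for each document
assignments <- data.frame(
  document    = kept_ids,
  topic       = max.col(gamma, ties.method = "first"),
  probability = apply(gamma, 1, max)
)

# the n documents with the highest probability for a given topic
top_docs <- function(topic, n = 2) {
  ord <- order(gamma[, topic], decreasing = TRUE)
  kept_ids[head(ord, n)]
}
```

With a fitted model, `gamma` would be `lda@gamma` and `keep` would be `rowTotals > 0` from the question. Calling `top_docs(1)` then lists the IDs most strongly associated with topic 1.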

Hat tip for this answer: https://stat.ethz.ch/pipermail/r-help/2010-August/247706.html

Regarding "r - LDA with topicmodels: how can I see which topics different documents belong to?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/14875493/
