
r - Can text2vec and topicmodels generate similar topics with suitable parameter settings for LDA?

Reposted. Author: 行者123. Updated: 2023-12-04 10:10:24

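I was wondering how the results of different packages (and hence algorithms) differ, and whether parameters can be set so that they produce similar topics. I looked at the packages text2vec and topicmodels in particular.

I used the code below to compare 10 topics generated with these packages (see the code section for the terms). I could not manage to generate sets of topics with similar meaning. For example, topic 10 from text2vec has something to do with "police", whereas none of the topics produced by topicmodels mentions "police" or similar terms. Likewise, I could not identify a counterpart among the text2vec topics for topic 5 from topicmodels, which is about "life-love-family-war".

I am a beginner with LDA, so my understanding may sound naive to experienced programmers. Intuitively, however, one would assume that it should be possible to generate sets of topics with similar meaning in order to demonstrate the validity/robustness of the results. Not necessarily the exact same set of terms, of course, but term lists that address similar topics.

Maybe the issue is just that my human interpretation of these term lists is not good enough to capture the similarities, but maybe there are parameters that would increase the similarity for human interpretation. Can someone guide me on how to set the parameters to achieve this, or otherwise provide explanations or hints on suitable resources to improve my understanding of the matter?

Here are some issues that might be relevant: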

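  • I am aware that text2vec does not use standard Gibbs sampling but WarpLDA, which is already a difference in algorithm compared to topicmodels. If my understanding is correct, the priors alpha and delta used in topicmodels correspond to doc_topic_prior and topic_word_prior in text2vec, respectively.
  • Furthermore, in post-processing text2vec allows adjusting lambda to sort the terms of a topic based on their frequency. I have not yet understood how terms are sorted in topicmodels. Is it equivalent to setting lambda=1? (I tried different lambdas between 0 and 1 without getting similar topics.)
  • Another issue is that it seems difficult to produce a fully reproducible example even when setting a seed (see, e.g., this question). This is not my direct question, but it might make the question harder to answer.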
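
The effect of lambda mentioned in the second bullet can be illustrated with a small base R sketch. It assumes the LDAvis-style relevance score, lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)), which is, as far as I understand, the definition that get_top_words in text2vec follows. The toy matrix `phi`, the marginal `p_w` and the helper `top_terms_by_relevance` are invented for illustration and are not part of either package.

```r
# Toy topic-word distribution: rows = topics, columns = terms (each row sums to 1).
phi <- rbind(
  c(film = 0.50, police = 0.30, love = 0.20),
  c(film = 0.60, police = 0.05, love = 0.35)
)
# Marginal term probability, assuming both topics are equally likely.
p_w <- colMeans(phi)

# Hypothetical helper: rank the terms of one topic by LDAvis-style relevance.
top_terms_by_relevance <- function(phi, p_w, topic, lambda = 1) {
  stopifnot(lambda >= 0, lambda <= 1)
  p_wt <- phi[topic, ]
  relevance <- lambda * log(p_wt) + (1 - lambda) * log(p_wt / p_w)
  names(sort(relevance, decreasing = TRUE))
}

top_terms_by_relevance(phi, p_w, topic = 1, lambda = 1)  # ranking by probability only
top_terms_by_relevance(phi, p_w, topic = 1, lambda = 0)  # ranking by lift only
```

With lambda = 1 the score reduces to ordering by p(w|t), so a frequent but unspecific term such as "film" stays on top; with lambda = 0 the topic-specific term "police" moves to the front. Ordering by probability alone is, as far as I can tell, what terms() in topicmodels does.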

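Apologies for the lengthy question, and thanks in advance for any help or suggestions.

Update 2: I have moved the content of my first update into an answer that is based on a more complete analysis.

Update: Following the helpful comment by Dmitriy Selivanov, the creator of the text2vec package, I can confirm that setting lambda=1 increases the similarity of topics between the term lists produced by the two packages.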

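Furthermore, I took a closer look at the differences between the term lists produced by the two packages via a quick check of length(setdiff()) and length(intersect()) across topics (see the code below). This rough check shows that text2vec discards several terms per topic, possibly via a probability threshold for individual topics. topicmodels keeps all terms for all topics. This explains part of the difference in meaning that can be derived (by humans) from the term lists.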
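
A minimal sketch of such an overlap check on toy data may make the idea concrete. The term vectors `a` and `b` are invented, and `jaccard_terms` is a hypothetical helper that is not part of either package. NA entries are dropped before comparing, as in the setdiff/intersect check described above.

```r
# Two made-up term lists; NA marks a term that is missing from a topic.
a <- c("police", "cop", "crime", NA, NA)
b <- c("crime", "police", "family", "war", "love")

# Hypothetical helper: Jaccard similarity of two term vectors, ignoring NAs.
jaccard_terms <- function(x, y) {
  x <- unique(x[!is.na(x)])
  y <- unique(y[!is.na(y)])
  length(intersect(x, y)) / length(union(x, y))
}

length(setdiff(b, a[!is.na(a)]))    # number of terms that appear only in b
length(intersect(a[!is.na(a)], b))  # number of shared terms
jaccard_terms(a, b)                 # shared terms relative to all distinct terms
```

A single ratio per topic pair is easier to scan than two count matrices, although it hides which side the missing terms come from.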

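As mentioned above, generating a reproducible example seems difficult, so I have not adapted all output examples in the code below. Since the runtime is short, anyone can check the results on their own system.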

library(text2vec)
library(topicmodels)
library(slam) #to convert dtm to simple triplet matrix for topicmodels
library(magrittr) #for %>%, in case it is not already attached

ntopics <- 10
alphaprior <- 0.1
deltaprior <- 0.001
niter <- 1000
convtol <- 0.001
set.seed(0) #for text2vec
seedpar <- 0 #for topicmodels

#Generate document term matrix with text2vec
tokens = movie_review$review[1:1000] %>%
tolower %>%
word_tokenizer

it = itoken(tokens, ids = movie_review$id[1:1000], progressbar = FALSE)

vocab = create_vocabulary(it) %>%
prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)

vectorizer = vocab_vectorizer(vocab)

dtm = create_dtm(it, vectorizer, type = "dgTMatrix")


#LDA model with text2vec
lda_model = text2vec::LDA$new(n_topics = ntopics
,doc_topic_prior = alphaprior
,topic_word_prior = deltaprior
)

doc_topic_distr = lda_model$fit_transform(x = dtm
,n_iter = niter
,convergence_tol = convtol
,n_check_convergence = 25
,progressbar = FALSE
)


#LDA model with topicmodels
#settings are passed via the `control` argument; to my knowledge the Gibbs
#control class has no `tol` slot and no "seeded" initialization, so convtol is not used here
ldatopicmodels <- LDA(as.simple_triplet_matrix(dtm), k = ntopics, method = "Gibbs",
control = list(burnin = 100
,delta = deltaprior
,alpha = alphaprior
,iter = niter
,keep = 50
,seed = seedpar
)
)

#show top 15 words
lda_model$get_top_words(n = 15, topic_number = c(1:10), lambda = 0.3)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] "finally" "men" "know" "video" "10" "king" "five" "our" "child" "cop"
# [2,] "re" "always" "ve" "1" "doesn" "match" "atmosphere" "husband" "later" "themselves"
# [3,] "three" "lost" "got" "head" "zombie" "lee" "mr" "comedy" "parents" "mary"
# [4,] "m" "team" "say" "girls" "message" "song" "de" "seem" "sexual" "average"
# [5,] "gay" "here" "d" "camera" "start" "musical" "may" "man" "murder" "scenes"
# [6,] "kids" "within" "funny" "kill" "3" "four" "especially" "problem" "tale" "police"
# [7,] "sort" "score" "want" "stupid" "zombies" "dance" "quality" "friends" "television" "appears"
# [8,] "few" "thriller" "movies" "talking" "movies" "action" "public" "given" "okay" "trying"
# [9,] "bit" "surprise" "let" "hard" "ask" "fun" "events" "crime" "cover" "waiting"
# [10,] "hot" "own" "thinking" "horrible" "won" "tony" "u" "special" "stan" "lewis"
# [11,] "die" "political" "nice" "stay" "open" "twist" "kelly" "through" "uses" "imdb"
# [12,] "credits" "success" "never" "back" "davis" "killer" "novel" "world" "order" "candy"
# [13,] "two" "does" "bunch" "didn" "completely" "ending" "copy" "show" "strange" "name"
# [14,] "otherwise" "beauty" "hilarious" "room" "love" "dancing" "japanese" "new" "female" "low"
# [15,] "need" "brilliant" "lot" "minutes" "away" "convincing" "far" "mostly" "girl" "killing"

terms(ldatopicmodels, 15)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# [1,] "show" "where" "horror" "did" "life" "such" "m" "films" "man" "seen"
# [2,] "years" "minutes" "pretty" "10" "young" "character" "something" "music" "new" "movies"
# [3,] "old" "gets" "best" "now" "through" "while" "re" "actors" "two" "plot"
# [4,] "every" "guy" "ending" "why" "love" "those" "going" "role" "though" "better"
# [5,] "series" "another" "bit" "saw" "woman" "does" "things" "performance" "big" "worst"
# [6,] "funny" "around" "quite" "didn" "us" "seems" "want" "between" "back" "interesting"
# [7,] "comedy" "nothing" "little" "say" "real" "book" "thing" "love" "action" "your"
# [8,] "again" "down" "actually" "thought" "our" "may" "know" "play" "shot" "money"
# [9,] "tv" "take" "house" "still" "war" "work" "ve" "line" "together" "hard"
# [10,] "watching" "these" "however" "end" "father" "far" "here" "actor" "against" "poor"
# [11,] "cast" "fun" "cast" "got" "find" "scenes" "doesn" "star" "title" "least"
# [12,] "long" "night" "entertaining" "2" "human" "both" "look" "never" "go" "say"
# [13,] "through" "scene" "must" "am" "shows" "yet" "isn" "played" "city" "director"
# [14,] "once" "back" "each" "done" "family" "audience" "anything" "hollywood" "came" "probably"
# [15,] "watched" "dead" "makes" "3" "mother" "almost" "enough" "always" "match" "video"

#UPDATE

#number of terms in each model is the same
length(ldatopicmodels@terms)
# [1] 2170
nrow(vocab)
# [1] 2170

#number of NA entries for termlist of first topic differs
sum(is.na(
lda_model$get_top_words(n = nrow(vocab), topic_number = c(1:10), lambda = 1)[,1]
)
)
#[1] 1778

sum(is.na(
terms(ldatopicmodels, length(ldatopicmodels@terms))
)
)
#[1] 0


#function to check number of terms that differ between two sets of topic collections (excluding NAs)
lengthsetdiff <- function(x, y) {

apply(x, 2, function(i) {

apply(y, 2, function(j) {

length(setdiff(i[!is.na(i)],j[!is.na(j)]))
})

})

}


#apply the check
termstopicmodels <- terms(ldatopicmodels,length(ldatopicmodels@terms))
termstext2vec <- lda_model$get_top_words(n = nrow(vocab), topic_number = c(1:10), lambda = 1)


lengthsetdiff(termstopicmodels,
termstopicmodels)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# Topic 1 0 0 0 0 0 0 0 0 0 0
# Topic 2 0 0 0 0 0 0 0 0 0 0
# Topic 3 0 0 0 0 0 0 0 0 0 0
# Topic 4 0 0 0 0 0 0 0 0 0 0
# Topic 5 0 0 0 0 0 0 0 0 0 0
# Topic 6 0 0 0 0 0 0 0 0 0 0
# Topic 7 0 0 0 0 0 0 0 0 0 0
# Topic 8 0 0 0 0 0 0 0 0 0 0
# Topic 9 0 0 0 0 0 0 0 0 0 0
# Topic 10 0 0 0 0 0 0 0 0 0 0

lengthsetdiff(termstext2vec,
termstext2vec)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 0 340 318 335 292 309 320 355 294 322
# [2,] 355 0 321 343 292 319 311 346 302 339
# [3,] 350 338 0 316 286 309 311 358 318 322
# [4,] 346 339 295 0 297 310 301 335 309 332
# [5,] 345 330 307 339 0 310 310 354 309 333
# [6,] 350 345 318 340 298 0 311 342 308 325
# [7,] 366 342 325 336 303 316 0 364 311 325
# [8,] 355 331 326 324 301 301 318 0 311 335
# [9,] 336 329 328 340 298 309 307 353 0 314
# [10,] 342 344 310 341 300 304 299 355 292 0

lengthsetdiff(termstopicmodels,
termstext2vec)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# [1,] 1778 1778 1778 1778 1778 1778 1778 1778 1778 1778
# [2,] 1793 1793 1793 1793 1793 1793 1793 1793 1793 1793
# [3,] 1810 1810 1810 1810 1810 1810 1810 1810 1810 1810
# [4,] 1789 1789 1789 1789 1789 1789 1789 1789 1789 1789
# [5,] 1831 1831 1831 1831 1831 1831 1831 1831 1831 1831
# [6,] 1819 1819 1819 1819 1819 1819 1819 1819 1819 1819
# [7,] 1824 1824 1824 1824 1824 1824 1824 1824 1824 1824
# [8,] 1778 1778 1778 1778 1778 1778 1778 1778 1778 1778
# [9,] 1820 1820 1820 1820 1820 1820 1820 1820 1820 1820
# [10,] 1798 1798 1798 1798 1798 1798 1798 1798 1798 1798

lengthsetdiff(termstext2vec,
termstopicmodels)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# Topic 1 0 0 0 0 0 0 0 0 0 0
# Topic 2 0 0 0 0 0 0 0 0 0 0
# Topic 3 0 0 0 0 0 0 0 0 0 0
# Topic 4 0 0 0 0 0 0 0 0 0 0
# Topic 5 0 0 0 0 0 0 0 0 0 0
# Topic 6 0 0 0 0 0 0 0 0 0 0
# Topic 7 0 0 0 0 0 0 0 0 0 0
# Topic 8 0 0 0 0 0 0 0 0 0 0
# Topic 9 0 0 0 0 0 0 0 0 0 0
# Topic 10 0 0 0 0 0 0 0 0 0 0

#also the intersection can be checked between the two sets
lengthintersect <- function(x, y) {

apply(x, 2, function(i) {

apply(y, 2, function(j) {

length(intersect(i[!is.na(i)], j[!is.na(j)]))
})

})

}

lengthintersect(termstopicmodels,
termstext2vec)

# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# [1,] 392 392 392 392 392 392 392 392 392 392
# [2,] 377 377 377 377 377 377 377 377 377 377
# [3,] 360 360 360 360 360 360 360 360 360 360
# [4,] 381 381 381 381 381 381 381 381 381 381
# [5,] 339 339 339 339 339 339 339 339 339 339
# [6,] 351 351 351 351 351 351 351 351 351 351
# [7,] 346 346 346 346 346 346 346 346 346 346
# [8,] 392 392 392 392 392 392 392 392 392 392
# [9,] 350 350 350 350 350 350 350 350 350 350
# [10,] 372 372 372 372 372 372 372 372 372 372

Best answer

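After updating my question with some comparison results, I was still interested in more detail. I therefore ran LDA models on the complete movie_review dataset included in text2vec (5000 docs). To produce halfway realistic results, I also introduced some gentle pre-processing and stopword removal. (Sorry for the long code example below.)

My conclusion is that some of the "good" topics (from a subjective standpoint) produced by the two packages are comparable to a certain extent. The last three topics in the example below, in particular, are not very good and were hard to compare. However, looking at similar topics across the two packages produces different (subjective) associations for each topic. Hence, standard Gibbs sampling and the WarpLDA algorithm seem to capture similar topical regions for the given data, but the "mood" expressed in the topics differs.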
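
Pairing topics by eye is subjective, so a rough automatic pairing by top-term overlap can serve as a cross-check. The following base R sketch uses invented top-term matrices (`top_a`, `top_b`) and a hypothetical helper `match_topics` that assigns topics greedily by the largest overlap. It is only one possible heuristic; greedy matching is not guaranteed to find the best overall assignment.

```r
# Made-up top-term matrices with one column per topic,
# shaped like the output of terms() or get_top_words().
top_a <- cbind(A1 = c("war", "american", "history"),
               A2 = c("show", "series", "episode"),
               A3 = c("horror", "killer", "blood"))
top_b <- cbind(B1 = c("series", "show", "dvd"),
               B2 = c("killer", "horror", "house"),
               B3 = c("war", "world", "american"))

# Hypothetical helper: count shared terms, then match topics greedily one-to-one.
match_topics <- function(x, y) {
  # overlap[i, j] = number of terms shared by topic i of x and topic j of y
  overlap <- sapply(colnames(y), function(j)
    sapply(colnames(x), function(i) length(intersect(x[, i], y[, j]))))
  pairs <- character(0)
  while (length(pairs) < min(dim(overlap))) {
    best <- which(overlap == max(overlap), arr.ind = TRUE)[1, ]
    pairs[rownames(overlap)[best[1]]] <- colnames(overlap)[best[2]]
    overlap[best[1], ] <- -1  # exclude the matched row ...
    overlap[, best[2]] <- -1  # ... and the matched column
  }
  pairs
}

match_topics(top_a, top_b)
```

The result is a named character vector that maps each topic of the first model to its matched topic in the second. Applied to the first 20 rows of the two real term matrices, the same idea would replace the manual column reordering with a reproducible rule.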

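I think the main reason for the differences is that the WarpLDA algorithm seems to discard terms and introduces NA values into the beta matrix (the term-topic distribution); see the example below. Its faster convergence therefore seems to be achieved by sacrificing completeness.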
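
One way in which terms could drop out of a ranked term list is sketched below. If a topic-word distribution is computed from raw assignment counts without adding the topic-word prior, every term that was never assigned to a topic gets probability 0 and a log-probability of -Inf. Whether this is what actually happens inside text2vec is an assumption made here for illustration, not something the analysis confirms, and the count matrix is invented.

```r
# Invented topic-term assignment counts: rows = topics, columns = terms.
counts <- rbind(c(film = 8, police = 2, love = 0),
                c(film = 5, police = 0, love = 5))
delta <- 0.001  # topic-word prior, same value as deltaprior in the code

# Distribution from raw counts versus counts smoothed with the prior.
unsmoothed <- counts / rowSums(counts)
smoothed   <- (counts + delta) / rowSums(counts + delta)

# Without smoothing, zero-count terms have no finite log-probability ...
is.infinite(log(unsmoothed))
# ... whereas the smoothed estimate keeps every term in every topic.
all(smoothed > 0)
```

Under this assumption the missing terms would be those with zero counts in a topic, which carry almost no probability mass in a smoothed model anyway, so the top of each term list would be largely unaffected.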

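I do not want to judge subjectively which topics are "better"; I leave that to you.

An important limitation of this analysis is that I have not (yet) checked the results for an optimal number of topics; I only used k=10. The comparability of the topics might therefore increase for an optimal k, and the quality would improve in any case, which might also improve the "mood". (The optimal k may in turn differ depending on the measure used to find k.)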

library(text2vec)
library(topicmodels)
library(slam) #to convert dtm to simple triplet matrix for topicmodels
library(LDAvis)
library(tm) #for stopwords only
library(magrittr) #for %>%, in case it is not already attached

ntopics <- 10
alphaprior <- 0.1
deltaprior <- 0.001
niter <- 1000
convtol <- 0.001
set.seed(0) #for text2vec
seedpar <- 0 #for topicmodels

docs <- movie_review$review

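preproc_fun <- function(x) {
  x <- tolower(x)
  x <- gsub("\\W+", " ", x, perl = TRUE) #replace non-word characters with a space
  x <- gsub("\\d+", " ", x, perl = TRUE) #replace digits with a space
  x <- gsub("\\b\\w{1,2}\\b", "", x, perl = TRUE) #drop words of one or two characters
  x <- gsub("\\s+", " ", x, perl = TRUE) #collapse runs of whitespace
  gsub("^\\s|\\s$", "", x, perl = TRUE) #trim leading and trailing whitespace
}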

#Generate document term matrix with text2vec
tokens = docs %>%
preproc_fun %>%
word_tokenizer

it = itoken(tokens, ids = movie_review$id, progressbar = FALSE)

vocab = create_vocabulary(it, stopwords = tm::stopwords()) %>%
prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)

vectorizer = vocab_vectorizer(vocab)

dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
dim(dtm)
# [1] 5000 7407

#LDA model with text2vec
ldatext2vec = text2vec::LDA$new(n_topics = ntopics
,doc_topic_prior = alphaprior
,topic_word_prior = deltaprior
)

doc_topic_distr = ldatext2vec$fit_transform(x = dtm
,n_iter = niter
,convergence_tol = convtol
,n_check_convergence = 25
,progressbar = FALSE
)


control_Gibbs_topicmodels <- list(
alpha = alphaprior
,delta = deltaprior
,iter = niter
,burnin = 100
,keep = 50
,nstart = 1
,best = TRUE
,seed = seedpar
)

#LDA model with topicmodels
ldatopicmodels <- LDA(as.simple_triplet_matrix(dtm)
,k = ntopics
,method = "Gibbs"
,control = control_Gibbs_topicmodels
)


#I have ordered the topics manually after printing top 15 terms and put similar (at least from my subjective standpoint) topics at the beginning
topicsterms_ldatopicmodels <- terms(ldatopicmodels,length(ldatopicmodels@terms))[,c(6,8,10,3,5,9,7,4,1,2)]
topicsterms_ldatext2vec <- ldatext2vec$get_top_words(n = nrow(vocab), topic_number = c(1:10), lambda = 1)[, c(9,6,4,10,5,3,7,2,8,1)]

#show top 15 words
topicsterms_ldatext2vec[1:15,]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] "show" "performance" "films" "war" "horror" "say" "man" "love" "know" "man"
# [2,] "series" "role" "director" "american" "killer" "better" "back" "life" "say" "woman"
# [3,] "funny" "films" "scenes" "book" "doesn" "nothing" "last" "big" "life" "life"
# [4,] "still" "music" "audience" "may" "little" "watching" "match" "real" "didn" "police"
# [5,] "original" "love" "though" "world" "isn" "know" "big" "women" "going" "father"
# [6,] "years" "cast" "may" "young" "guy" "worst" "men" "job" "now" "world"
# [7,] "version" "john" "quite" "family" "actually" "didn" "takes" "black" "something" "black"
# [8,] "episode" "play" "real" "mother" "gets" "something" "woman" "new" "things" "wife"
# [9,] "now" "man" "seems" "true" "dead" "actors" "take" "money" "back" "goes"
# [10,] "dvd" "played" "work" "years" "look" "minutes" "young" "work" "saw" "new"
# [11,] "saw" "actor" "scene" "novel" "house" "films" "life" "game" "family" "without"
# [12,] "old" "excellent" "actors" "however" "looks" "least" "city" "world" "love" "around"
# [13,] "watching" "young" "interesting" "small" "poor" "script" "town" "still" "thought" "scene"
# [14,] "watched" "perfect" "rather" "quite" "pretty" "budget" "dance" "comedy" "got" "shot"
# [15,] "better" "high" "yet" "history" "stupid" "lot" "rock" "american" "thing" "another"

topicsterms_ldatopicmodels[1:15,]
# Topic 6 Topic 8 Topic 10 Topic 3 Topic 5 Topic 9 Topic 7 Topic 4 Topic 1 Topic 2
# [1,] "show" "performance" "films" "war" "horror" "funny" "man" "love" "life" "little"
# [2,] "years" "role" "director" "american" "house" "better" "wife" "book" "love" "music"
# [3,] "series" "cast" "something" "documentary" "scene" "say" "gets" "films" "world" "action"
# [4,] "now" "actor" "enough" "part" "killer" "know" "father" "version" "young" "fun"
# [5,] "episode" "play" "doesn" "world" "sex" "watching" "back" "still" "family" "big"
# [6,] "old" "performances" "nothing" "history" "scenes" "thing" "goes" "original" "real" "rock"
# [7,] "back" "comedy" "actually" "america" "gore" "pretty" "new" "quite" "may" "king"
# [8,] "love" "played" "things" "new" "blood" "guy" "woman" "music" "man" "animation"
# [9,] "saw" "director" "seems" "hollywood" "around" "didn" "later" "years" "work" "films"
# [10,] "shows" "job" "know" "japanese" "little" "got" "home" "scenes" "little" "black"
# [11,] "new" "john" "without" "white" "woman" "worst" "money" "old" "lives" "song"
# [12,] "family" "actors" "real" "shot" "night" "thought" "son" "scene" "mother" "pretty"
# [13,] "dvd" "star" "far" "despite" "dead" "wasn" "police" "better" "men" "quite"
# [14,] "still" "excellent" "might" "still" "zombie" "minutes" "husband" "bit" "find" "musical"
# [15,] "know" "work" "fact" "early" "scary" "stupid" "town" "times" "women" "effects"

#number of total terms for each model is the same
#however, the ldatext2vec from text2vec has NA values
length(ldatopicmodels@terms)
# [1] 7407
length(ldatopicmodels@terms[ !is.na(ldatopicmodels@terms)])
# [1] 7407

terms_ldatext2vec <- unique(as.character(topicsterms_ldatext2vec))
length(terms_ldatext2vec)
# [1] 7408
length(terms_ldatext2vec[!is.na(terms_ldatext2vec)])
# [1] 7407

#number of NA entries in topic/termlists of text2vec ldatext2vec
dim(topicsterms_ldatext2vec)
#[1] 7407 10
sum(is.na(topicsterms_ldatext2vec))
# [1] 60368
#share of NA values
sum(is.na(topicsterms_ldatext2vec))/(dim(topicsterms_ldatext2vec)[1]*dim(topicsterms_ldatext2vec)[2])
#[1] 0.8150128

#no NA values in ldatopicmodels
sum(is.na(terms(ldatopicmodels, length(ldatopicmodels@terms))))
#[1] 0

#function to check number of terms that differ between two sets of topic collections (excluding NAs)
lengthsetdiff <- function(x, y) {
apply(x, 2, function(i) {
apply(y, 2, function(j) {
length(setdiff(i[!is.na(i)],j[!is.na(j)]))
})
})
}

#also the intersection can be checked between the two sets
lengthintersect <- function(x, y) {
apply(x, 2, function(i) {
apply(y, 2, function(j) {
length(intersect(i[!is.na(i)], j[!is.na(j)]))
})
})
}

#since especially the top words are of interest, we first check the intersection of top 20 words
#please note that the order of the topics, especially the last 3 is subjective
lengthintersect(topicsterms_ldatopicmodels[1:20,],
topicsterms_ldatext2vec[1:20,])
# Topic 6 Topic 8 Topic 10 Topic 3 Topic 5 Topic 9 Topic 7 Topic 4 Topic 1 Topic 2
# [1,] 13 1 0 2 0 3 1 7 1 2
# [2,] 1 9 1 0 0 0 2 4 5 4
# [3,] 0 4 8 0 2 0 0 4 3 2
# [4,] 3 0 0 5 2 0 1 5 6 2
# [5,] 1 0 3 0 7 7 1 1 1 2
# [6,] 2 3 6 0 0 10 0 3 0 1
# [7,] 4 2 1 2 1 0 8 1 4 3
# [8,] 3 4 2 5 1 1 2 3 8 5
# [9,] 10 0 4 0 0 8 1 3 3 1
# [10,] 1 0 1 3 3 0 7 1 5 2



#apply the check with the topics ordered as shown above for the top 15 words

#all words appear in each topic
lengthsetdiff(topicsterms_ldatopicmodels,
topicsterms_ldatopicmodels)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# Topic 1 0 0 0 0 0 0 0 0 0 0
# Topic 2 0 0 0 0 0 0 0 0 0 0
# Topic 3 0 0 0 0 0 0 0 0 0 0
# Topic 4 0 0 0 0 0 0 0 0 0 0
# Topic 5 0 0 0 0 0 0 0 0 0 0
# Topic 6 0 0 0 0 0 0 0 0 0 0
# Topic 7 0 0 0 0 0 0 0 0 0 0
# Topic 8 0 0 0 0 0 0 0 0 0 0
# Topic 9 0 0 0 0 0 0 0 0 0 0
# Topic 10 0 0 0 0 0 0 0 0 0 0

#not all words appear in each topic
lengthsetdiff(topicsterms_ldatext2vec ,
topicsterms_ldatext2vec )
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 0 1188 1216 1241 1086 1055 1196 1131 1126 1272
# [2,] 1029 0 1203 1223 1139 1073 1188 1140 1188 1260
# [3,] 1032 1178 0 1224 1084 1024 1186 1122 1164 1238
# [4,] 1075 1216 1242 0 1175 1139 1202 1152 1207 1271
# [5,] 1011 1223 1193 1266 0 1082 1170 1170 1160 1214
# [6,] 993 1170 1146 1243 1095 0 1178 1119 1092 1206
# [7,] 1078 1229 1252 1250 1127 1122 0 1200 1195 1227
# [8,] 1030 1198 1205 1217 1144 1080 1217 0 1171 1211
# [9,] 966 1187 1188 1213 1075 994 1153 1112 0 1198
# [10,] 1095 1242 1245 1260 1112 1091 1168 1135 1181 0

#difference of terms in topics per topic between the two models
lengthsetdiff(topicsterms_ldatopicmodels,
topicsterms_ldatext2vec)
# Topic 6 Topic 8 Topic 10 Topic 3 Topic 5 Topic 9 Topic 7 Topic 4 Topic 1 Topic 2
# [1,] 6157 6157 6157 6157 6157 6157 6157 6157 6157 6157
# [2,] 5998 5998 5998 5998 5998 5998 5998 5998 5998 5998
# [3,] 5973 5973 5973 5973 5973 5973 5973 5973 5973 5973
# [4,] 5991 5991 5991 5991 5991 5991 5991 5991 5991 5991
# [5,] 6082 6082 6082 6082 6082 6082 6082 6082 6082 6082
# [6,] 6095 6095 6095 6095 6095 6095 6095 6095 6095 6095
# [7,] 6039 6039 6039 6039 6039 6039 6039 6039 6039 6039
# [8,] 6056 6056 6056 6056 6056 6056 6056 6056 6056 6056
# [9,] 5997 5997 5997 5997 5997 5997 5997 5997 5997 5997
# [10,] 5980 5980 5980 5980 5980 5980 5980 5980 5980 5980

lengthsetdiff(topicsterms_ldatext2vec,
topicsterms_ldatopicmodels)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# Topic 6 0 0 0 0 0 0 0 0 0 0
# Topic 8 0 0 0 0 0 0 0 0 0 0
# Topic 10 0 0 0 0 0 0 0 0 0 0
# Topic 3 0 0 0 0 0 0 0 0 0 0
# Topic 5 0 0 0 0 0 0 0 0 0 0
# Topic 9 0 0 0 0 0 0 0 0 0 0
# Topic 7 0 0 0 0 0 0 0 0 0 0
# Topic 4 0 0 0 0 0 0 0 0 0 0
# Topic 1 0 0 0 0 0 0 0 0 0 0
# Topic 2 0 0 0 0 0 0 0 0 0 0

This Q&A is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/46788242/
