
R LDA topic modeling: result topics contain very similar words

Reposted. Author: 行者123. Updated: 2023-12-04 10:58:24

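All,

I am a beginner at topic modeling in R; it all started three weeks ago. I can successfully turn my data into a corpus and a document-term matrix and run the LDA function on it. My input is tweets, about 460,000 of them. But I am not happy with the results: the words in all the topics are very similar.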

packages <- c('tm','topicmodels','SnowballC','RWeka','rJava')
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))
}

options( java.parameters = "-Xmx4g" )
library(tm)
library(topicmodels)
library(SnowballC)
library(RWeka)

print("Please select the input file");
flush.console();
ifilename <- file.choose();
raw_data=scan(ifilename,'string',sep="\n",skip=1);

tweet_data=raw_data;
rm(raw_data);
tweet_data = gsub("(RT|via)((?:\\b\\W*@\\w+)+)","",tweet_data)
tweet_data = gsub("http[^[:blank:]]+", "", tweet_data)
tweet_data = gsub("@\\w+", "", tweet_data)
tweet_data = gsub("[ \t]{2,}", "", tweet_data)
tweet_data = gsub("^\\s+|\\s+$", "", tweet_data)
tweet_data = gsub('\\d+', '', tweet_data)
tweet_data = gsub("[[:punct:]]", " ", tweet_data)

max_size=5000;
data_size=length(tweet_data);
itinerary=ceiling(data_size[1]/max_size);
if (itinerary==1){pre_data=tweet_data}else {pre_data=tweet_data[1:max_size]}

corp <- Corpus(VectorSource(pre_data));
corp<-tm_map(corp,tolower);
corp<-tm_map(corp,removePunctuation);
extend_stop_word=c('description:','null','text:','description','url','text','aca',
'obama','romney','ryan','mitt','conservative','liberal');
corp<-tm_map(corp,removeNumbers);
gc();
IteratedLovinsStemmer(corp, control = NULL)
gc();
corp<-tm_map(corp,removeWords,c(stopwords('english'),extend_stop_word));
gc();
corp <- tm_map(corp, PlainTextDocument)
gc();
dtm.control = list(tolower= F,removePunctuation=F,removeNumbers= F,
stemming= F, minWordLength = 3,weighting= weightTf,stopwords=F)

dtm = DocumentTermMatrix(corp, control=dtm.control)
gc();
#dtm = removeSparseTerms(dtm,0.99)
dtm = dtm[rowSums(as.matrix(dtm))>0,]
gc();

# Candidate models for k = 2, 4, ..., 50; note that they are fit on the first 10 documents only
best.model <- lapply(seq(2,50, by=2), function(k){LDA(dtm[1:10,], k)})
gc();
best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
best.model.logLik.df <- data.frame(topics=c(seq(2,50, by=2)), LL=as.numeric(as.matrix(best.model.logLik)))
k=best.model.logLik.df[which.max(best.model.logLik.df$LL),1];
cat("Best topic number is k=",k);
flush.console();
gc();
lda.model = LDA(dtm, k,method='VEM')
gc();
write.csv(terms(lda.model,50), file = "terms.csv");
lda_topics=topics(lda.model,1);

Here are the results I got:

> terms(lda.model,10)
      Topic 1     Topic 2    Topic 3    Topic 4    Topic 5
 [1,] "taxes"     "medicare" "tax"      "tax"      "jobs"
 [2,] "pay"       "will"     "returns"  "returns"  "plan"
 [3,] "welfare"   "tax"      "gop"      "taxes"    "gop"
 [4,] "will"      "care"     "taxes"    "will"     "military"
 [5,] "returns"   "can"      "abortion" "paul"     "will"
 [6,] "plan"      "laden"    "can"      "medicare" "tax"
 [7,] "economy"   "vote"     "tcot"     "class"    "paul"
 [8,] "budget"    "economy"  "muslim"   "budget"   "campaign"
 [9,] "president" "taxes"    "campaign" "says"     "says"
[10,] "reid"      "just"     "economy"  "cuts"     "can"
      Topic 6     Topic 7     Topic 8    Topic 9
 [1,] "medicare"  "tax"       "medicare" "tax"
 [2,] "taxes"     "medicare"  "tax"      "president"
 [3,] "plan"      "taxes"     "jobs"     "jobs"
 [4,] "tcot"      "tcot"      "tcot"     "taxes"
 [5,] "budget"    "president" "foreign"  "medicare"
 [6,] "returns"   "jobs"      "plan"     "tcot"
 [7,] "welfare"   "budget"    "will"     "paul"
 [8,] "can"       "energy"    "economy"  "health"
 [9,] "says"      "military"  "bush"     "people"
[10,] "obamacare" "want"      "now"      "gop"
      Topic 10    Topic 11   Topic 12
 [1,] "tax"       "gop"      "gop"
 [2,] "medicare"  "tcot"     "plan"
 [3,] "tcot"      "military" "tax"
 [4,] "president" "jobs"     "taxes"
 [5,] "gop"       "energy"   "welfare"
 [6,] "plan"      "will"     "tcot"
 [7,] "jobs"      "ohio"     "military"
 [8,] "will"      "abortion" "campaign"
 [9,] "cuts"      "paul"     "class"
[10,] "paul"      "budget"   "just"

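As you can see, words such as "tax" and "medicare" run through all the topics. I noticed that when I use dtm = removeSparseTerms(dtm,0.99), the results change slightly. Here is a sample of my input data: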

> tweet_data[1:10]
[1] " While Romney friends get richer MT Romney Ryan Economic Plans Would Increase Unemployment Deepen Recession"
[2] "Wayne Allyn Root claims proof of Obama s foreign citizenship During a radio show interview Resist"
[3] " President Obama Chief Investor Leave Energy Upgrades to the Businesses Reading President Obama誷 latest Execu "
[4] " Brotherhood starts crucifixions Opponents of Egypt s Muslim president executed naked on trees Obama s tcot"
[5] " Say you stand with President Obama裻he candidate in this election who trusts women to make their own health decisions "
[6] " Romney Ryan Descend Into Medicare Gibberish "
[7] "Maddow Romney demanded opponents tax returns and lied about residency in The Raw Story"
[8] "Is it not grand How can Jews reconcile Obama Carter s treatment of Jews Israel How ca "
[9] " The Tax Returns are Hurting Romney Badly "
[10] " Replacing Gen Dempsey is cruicial to US security Dempsey disappointed by anti Obama campaign by ex military members h "
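
For reference, here is a rough sketch of the kind of filtering that the removeSparseTerms(dtm,0.99) call mentioned above performs, written in base R on a toy count matrix. As far as I understand the tm documentation, sparse = 0.99 drops terms that are absent from at least 99% of the documents, i.e. it removes rare terms only. The helper filter_terms_by_doc_freq and its max_frac upper bound are hypothetical and not part of tm; the upper bound is an assumption added to show how terms that occur in nearly every document could be dropped as well.

```r
# Hypothetical helper (not part of tm): filter the columns (terms) of a plain
# document-by-term count matrix by document frequency.
filter_terms_by_doc_freq <- function(counts, min_frac = 0.01, max_frac = 1) {
  n_docs <- nrow(counts)
  # Number of documents in which each term occurs at least once
  doc_freq <- colSums(counts > 0)
  # Keep terms that are neither too rare nor too common
  keep <- doc_freq > n_docs * min_frac & doc_freq <= n_docs * max_frac
  counts[, keep, drop = FALSE]
}

# Toy data (made up for illustration): 4 documents, 3 terms
counts <- matrix(
  c(1, 0, 2,
    3, 0, 1,
    1, 1, 0,
    2, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("tax", "ohio", "energy"))
)

# Lower bound only: drop terms found in 25% of the documents or fewer
rare_removed <- filter_terms_by_doc_freq(counts, min_frac = 0.25)

# Both bounds: also drop terms found in more than 75% of the documents
both_removed <- filter_terms_by_doc_freq(counts, min_frac = 0.25, max_frac = 0.75)

print(colnames(rare_removed))
print(colnames(both_removed))
```

In this toy matrix the lower bound alone leaves the ubiquitous term "tax" in place, which may be one reason why removing sparse terms changes the topics only slightly.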

Please help! Thanks!

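Best answer

Reduce the number of topics in your case. This will improve the clustering ability of the topic model. Right now you are overlapping your existing model with another one. Because the topic indices change between iterations, it is hard to follow up on the results or to compare them.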
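
Since topic indices are not stable between runs, one way to compare models with different numbers of topics is a summary that does not depend on the indices. The sketch below computes the mean pairwise Jaccard overlap of the top terms across topics. The helper mean_topic_overlap is hypothetical (it is not part of topicmodels), and the input is the top-10 terms of Topics 1 to 3 copied from the output in the question.

```r
# Hypothetical helper: mean pairwise Jaccard overlap between the columns of a
# character matrix of top terms (one column per topic).
mean_topic_overlap <- function(top_terms) {
  k <- ncol(top_terms)
  pairs <- combn(k, 2)  # every unordered pair of topic indices
  overlaps <- apply(pairs, 2, function(p) {
    a <- top_terms[, p[1]]
    b <- top_terms[, p[2]]
    length(intersect(a, b)) / length(union(a, b))
  })
  mean(overlaps)
}

# Top-10 terms of Topics 1 to 3, copied from the question's output
top_terms <- cbind(
  "Topic 1" = c("taxes", "pay", "welfare", "will", "returns",
                "plan", "economy", "budget", "president", "reid"),
  "Topic 2" = c("medicare", "will", "tax", "care", "can",
                "laden", "vote", "economy", "taxes", "just"),
  "Topic 3" = c("tax", "returns", "gop", "taxes", "abortion",
                "can", "tcot", "muslim", "campaign", "economy")
)

overlap <- mean_topic_overlap(top_terms)
print(overlap)
```

With a fitted model, the same helper could presumably be applied as mean_topic_overlap(terms(lda.model, 10)) for several values of k; a lower value would suggest more distinct topics, though this is only a rough heuristic.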

Regarding "R LDA topic modeling: result topics contain very similar words", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/26197458/
