
java - Getting the instance and topic sequence of every document in mallet

Reposted · Author: 行者123 · Updated: 2023-12-02 02:32:36

I am using the mallet library for topic modeling. My data set is at the path filePath, and the CsvIterator appears to read it correctly, because model.getData() has about 27,000 rows, which matches the size of my data set. I wrote a loop that prints the instance and topic sequence of the first 10 documents, but the token size is 0. Where did I go wrong?

Below that, I want to display the top 5 words of each topic, together with the topic proportions, for the first 10 documents, but the output is identical for every document.

Example output in the console:

---- document 0

0 0.200 com (1723) twitter (1225) http (871) cbr (688) canberra (626)

1 0.200 com (981) twitter (901) day (205) may (159) wed (156)

2 0.200 twitter (1068) com (947) act (433) actvcc (317) canberra (302)

3 0.200 http (1039) canberra (841) jobs (378) dlvr (313) com (228)

4 0.200 com (1185) www (1074) http (831) news (708) canberratimes (560)

---- document 1

0 0.200 com (1723) twitter (1225) http (871) cbr (688) canberra (626)

1 0.200 com (981) twitter (901) day (205) may (159) wed (156)

2 0.200 twitter (1068) com (947) act (433) actvcc (317) canberra (302)

3 0.200 http (1039) canberra (841) jobs (378) dlvr (313) com (228)

4 0.200 com (1185) www (1074) http (831) news (708) canberratimes (560)

As far as I understand, LDA generates the words of each document and assigns them to topics. So why is the result the same for every document?

ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
pipeList.add(new CharSequenceLowercase());
pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
//stoplists/en.txt
pipeList.add(new TokenSequenceRemoveStopwords(new File(pathStopWords), "UTF-8", false, false, false));
pipeList.add(new TokenSequence2FeatureSequence());

InstanceList instances = new InstanceList(new SerialPipes(pipeList));

Reader fileReader = new InputStreamReader(new FileInputStream(new File(filePath)), "UTF-8");
//header of my data set
// row,location,username,hashtaghs,text,retweets,date,favorites,numberOfComment
CsvIterator csvIterator = new CsvIterator(fileReader,
        Pattern.compile("^(\\d+)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*([^,]*)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*[^,]*$"),
        2, 0, 1);
instances.addThruPipe(csvIterator); // data, label, name fields
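The last three CsvIterator arguments map regex groups to instance fields: group 2 is the data (the text column), group 0 the label, group 1 the name (the row id). A quick way to sanity-check what actually lands in each instance is to run the same pattern by hand outside mallet. A minimal, self-contained sketch (the sample row is invented, following the header above):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvRegexDemo {
    // Same pattern as in the question: group 1 = row id (instance name),
    // group 2 = the text column (instance data).
    static final Pattern LINE = Pattern.compile(
            "^(\\d+)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*([^,]*)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*[^,]*$");

    /** Returns {name, data} for a matching line, or null if the line does not match. */
    static String[] extract(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        // Hypothetical row following the header
        // row,location,username,hashtaghs,text,retweets,date,favorites,numberOfComment
        String[] r = extract("42,Canberra,someuser,#cbr,new jobs in canberra http://dlvr.it/abc,5,2017-10-01,3,0");
        System.out.println("name = " + r[0]);
        System.out.println("data = " + r[1]);
    }
}
```

Note that the pattern allots exactly nine comma-free fields, so a line whose text column itself contains a comma fails to match and is silently skipped by CsvIterator; checking a few rows this way is cheap when the token lists come out empty.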

int numTopics = 5;
ParallelTopicModel model = new ParallelTopicModel(numTopics, 1.0, 0.01);

model.addInstances(instances);

model.setNumThreads(2);


model.setNumIterations(50);
model.estimate();

Alphabet dataAlphabet = instances.getDataAlphabet();
ArrayList<TopicAssignment> arrayTopics = model.getData();

for (int i = 0; i < 10; i++) {
    System.out.println("---- document " + i);
    FeatureSequence tokens = (FeatureSequence) model.getData().get(i).instance.getData();
    LabelSequence topics = model.getData().get(i).topicSequence;

    Formatter out = new Formatter(new StringBuilder(), Locale.US);
    for (int position = 0; position < tokens.getLength(); position++) {
        out.format("%s-%d ", dataAlphabet.lookupObject(tokens.getIndexAtPosition(position)),
                topics.getIndexAtPosition(position));
    }
    System.out.println(out);

    double[] topicDistribution = model.getTopicProbabilities(i);

    ArrayList<TreeSet<IDSorter>> topicSortedWords = model.getSortedWords();

    for (int topic = 0; topic < numTopics; topic++) {
        Iterator<IDSorter> iterator = topicSortedWords.get(topic).iterator();
        out = new Formatter(new StringBuilder(), Locale.US);
        out.format("%d\t%.3f\t", topic, topicDistribution[topic]);
        int rank = 0;
        while (iterator.hasNext() && rank < 5) {
            IDSorter idCountPair = iterator.next();
            out.format("%s (%.0f) ", dataAlphabet.lookupObject(idCountPair.getID()), idCountPair.getWeight());
            rank++;
        }
        System.out.println(out);
    }

    StringBuilder topicZeroText = new StringBuilder();
    Iterator<IDSorter> iterator = topicSortedWords.get(0).iterator();

    int rank = 0;
    while (iterator.hasNext() && rank < 5) {
        IDSorter idCountPair = iterator.next();
        topicZeroText.append(dataAlphabet.lookupObject(idCountPair.getID()) + " ");
        rank++;
    }
}

Best Answer

Topics are defined at the model level, not at the document level, so the top words per topic are expected to be identical no matter which document you are looking at.

It looks like all of your text is URLs. Adding a PrintInputPipe to your import sequence might help with debugging.
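One reason URL-heavy text behaves oddly here: the token pattern \p{L}[\p{L}\p{P}]+\p{L} treats punctuation as word-internal, so a whole URL can survive as a single long token, while two-letter words like "in" are dropped entirely (the pattern requires at least three characters). A self-contained sketch of just that tokenization step, using only java.util.regex (the sample tweet is invented; what your real data produces is exactly what a debugging pipe would show you):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenPatternDemo {
    // Same token pattern the question passes to CharSequence2TokenSequence.
    static final Pattern TOKEN = Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // CharSequenceLowercase runs before tokenization in the question's pipeline.
        Matcher m = TOKEN.matcher(text.toLowerCase());
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "in" vanishes (too short); the URL comes through as one token
        // because ':', '/', and '.' all count as \p{P}.
        System.out.println(tokenize("Big jobs in Canberra see http://dlvr.it/abc"));
    }
}
```

Depending on surrounding characters (digits, for instance, match neither \p{L} nor \p{P} and split tokens), real URLs may instead shatter into fragments like http, twitter, and com, which is consistent with the top words in the output above.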

Regarding "java - Getting the instance and topic sequence of every document in mallet", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/46851509/
