
machine-learning - Document relevance scoring based on topic modeling

Reposted. Author: 行者123. Updated: 2023-11-30 08:40:40

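I currently have a topic model trained with MALLET ( http://mallet.cs.umass.edu/topics.php ) on roughly 80,000 collected news articles, all of which belong to a single category.

I would like to assign a relevance score to each new article as it arrives (it may or may not be related to the category). Is there a way to achieve this? I have read about TF-IDF, but it seems to score existing articles rather than any incoming new ones. The ultimate goal is to filter out articles that are likely irrelevant.

Any ideas or help would be much appreciated. Thanks!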

Best answer

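Once you have the model (the topics), you can evaluate new, unseen documents against it (the --evaluator-filename [FILENAME] option writes out the evaluator that is then applied to the new, unseen documents). The relevant section of the MALLET documentation reads: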

Topic Held-out probability

--evaluator-filename [FILENAME] The previous section describes how to get topic proportions for new documents. We often want to estimate the log probability of new documents, marginalized over all topic configurations. Use the MALLET command bin/mallet evaluate-topics --help to get information on using held-out probability estimation. As with topic inference, you must make sure that the new data is compatible with your training data. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command bin/mallet import-file or import-dir to specify a training file.
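One practical point when turning held-out log probability into a relevance filter: longer documents get lower total log probabilities simply because they contain more tokens, so it usually helps to normalize by document length before comparing against a cutoff. The sketch below assumes you have already obtained one log probability and one token count per new document; the function names, the sample values, and the cutoff of -8.0 are all hypothetical and would need tuning on articles known to be in the category.

```python
def per_token_log_likelihood(doc_log_prob, num_tokens):
    """Length-normalize a document's held-out log probability."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return doc_log_prob / num_tokens


def flag_relevant(doc_log_probs, token_counts, min_per_token):
    """Return True for each document whose per-token log likelihood
    is at or above the (hypothetical) cutoff min_per_token."""
    return [
        per_token_log_likelihood(lp, n) >= min_per_token
        for lp, n in zip(doc_log_probs, token_counts)
    ]


# Made-up values for illustration only: total log probability and
# token count for two new articles.
log_probs = [-350.0, -900.0]
token_counts = [50, 100]

flags = flag_relevant(log_probs, token_counts, min_per_token=-8.0)
print(flags)
```

A reasonable way to pick the cutoff is to compute the per-token values for a held-out sample of in-category articles and choose a low percentile of that distribution.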

Note: I have worked more with gensim's LDA and LSI models, where you can pass in a new document as follows:

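# `dictionary` and `lda_model` are the gensim dictionary and LDA model built during training
new_doc = "Human computer interaction"
# Convert the new document to a bag-of-words vector using the training dictionary
new_vec = dictionary.doc2bow(new_doc.lower().split())
# Infer the topic distribution of the new document
print(lda_model[new_vec])

# output: [(0, 0.020229542), (1, 0.49642297)]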

Interpretation: the pair (1, 0.49642297) means that, of the 2 topics (categories) in the model, the new document is best represented by topic #1. In your case, you can take the maximum value from the output list as the relevance "coefficient": a high coefficient suggests the document belongs to the category, and a low one suggests it does not. (I used 2 topics here only to make the illustration clearer. If you have just one topic, simply set a minimum threshold, for example 0.40: a document that scores above it is in the category, otherwise it is not.)
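The thresholding rule above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `relevance_score` and `is_relevant` are hypothetical, the input is assumed to be a list of (topic_id, probability) pairs like the output shown earlier, and 0.40 is just the example threshold from the text. The empty-list case is handled because gensim can omit topics whose probability falls below its minimum-probability setting.

```python
def relevance_score(topic_distribution):
    """Return the highest topic probability, or 0.0 if no topic is reported."""
    return max((prob for _, prob in topic_distribution), default=0.0)


def is_relevant(topic_distribution, threshold=0.40):
    """Treat a document as in-category when its best topic
    probability reaches the (example) threshold."""
    return relevance_score(topic_distribution) >= threshold


# The topic distribution from the example output above.
doc_topics = [(0, 0.020229542), (1, 0.49642297)]

score = relevance_score(doc_topics)
print(score, is_relevant(doc_topics))
```

In practice the threshold is best chosen by scoring a sample of articles whose relevance is already known and checking where the two groups separate.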

The original question, "machine-learning - Document relevance scoring based on topic modeling", can be found on Stack Overflow: https://stackoverflow.com/questions/51471359/
