
python - How to use NLTK BigramAssocMeasures.chi_sq

Reposted. Author: 行者123. Updated: 2023-11-30 21:50:31

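I have a list of words, and I want to measure how strongly two words are related by looking at how often they co-occur. A paper I read says this can be computed with Pearson's chi-square test. I also found nltk.BigramAssocMeasures.chi_sq(), which computes a chi-square value.

Can I use it for this purpose? How do I get the chi-square value with nltk?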

Best answer

Take a look at this blog from Streamhacker, which gives a good explanation with code examples.

One of the best metrics for information gain is chi square. NLTK includes this in the BigramAssocMeasures class in the metrics package. To use it, first we need to calculate a few frequencies for each word: its overall frequency and its frequency within each class. This is done with a FreqDist for overall frequency of words, and a ConditionalFreqDist where the conditions are the class labels. Once we have those numbers, we can score words with the BigramAssocMeasures.chi_sq function, then sort the words by score and take the top 10000. We then put these words into a set, and use a set membership test in our feature selection function to select only those words that appear in the set. Now each file is classified based on the presence of these high information words.
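The steps in the quoted passage could be sketched roughly as follows. This is a minimal sketch and not the blog's own code: the toy LABELED_DOCS data and the helper names score_words, select_best_words, best_word_feats, and pair_chi_sq are hypothetical, chosen only for illustration. The NLTK calls it relies on are FreqDist, ConditionalFreqDist, and BigramAssocMeasures.chi_sq(n_ii, (n_ix, n_xi), n_xx), where n_ii is the joint count, n_ix and n_xi are the two marginal counts, and n_xx is the total.

```python
from nltk.metrics import BigramAssocMeasures
from nltk.probability import ConditionalFreqDist, FreqDist

# Hypothetical toy data: (label, tokens) pairs standing in for a labeled corpus.
LABELED_DOCS = [
    ("pos", ["great", "fun", "movie", "great", "cast"]),
    ("pos", ["fun", "great", "story", "the", "movie"]),
    ("neg", ["boring", "movie", "bad", "cast", "the"]),
    ("neg", ["bad", "boring", "story", "bad", "plot"]),
]


def score_words(labeled_docs):
    """Score each word by summing its chi-square association with every label."""
    word_fd = FreqDist()                     # overall frequency of each word
    label_word_fd = ConditionalFreqDist()    # frequency of each word per label
    for label, tokens in labeled_docs:
        for token in tokens:
            word = token.lower()
            word_fd[word] += 1
            label_word_fd[label][word] += 1

    total_count = word_fd.N()
    scores = {}
    for word, freq in word_fd.items():
        scores[word] = sum(
            BigramAssocMeasures.chi_sq(
                label_word_fd[label][word],        # n_ii: word count within label
                (freq, label_word_fd[label].N()),  # (n_ix, n_xi): word total, label total
                total_count,                       # n_xx: all word occurrences
            )
            for label in label_word_fd.conditions()
        )
    return scores


def select_best_words(scores, n):
    """Return the n highest-scoring words as a set for fast membership tests."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])


def best_word_feats(words, best_words):
    """Keep only the words that are in the high-information set."""
    return {word: True for word in words if word in best_words}


def pair_chi_sq(pair_count, first_count, second_count, total_count):
    """Chi-square for a word pair, given its co-occurrence count and marginals."""
    return BigramAssocMeasures.chi_sq(
        pair_count, (first_count, second_count), total_count
    )


if __name__ == "__main__":
    word_scores = score_words(LABELED_DOCS)
    best = select_best_words(word_scores, 4)
    print(sorted(best))
    print(best_word_feats(["great", "movie", "bad"], best))
```

For the word-pair case in the question, pair_chi_sq shows the same call with co-occurrence counts: the number of times the two words occur together, each word's own count, and the total number of observations. How those counts are collected (adjacent bigrams, a window, or whole documents) is a choice left to the caller. Note that chi_sq divides by the product of the marginals, so it fails if any marginal is zero, for example with a single label or a word that makes up the entire corpus.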

This post covers python - How to use NLTK BigramAssocMeasures.chi_sq. A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15401497/
