python - What is the total bigram count used by NLTK's BigramCollocationFinder?

I am trying to reproduce some common NLP metrics with my own code, including Manning and Schütze's t-test for collocation significance and the chi-square test for collocation significance.

I call nltk.bigrams() on the following list of 24 tokens:

tokens = ['she', 'knocked', 'on', 'his', 'door', 'she', 'knocked', 'at',
          'the', 'door', '100', 'women', 'knocked', 'on', "Donaldson's", 'door', 'a',
          'man', 'knocked', 'on', 'the', 'metal', 'front', 'door']

I get 23 bigrams:

[('she', 'knocked'), ('knocked', 'on'), ('on', 'his'), ('his', 'door'), ('door', 'she'),
 ('she', 'knocked'), ('knocked', 'at'), ('at', 'the'), ('the', 'door'), ('door', '100'),
 ('100', 'women'), ('women', 'knocked'), ('knocked', 'on'), ('on', "Donaldson's"),
 ("Donaldson's", 'door'), ('door', 'a'), ('a', 'man'), ('man', 'knocked'),
 ('knocked', 'on'), ('on', 'the'), ('the', 'metal'), ('metal', 'front'),
 ('front', 'door')]
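(As a sanity check, this bigram list can be reproduced without NLTK by pairing each token with its right-hand neighbour; a minimal sketch:)

```python
tokens = ['she', 'knocked', 'on', 'his', 'door', 'she', 'knocked', 'at',
          'the', 'door', '100', 'women', 'knocked', 'on', "Donaldson's", 'door', 'a',
          'man', 'knocked', 'on', 'the', 'metal', 'front', 'door']

# Each bigram pairs a token with its successor, so a list of N tokens
# always yields N - 1 bigrams -- 24 tokens, 23 bigrams here.
bigrams = list(zip(tokens, tokens[1:]))

print(len(tokens), len(bigrams))          # 24 23
print(bigrams.count(('she', 'knocked')))  # 2
```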

If I want to determine the t-statistic for ('she', 'knocked'), I enter:

# Total bigrams is 23
t = (2/23 - (2/23) * (4/23)) / math.sqrt((2/23) / 23)
# t = 1.16826337761

However, when I try:

finder = BigramCollocationFinder.from_words(tokens)
student_t = finder.score_ngrams(bigram_measures.student_t)
# the result list contains (('she', 'knocked'), 1.178511301977579)

When I change the size of the bigram population to 24 (the length of the original token list), I get the same answer as NLTK:

('she', 'knocked'): 1.17851130198
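The two hand computations can be wrapped in a small helper (a sketch of the Manning and Schütze probability-form formula, not NLTK's code) to reproduce both numbers side by side:

```python
import math

def t_stat(bigram_count, w1_count, w2_count, n):
    """t = (x_bar - mu) / sqrt(x_bar / n), where x_bar is the observed
    bigram probability and mu is the product of the unigram
    probabilities under the independence hypothesis."""
    x_bar = bigram_count / n
    mu = (w1_count / n) * (w2_count / n)
    return (x_bar - mu) / math.sqrt(x_bar / n)

print(t_stat(2, 2, 4, 23))  # 1.168... (population = #Bigrams)
print(t_stat(2, 2, 4, 24))  # 1.178... (population = #Tokens, NLTK's answer)
```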

My question is simple: what do I use as the population count for these hypothesis tests, the length of the tokenized list or the length of the bigram list? Or does the procedure count some terminal unit that is not in the output of the nltk.bigrams() method?

Best Answer

First, let's dig out score_ngram() from nltk.collocations.BigramCollocationFinder. See https://github.com/nltk/nltk/blob/develop/nltk/collocations.py:

def score_ngram(self, score_fn, w1, w2):
    """Returns the score for a given bigram using the given scoring
    function. Following Church and Hanks (1990), counts are scaled by
    a factor of 1/(window_size - 1).
    """
    n_all = self.word_fd.N()
    n_ii = self.ngram_fd[(w1, w2)] / (self.window_size - 1.0)
    if not n_ii:
        return
    n_ix = self.word_fd[w1]
    n_xi = self.word_fd[w2]
    return score_fn(n_ii, (n_ix, n_xi), n_all)
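To make those marginals concrete for the example above, the same counts can be reproduced with a plain collections.Counter (a sketch mirroring what score_ngram reads from word_fd and ngram_fd, with the default window_size of 2):

```python
from collections import Counter

tokens = ['she', 'knocked', 'on', 'his', 'door', 'she', 'knocked', 'at',
          'the', 'door', '100', 'women', 'knocked', 'on', "Donaldson's", 'door', 'a',
          'man', 'knocked', 'on', 'the', 'metal', 'front', 'door']

word_fd = Counter(tokens)                    # unigram frequencies
ngram_fd = Counter(zip(tokens, tokens[1:]))  # bigram frequencies

n_all = sum(word_fd.values())        # word_fd.N(): total tokens -> 24
n_ii = ngram_fd[('she', 'knocked')]  # bigram count -> 2
n_ix = word_fd['she']                # w1 marginal -> 2
n_xi = word_fd['knocked']            # w2 marginal -> 4

print(n_all, n_ii, n_ix, n_xi)  # 24 2 2 4
```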

Next, let's look at student_t() from nltk.metrics.association; see https://github.com/nltk/nltk/blob/develop/nltk/metrics/association.py:

### Indices to marginals arguments:

NGRAM = 0
"""Marginals index for the ngram count"""

UNIGRAMS = -2
"""Marginals index for a tuple of each unigram count"""

TOTAL = -1
"""Marginals index for the number of words in the data"""

@classmethod
def student_t(cls, *marginals):
    """Scores ngrams using Student's t test with independence hypothesis
    for unigrams, as in Manning and Schutze 5.3.1.
    """
    return ((marginals[NGRAM] -
             _product(marginals[UNIGRAMS]) /
             float(marginals[TOTAL] ** (cls._n - 1))) /
            (marginals[NGRAM] + _SMALL) ** .5)

where _product and _SMALL are:

_product = lambda s: reduce(lambda x, y: x * y, s)
_SMALL = 1e-20

Back to your example:

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

tokens = ['she', 'knocked', 'on', 'his', 'door', 'she', 'knocked', 'at',
          'the', 'door', '100', 'women', 'knocked', 'on', "Donaldson's", 'door', 'a',
          'man', 'knocked', 'on', 'the', 'metal', 'front', 'door']

finder = BigramCollocationFinder.from_words(tokens)
bigram_measures = BigramAssocMeasures()
print(finder.word_fd.N())

student_t = {k: v for k, v in finder.score_ngrams(bigram_measures.student_t)}
print(student_t['she', 'knocked'])

[out]:

24
1.17851130198

So NLTK takes the number of tokens as the population count, i.e. 24. But I would say that is not how the student_t score is normally computed: I would use #Ngrams rather than #Tokens, see nlp.stanford.edu/fsnlp/promo/colloc.pdf and www.cse.unt.edu/~rada/CSCE5290/Lectures/Collocations.ppt. Since the population size is a constant, though, and since #Tokens = #Ngrams + 1 for bigrams, the difference has little effect once #Tokens is large.
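That claim can be checked numerically with the probability-form t statistic from the question. The counts below for a ~1,000,000-token corpus are made up purely for illustration:

```python
import math

def t_prob(c12, c1, c2, n):
    # Probability-form t statistic, as in the question's hand calculation.
    x = c12 / n
    return (x - (c1 / n) * (c2 / n)) / math.sqrt(x / n)

# Hypothetical counts: population = #Tokens (1,000,000) vs
# population = #Ngrams (999,999) barely moves the score.
print(t_prob(30, 200, 500, 1_000_000))
print(t_prob(30, 200, 500, 999_999))
```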

Let's keep dissecting how NLTK computes student_t. If we lift student_t() out of its class and just feed it the arguments, we get the same output:

import math
from functools import reduce  # reduce is not a builtin in Python 3

NGRAM = 0
"""Marginals index for the ngram count"""

UNIGRAMS = -2
"""Marginals index for a tuple of each unigram count"""

TOTAL = -1
"""Marginals index for the number of words in the data"""

_product = lambda s: reduce(lambda x, y: x * y, s)
_SMALL = 1e-20

def student_t(*marginals):
    """Scores ngrams using Student's t test with independence hypothesis
    for unigrams, as in Manning and Schutze 5.3.1.
    """
    _n = 2  # bigrams
    return ((marginals[NGRAM] -
             _product(marginals[UNIGRAMS]) /
             float(marginals[TOTAL] ** (_n - 1))) /
            (marginals[NGRAM] + _SMALL) ** .5)

ngram_freq = 2
w1_freq = 2
w2_freq = 4
total_num_words = 24

print(student_t(ngram_freq, (w1_freq, w2_freq), total_num_words))

So we see that in NLTK, the student_t score for a bigram is computed as:

import math
print((2 - 2 * 4 / float(24)) / math.sqrt(2 + 1e-20))  # 1.17851130198

Or, as a formula:

(ngram_freq - (w1_freq * w2_freq) / total_num_words) / sqrt(ngram_freq + 1e-20)
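Note that this count form is algebraically the same as the probability form used in the question's hand calculation: multiply the numerator and denominator of the probability form by the population size and the factors cancel. A quick check:

```python
import math

n_ii, n_ix, n_xi, n_all = 2, 2, 4, 24

# Count form, as NLTK computes it:
t_counts = (n_ii - n_ix * n_xi / n_all) / math.sqrt(n_ii + 1e-20)

# Probability form, as in the question's hand calculation:
x_bar = n_ii / n_all
mu = (n_ix / n_all) * (n_xi / n_all)
t_probs = (x_bar - mu) / math.sqrt(x_bar / n_all)

print(t_counts, t_probs)  # both ~1.17851130198
```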

For the original discussion of "What is the total bigram count used by NLTK's BigramCollocationFinder?", see the thread on Stack Overflow: https://stackoverflow.com/questions/24093509/
