python - Correct implementation of "third order" Kneser-Ney smoothing (for a trigram model)

Reposted by: 太空宇宙. Updated: 2023-11-03 16:44:16

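In the code below I try to compute trigram probabilities using Kneser-Ney smoothing with a fixed discount. I went through the main references that describe Kneser-Ney, by Goodman & Chen and by Dan Jurafsky. This [question](https://stats.stackexchange.com/questions/114863/in-kneser-ney-smoothing-how-are-unseen-words-handled) on Stack Exchange is a good summary of the bigram case.

I find it hard to derive an implementation of Kneser-Ney from the mathematical formulation of the trigram case, because the formulas are fairly complex and hard to digest. After a long search, I could not find a code-level explanation of the method.

I assume a closed vocabulary, and I want to check whether this code is a correct implementation.

Specifically, the function score_trigram(self,tri_g) takes a trigram as a tuple ('u','v','w') and tries to compute the log of its probability according to Kneser-Ney. The dictionaries shown in the __init__ method store the unigram, bigram, and trigram frequencies learned from some corpus.

Assume these frequency counts are given and correctly initialized.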

Given a trigram (a,b,c) with a nonzero count, the high-level Kneser-Ney formulas are:

P((a,b,c)) = P_ML_discounted((a,b,c)) + total_discount_1 * P_KN((b,c))

P_ML_discounted((a,b,c)) = count((a,b,c)) - discount/count((a,b))

total_discount_1 = discount * follow_up_count((a,b))/count((a,b))

P_KN((b,c)) = continuation_count((b,c))/count_unique_trigrams + total_discount_2 * P_KN(c)

total_discount_2 = discount + follow_up_count(b)/count_unique_bigrams

P_KN(c) = continuation_count(c) - discount/count_unique_bigrams + discount * 1/vocabulary_size

I have two questions:
1- Are the equations above correct for the Kneser-Ney trigram case?

2- Does the scoring function in the code implement them correctly?

import collections
import math


class CustomLanguageModel:

    def __init__(self, corpus):
        """Initialize your data structures in the constructor."""
        ### n-gram counts
        # trigram dict entry > ('word_a','word_b','word_c') : 10
        self.trigramCounts = collections.defaultdict(lambda: 0)

        # bigram dict entry > ('word_a','word_b') : 11
        self.bigramCounts = collections.defaultdict(lambda: 0)

        # unigram dict entry > 'word_a' : 15
        self.unigramCounts = collections.defaultdict(lambda: 0)

        ### Kneser-Ney (KN) counts

        '''The follow_up count of a bigram (a,b) is the number of unique trigrams
        that start with (a,b). For example, if the frequency of the trigram (a,b,c)
        is 3, this increments the follow_up count of (a,b) by one; if the frequency
        of (a,b,d) is 5, this adds one more to the follow_up count of (a,b).'''
        # dict entry as >  ('word_a','word_b') : 7
        self.bigram_follow_up_dict = collections.defaultdict(lambda: 0)

        '''The continuation count of a bigram (y,z) is the number of unique trigrams
        that end with (y,z). For example, if the frequency of the trigram (x,y,z)
        is 3, this increments the continuation count of (y,z) by one; if the
        frequency of (r,y,z) is 5, this adds one more to the continuation count of (y,z).'''
        # dict entry as > ('word_a','word_b') : 5
        self.bigram_continuation_dict = collections.defaultdict(lambda: 0)

        '''The continuation count of a unigram 'z' is the number of unique bigrams
        that end with 'z'. For example, if the frequency of the bigram ('y','z') is 3,
        this increments the continuation count of 'z' by one; if the frequency of
        ('w','z') is 5, this adds one more to the continuation count of 'z'.'''
        # dict entry as >  'word_z' : 5
        self.unigram_continuation_count = collections.defaultdict(lambda: 0)

        '''The follow-up count of a unigram 'a' is the number of unique bigrams
        that start with 'a'. For example, if the frequency of the bigram ('a','b')
        is 3, this increments the follow-up count of 'a' by one; if the frequency
        of ('a','c') is 5, this adds one more to the follow-up count of 'a'.'''
        # dict entry as >  'word_a' : 5
        self.unigram_follow_up_count = collections.defaultdict(lambda: 0)

        # total number of words, fixed discount
        self.total = 0
        self.d = 0.75
        self.train(corpus)

    def train(self, corpus):
        # count and initialize the dictionaries
        pass

    def score_trigram(self, tri_g):
        score = 0.0
        w1 = tri_g[0]
        w2 = tri_g[1]
        w3 = tri_g[2]
        # use the trigram if it has a frequency > 0
        if self.trigramCounts[(w1,w2,w3)] > 0 and self.bigramCounts[(w1,w2)] > 0:
            score += self.top_level_trigram_prob(*tri_g)
        # otherwise use the bigram (w2,w3) as an approximation
        else:
            if self.bigramCounts[(w2,w3)] > 0 and self.unigramCounts[w2] > 0:
                score = score + self.top_level_bigram_prob(w2,w3)
            ## otherwise use the unigram w3 as an approximation
            else:
                score += math.log(self.pkn_unigram(w3))
        return score

    def top_level_trigram_prob(self, w1, w2, w3):
        score = 0.0
        term1 = max(self.trigramCounts[(w1,w2,w3)] - self.d, 0) / self.bigramCounts[(w1,w2)]
        # back-off weight; note that the equation for total_discount_1 divides by count((a,b)) instead
        alfa = self.d * self.bigram_follow_up_dict[(w1,w2)] / len(self.bigram_follow_up_dict)
        term2 = self.pkn_bigram(w2,w3)
        score += math.log(term1 + alfa * term2)
        return score

    def top_level_bigram_prob(self, w1, w2):
        score = 0.0
        term1 = max(self.bigramCounts[(w1,w2)] - self.d, 0) / self.unigramCounts[w1]
        alfa = self.d * self.unigram_follow_up_count[w1] / self.unigramCounts[w1]
        term2 = self.pkn_unigram(w2)
        score += math.log(term1 + alfa * term2)
        return score

    def pkn_bigram(self, w1, w2):
        return self.pkn_bigram_contuation(w1,w2) + self.pkn_bigram_follow_up(w1) * self.pkn_unigram(w2)

    def pkn_bigram_contuation(self, w1, w2):
        # discounted continuation count of (w1,w2) over the number of distinct bigram keys
        ckn = self.bigram_continuation_dict[(w1,w2)]
        term1 = max(ckn - self.d, 0) / len(self.bigram_continuation_dict)
        return term1

    def pkn_bigram_follow_up(self, w1):
        ckn = self.unigram_follow_up_count[w1]
        alfa = self.d * ckn / len(self.bigramCounts)
        return alfa

    def pkn_unigram(self, w1):
        # continuation of w1 + lambda uniform
        ckn = self.unigram_continuation_count[w1]
        p_cont = float(max(ckn - self.d, 0)) / len(self.bigramCounts) + 1.0 / len(self.unigramCounts)
        return p_cont
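Since train is left as a stub, the following is a minimal sketch of how the four type counts described in the docstrings could be derived from raw frequency dictionaries. The function name collect_type_counts and its return order are hypothetical, not part of the original post. It skips zero-frequency entries on purpose: looking up a missing key in a defaultdict inserts that key with value 0, so dictionaries that are queried during scoring can accumulate such entries.

```python
from collections import defaultdict


def collect_type_counts(trigram_counts, bigram_counts):
    """Derive Kneser-Ney type counts from raw n-gram frequency dicts.

    Hypothetical helper: the name and return order are assumptions.
    """
    bigram_follow_up = defaultdict(int)      # (a, b) -> number of distinct c in (a, b, c)
    bigram_continuation = defaultdict(int)   # (b, c) -> number of distinct a in (a, b, c)
    unigram_follow_up = defaultdict(int)     # a -> number of distinct b in (a, b)
    unigram_continuation = defaultdict(int)  # b -> number of distinct a in (a, b)

    # Each distinct trigram contributes exactly one, whatever its frequency.
    for (a, b, c), n in trigram_counts.items():
        if n > 0:  # ignore zero entries created by earlier defaultdict lookups
            bigram_follow_up[(a, b)] += 1
            bigram_continuation[(b, c)] += 1

    # Same idea one order down, over distinct bigrams.
    for (a, b), n in bigram_counts.items():
        if n > 0:
            unigram_follow_up[a] += 1
            unigram_continuation[b] += 1

    return bigram_follow_up, bigram_continuation, unigram_follow_up, unigram_continuation
```

With the docstring's example, trigrams (a,b,c) with frequency 3 and (a,b,d) with frequency 5 give (a,b) a follow-up count of 2, independent of the frequencies themselves.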

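Best answer

I'll answer your first question.

Below are your equations, labeled for reference (I corrected the typo in (5) and, following your code, added max(,0) to (2) and (6)):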

(1) P((a,b,c)) = P_ML_discounted((a,b,c)) + total_discount_1 * P_KN((b,c))

(2) P_ML_discounted((a,b,c)) = max(count((a,b,c)) - discount, 0)/count((a,b))

(3) total_discount_1 = discount * follow_up_count((a,b))/count((a,b))

(4) P_KN((b,c)) = continuation_count((b,c))/count_unique_trigrams + total_discount_2 * P_KN(c)

(5) total_discount_2 = discount * follow_up_count(b)/count_unique_bigrams

(6) P_KN(c) = max(continuation_count(c) - discount, 0)/count_unique_bigrams + discount * 1/vocabulary_size

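On the correctness of these equations:

(1)~(3): correct.

(4) (5): incorrect. In both equations, the denominator (count_unique_trigrams in (4), count_unique_bigrams in (5)) should be replaced with the count of unique trigrams whose second word is b, that is, the number of unique trigrams of the form (*,b,*).

I see that in your code pkn_bigram_contuation() does discount the continuation_count of ((b,c)), which is correct. However, this is not reflected in your equation (4).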
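Combining the denominator correction with the discount already applied in the code, equations (4) and (5) would read as follows. The notation is borrowed from Chen and Goodman and is not used in the original post: $N_{1+}(\bullet\, b\, c)$ is continuation_count((b,c)), $N_{1+}(b\, \bullet)$ is follow_up_count(b), $N_{1+}(\bullet\, b\, \bullet)$ is the number of unique trigrams of the form (*,b,*), and $D$ is the discount.

```latex
\begin{aligned}
P_{KN}(c \mid b) &= \frac{\max\bigl(N_{1+}(\bullet\, b\, c) - D,\ 0\bigr)}{N_{1+}(\bullet\, b\, \bullet)} + \gamma(b)\, P_{KN}(c) \\
\gamma(b) &= \frac{D \cdot N_{1+}(b\, \bullet)}{N_{1+}(\bullet\, b\, \bullet)}
\end{aligned}
```

Because $\sum_c N_{1+}(\bullet\, b\, c) = N_{1+}(\bullet\, b\, \bullet)$, the discounted mass removed from the first term is exactly what $\gamma(b)$ hands to the unigram distribution, provided every bigram (b,c) also occurs inside some trigram.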

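(6) I assume you are using equation (4.37) from Dan Jurafsky for your implementation. The problem is that the author is not clear about how to compute lambda(epsilon) so that the unigram probabilities are properly normalized.

In fact, the unigram probability does not need to be discounted (see the slide titled "Kneser-Ney details" on page 5 of the linked slides), so (6) can simply be written as

P_KN(c) = continuation_count(c)/count_unique_bigrams.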
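One way to sanity-check the corrected equations is to code them and confirm that the probabilities sum to 1 over the vocabulary for any context. The sketch below is an illustration under stated assumptions, not the asker's class: build_kn_trigram and all internal names are hypothetical. It assumes the input is a list of tokenized sentences, counts (a,b) only where it acts as a trigram context, and for the back-off weight in (5) counts the distinct words following b inside trigrams, which equals follow_up_count(b) when every bigram has a preceding word (for example with start padding). It returns plain probabilities rather than logs, and a word that never ends a bigram gets probability 0 at the unigram level.

```python
from collections import Counter


def build_kn_trigram(sentences, d=0.75):
    """Return a function p(a, b, c) ~ P_KN(c | a, b). Hypothetical sketch."""
    tri = Counter()        # c(a, b, c): raw trigram frequencies
    bigram_types = set()   # distinct bigrams, used at the unigram level
    for s in sentences:
        bigram_types.update(zip(s, s[1:]))
        tri.update(zip(s, s[1:], s[2:]))

    ctx = Counter()         # c(a, b), counted as a trigram context
    follow = Counter()      # distinct c after (a, b)
    cont = Counter()        # distinct a before (b, c)
    mid = Counter()         # unique trigrams of the form (*, b, *)
    for (a, b, c), n in tri.items():
        ctx[(a, b)] += n
        follow[(a, b)] += 1
        cont[(b, c)] += 1
        mid[b] += 1
    mid_follow = Counter(b for b, _ in cont)          # distinct c with (*, b, c) seen
    uni_cont = Counter(c for _, c in bigram_types)    # distinct bigrams ending in c
    n_bigram_types = len(bigram_types)

    def p_uni(c):
        # Equation (6) as simplified in the answer: no discount.
        return uni_cont[c] / n_bigram_types

    def p_bi(b, c):
        # Equations (4) and (5) with the corrected denominator.
        if mid[b] == 0:
            return p_uni(c)
        discounted = max(cont[(b, c)] - d, 0) / mid[b]
        backoff = d * mid_follow[b] / mid[b]
        return discounted + backoff * p_uni(c)

    def p_tri(a, b, c):
        # Equations (1) to (3).
        if ctx[(a, b)] == 0:
            return p_bi(b, c)
        discounted = max(tri[(a, b, c)] - d, 0) / ctx[(a, b)]
        backoff = d * follow[(a, b)] / ctx[(a, b)]
        return discounted + backoff * p_bi(b, c)

    return p_tri


# Usage: probabilities for a fixed context should sum to 1 over the vocabulary.
sentences = [["a", "b", "c", "a", "b", "d"], ["b", "c", "a", "b", "c"]]
p = build_kn_trigram(sentences)
vocab = {w for s in sentences for w in s}
print(sum(p("a", "b", w) for w in vocab))
```

Counter is used instead of defaultdict so that looking up an unseen n-gram does not insert a new key and silently change any length-based denominator.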

For python - Correct implementation of "third order" Kneser-Ney smoothing (for a trigram model), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36477499/
