
python - Calculating pointwise mutual information (PMI) for a text document using Python

Reposted. Author: 行者123. Updated: 2023-11-30 23:31:18

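My goal is to compute the PMI of the following text: a = "When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him"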

Formula: PMI-IR(w1, w2) = log2( p(w1 & w2) / (p(w1) * p(w2)) ), where p = probability and w = word

My attempt:
>>> from math import log
>>> from nltk import bigrams
>>> import collections
>>> a1 = a.split()
>>> a2 = collections.Counter(a1)
>>> a3 = collections.Counter(bigrams(a1))
>>> a4 = sum([a2[x] for x in a2])
>>> a5 = sum([a3[x] for x in a3])
>>> a6 = {x: float(a2[x]) / a4 for x in a2}  # word probabilities (w1 and w2)
>>> a7 = {x: float(a3[x]) / a5 for x in a3}  # joint probabilities (w1 & w2)
>>> u = []
>>> for x in a6:
...     k = {x: round(log(a7[b] / (a6[x] * a6[y]), 2), 4) for b in a7 for y in a6 if x and y in b}
...     u.append(k)
...
>>> u
[{'and': 4.3959}, {'on': 4.3959}, {'his': 4.3959}, {'When': 4.3959}, .....]

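The result I get seems wrong, for two reasons: (1) I want one big dictionary, but I get many small dictionaries, one per item; (2) the probabilities may not be plugged into the equation correctly, since this is my first attempt at this kind of problem.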

Any suggestions? Thanks.

Best answer

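I'm not an NLP expert, but your equation looks fine. The implementation has a subtle bug. Consider the following deep dive into operator precedence: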

"""Precedence deep dive"""
'hi' and True    # True: a non-empty string is truthy, whatever its contents
'hi' and False   # False
b = ('hi', 'bob')
'hi' and 'bob' in b    # True, BUT not because 'hi' is in b!
'hia' and 'bob' in b   # True, because it parses as 'hia' and ('bob' in b)
result2 = 'bob' in b
'hia' and result2      # True; shows the precedence more clearly
'hi' and 'boba' in b   # False

# Each string needs its own membership check against b
'hi' in b and 'bob' in b     # True
'hia' in b and 'bob' in b    # False
'hi' in b and 'boba' in b    # False: same as before, but now each string is checked separately
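To see what this does to the filter inside the question's comprehension, here is a minimal sketch on a made-up six-word sentence (the sentence and the names `loose` and `strict` are illustrative assumptions, not part of the original post). It uses `zip` to build adjacent word pairs so that it runs without nltk.

```python
# Hypothetical toy sentence, chosen only to keep the counts small.
tokens = "the victim supporters left the court".split()
pairs = list(zip(tokens, tokens[1:]))   # adjacent word pairs (bigrams)
vocab = sorted(set(tokens))

x = "court"

# Buggy filter: parsed as `x and (y in b)`, so x is only tested for truthiness.
loose = [(b, y) for b in pairs for y in vocab if x and y in b]

# Fixed filter: x and y are each checked for membership in the bigram b.
strict = [(b, y) for b in pairs for y in vocab if x in b and y in b]

print(len(loose))   # 10: every bigram passes, whether or not it contains "court"
print(len(strict))  # 2: only ('the', 'court') passes, once for each of its words
```

With the buggy filter, bigrams that never involve `x` still contribute to the value computed for `x`.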

Note the difference between the results u and v below. u is computed with the wrong precedence, v with the correct one.

import collections
import math

from nltk import bigrams

a = """When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him. if we have more data then it will be more interesting because we have more chance to repeat bigrams. After some of the victim supporters turned their backs then a subset of the victim supporters turned around and left the court."""

a1 = a.split()
a2 = collections.Counter(a1)

a3 = collections.Counter(bigrams(a1))
a4 = sum([a2[x] for x in a2])
a5 = sum([a3[x] for x in a3])
a6 = {x: float(a2[x]) / a4 for x in a2}  # word probabilities (w1 and w2)
a7 = {x: float(a3[x]) / a5 for x in a3}  # joint probabilities (w1 & w2)
u = {}
v = {}
for x in a6:
    # Wrong precedence: the condition parses as `x and (y in b)`
    k = {x: round(math.log(a7[b] / (a6[x] * a6[y]), 2), 4) for b in a7 for y in a6 if x and y in b}
    u[x] = k[x]
    # Correct: x and y are each checked against the bigram b
    k = {x: round(math.log(a7[b] / (a6[x] * a6[y]), 2), 4) for b in a7 for y in a6 if x in b and y in b}
    v[x] = k[x]

print(u['the'])
print(v['the'])
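The loop above still stores a single value per word, because every entry in each comprehension shares the key x and later matches overwrite earlier ones. If what you want is the "one big dictionary" from the question, one possible approach is to key the result by bigram instead. The sketch below is not from the original answer: the function name `bigram_pmi` is hypothetical, it swaps `nltk.bigrams` for `zip` over adjacent tokens so that it is self-contained, and it keeps the same whitespace tokenization and probability estimates as the code above.

```python
import math
from collections import Counter


def bigram_pmi(text):
    """Return {(w1, w2): PMI} for every adjacent word pair in text.

    Hypothetical helper: probabilities are plain relative frequencies,
    as in the question, with no smoothing.
    """
    tokens = text.split()
    pairs = list(zip(tokens, tokens[1:]))   # adjacent word pairs
    word_counts = Counter(tokens)
    pair_counts = Counter(pairs)
    n_words = len(tokens)
    n_pairs = len(pairs)

    pmi = {}
    for (w1, w2), count in pair_counts.items():
        p_joint = count / n_pairs            # p(w1 & w2)
        p_w1 = word_counts[w1] / n_words     # p(w1)
        p_w2 = word_counts[w2] / n_words     # p(w2)
        pmi[(w1, w2)] = round(math.log2(p_joint / (p_w1 * p_w2)), 4)
    return pmi


scores = bigram_pmi("a b a b")
print(scores[("a", "b")])
print(scores[("b", "a")])
```

Because each bigram names both of its words, no filter condition is needed, so the precedence problem does not arise. On a text this short most bigrams occur once, so the scores mostly reflect how rare the individual words are.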

For more on python - calculating pointwise mutual information for a text document using Python, see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/20018730/
