
Python - Sentiment analysis using pointwise mutual information


from __future__ import division
import urllib
import json
from math import log


def hits(word1, word2=""):
    query = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=%s"
    if word2 == "":
        results = urllib.urlopen(query % word1)
    else:
        results = urllib.urlopen(query % word1+" "+"AROUND(10)"+" "+word2)
    json_res = json.loads(results.read())
    google_hits = int(json_res['responseData']['cursor']['estimatedResultCount'])
    return google_hits


def so(phrase):
    num = hits(phrase, "excellent")
    #print num
    den = hits(phrase, "poor")
    #print den
    ratio = num / den
    #print ratio
    sop = log(ratio)
    return sop

print so("ugly product")

I need this code to calculate the pointwise mutual information, which can be used to classify reviews as positive or negative. Basically I am using the technique specified in Turney (2002): http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf as an example of an unsupervised classification method for sentiment analysis.

As explained in the paper, the semantic orientation of a phrase is negative if the phrase is more strongly associated with the word "poor", and positive if it is more strongly associated with the word "excellent".
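For reference, the estimate in Turney (2002) does more than take the raw ratio of co-occurrence hits: SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor"), which the paper estimates from hit counts as log2( hits(phrase NEAR "excellent") * hits("poor") / ( hits(phrase NEAR "poor") * hits("excellent") ) ). A minimal sketch of that estimate (the four hit counts are assumed to come from a search engine, as in the code above):

from math import log

def turney_so(hits_near_excellent, hits_near_poor, hits_excellent, hits_poor):
    # SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor"),
    # estimated from search hit counts per Turney (2002).
    return log(float(hits_near_excellent * hits_poor) /
               (hits_near_poor * hits_excellent), 2)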

The code above computes the SO of a phrase. I use Google to count hits and compute the SO, since AltaVista no longer exists.

The values computed are very erratic and do not follow any particular pattern. For example, SO("ugly product") turns out to be 2.85462098541 while SO("beautiful product") is 1.71395061117, whereas the former is expected to be negative and the latter positive.

Is there something wrong with the code? Is there an easier way to compute the SO of a phrase (using PMI) with any Python library, say NLTK? I tried NLTK but was not able to find any explicit method that computes the PMI.

Best Answer

Generally speaking, calculating PMI is tricky because the formula changes depending on the size of the ngram you take into consideration:

Mathematically, for bigrams, you can simply consider:

log(p(a,b) / ( p(a) * p(b) ))

Programmatically, assuming you have already calculated all the unigram and bigram frequencies in your corpus, you can do this:

import math

def pmi(word1, word2, unigram_freq, bigram_freq):
    prob_word1 = unigram_freq[word1] / float(sum(unigram_freq.values()))
    prob_word2 = unigram_freq[word2] / float(sum(unigram_freq.values()))
    prob_word1_word2 = bigram_freq[" ".join([word1, word2])] / float(sum(bigram_freq.values()))
    return math.log(prob_word1_word2 / float(prob_word1 * prob_word2), 2)
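On the NLTK part of the question: NLTK does compute PMI, but through its collocation scoring API rather than as a standalone function. A minimal sketch, assuming `tokens` is a flat list of words from your corpus:

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = "this is a foo bar sentence and this is another foo bar sentence".split()
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
# score_ngrams returns ((word1, word2), pmi) pairs, sorted by descending PMI.
print(finder.score_ngrams(bigram_measures.pmi))

If you only want the top-scoring pairs, finder.nbest(bigram_measures.pmi, 10) returns them directly.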

Here is a code snippet from an MWE library, though it is in its pre-development stage (https://github.com/alvations/Terminator/blob/master/mwe.py). Note that it is built for parallel MWE extraction, so here is how you can "hack" it to extract monolingual MWEs:

$ wget https://dl.dropboxusercontent.com/u/45771499/mwe.py
$ printf "This is a foo bar sentence .\nI need multi-word expression from this text file.\nThe text file is messed up , I know you foo bar multi-word expression thingy .\n More foo bar is needed , so that the text file is populated with some sort of foo bar bigrams to extract the multi-word expression ." > src.txt
$ printf "" > trg.txt
$ python
>>> import codecs
>>> from mwe import load_ngramfreq, extract_mwe

>>> # Calculates the unigram and bigram counts.
>>> # More superfluously, "Training a bigram 'language model'."
>>> unigram, bigram, _ , _ = load_ngramfreq('src.txt','trg.txt')

>>> sent = "This is another foo bar sentence not in the training corpus ."

>>> for threshold in range(-2, 4):
... print threshold, [mwe for mwe in extract_mwe(sent.strip().lower(), unigram, bigram, threshold)]

[out]:

-2 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .']
-1 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .']
0 ['this is', 'foo bar', 'bar sentence']
1 ['this is', 'foo bar', 'bar sentence']
2 ['this is', 'foo bar', 'bar sentence']
3 ['foo bar', 'bar sentence']
4 []
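The threshold being swept above is a PMI cutoff: raising it keeps only bigrams whose words co-occur much more often than chance would predict. A rough monolingual re-implementation of that idea, reusing the pmi function from earlier (this is a guess at the spirit of extract_mwe, not its actual code):

def extract_mwe_simple(sentence, unigram_freq, bigram_freq, threshold):
    # Yield each adjacent word pair in the sentence whose PMI meets the threshold.
    tokens = sentence.split()
    for w1, w2 in zip(tokens, tokens[1:]):
        key = " ".join([w1, w2])
        if key in bigram_freq and pmi(w1, w2, unigram_freq, bigram_freq) >= threshold:
            yield key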

For more details, I found this thesis a quick and easy introduction to MWE extraction: "Extending the Log Likelihood Measure to Improve Collocation Identification", see http://goo.gl/5ebTJJ

On Python - sentiment analysis using pointwise mutual information, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/22118350/
