python - How do I fix my Naive Bayes method returning extremely small conditional probabilities?

Reposted by: 塔克拉玛干 · Updated: 2023-11-03 03:01:22

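I am trying to use Naive Bayes to compute the probability that an email is spam. I have a Document class that creates documents (fed in from a website) and another class that trains on and classifies documents. My train function counts all unique terms across all documents, all documents in the spam class, and all documents in the ham class, and computes the prior probabilities (one for spam, one for ham). I then use the following formula to store each term's conditional probability in a dictionary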

Tct = number of times the term occurs in the given class
Tct' = # of terms in the given class
B' = # of unique terms in all documents

classes = spam or ham
spam = spam, ham = not spam
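
Going by the definitions above and the expression used in `train`, the formula appears to be the add-one (Laplace) smoothed estimate of a term's conditional probability. It is written out here under that assumption, with the sum over t' standing for the total number of tokens in class c:

```latex
\[
P(t \mid c) = \frac{T_{ct} + 1}{\sum_{t'} T_{ct'} + B'}
\]
```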

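The problem is that when I use this formula in my code, it gives me very small conditional probability scores, such as 2.461114392596968e-05. I am fairly sure this is because the values of Tct are very small (like 5 or 8) compared with the denominator values Tct' (64878 for ham and 308930 for spam) and B' (16386). I can't figure out how to get the condprob scores down to something like .00034155, since I can only assume my condprob scores are not supposed to be as exponentially small as they are. Am I doing something wrong in my calculation? Are the values actually supposed to be this small?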
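
A quick arithmetic check with the counts quoted above suggests the values are expected rather than wrong. The sketch below assumes a term that occurs exactly once in ham (`t_ct = 1` is an assumption; the counts of the printed terms are not given) and plugs in the ham token total and vocabulary size from the question.

```python
# Counts quoted in the question.
ham_tokens = 64878   # total number of tokens in the ham class
vocab_size = 16386   # B', unique terms across all documents

# Assumed count, for illustration: a term seen once in ham.
t_ct = 1
p = (t_ct + 1) / (ham_tokens + vocab_size)
print(p)

# The smoothed probabilities of all vocabulary terms in one class sum
# to 1, so their average over 16386 terms is 1 / B'.
average = 1 / vocab_size
print(average)
```

Under that assumption `p` comes out at roughly 2.46e-05, in line with the figure reported above, and the average of about 6.1e-05 shows that most per-term probabilities have to be this small.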
If it helps, my goal is to score a set of test documents and get results like 327.82, 758.80, or 138.66
using this formula

However, with my small condprob values I only get negative numbers.
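
Negative scores are what a log-space score produces, assuming the score is the log prior plus the sum of log conditional probabilities, as in `classify`: every probability is below 1, so every logarithm is negative. The sketch below uses made-up probabilities (hypothetical values, as is the helper name `log_score`) to show that the comparison still works, because the class with the larger, less negative, score wins.

```python
import math

def log_score(prior, term_probs):
    """Log prior plus the sum of log conditional probabilities."""
    return math.log(prior) + sum(math.log(p) for p in term_probs)

# Hypothetical values, for illustration only.
ham_score = log_score(0.8, [2.5e-05, 1.0e-04, 3.0e-05])
spam_score = log_score(0.2, [4.0e-04, 2.0e-03, 6.0e-04])

print(ham_score, spam_score)   # both are negative
label = 'spam' if spam_score > ham_score else 'ham'
print(label)
```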

Code

-Creating documents

class Document(object):
    """
    The instance variables are:
    filename....The path of the file for this document.
    label.......The true class label ('spam' or 'ham'), determined by whether the filename contains the string 'spmsg'
    tokens......A list of token strings.
    """

    def __init__(self, filename=None, label=None, tokens=None):
        """ Initialize a document either from a file, in which case the label
        comes from the file name, or from specified label and tokens, but not
        both.
        """
        if label: # specify from label/tokens, for testing.
            self.label = label
            self.tokens = tokens
        else: # specify from file.
            self.filename = filename
            self.label = 'spam' if 'spmsg' in filename else 'ham'
            self.tokenize()

    def tokenize(self):
        # Read the whole file and split on whitespace; the with-block
        # closes the file handle afterwards.
        with open(self.filename) as f:
            self.tokens = f.read().split()

-Naive Bayes

import math
from collections import Counter


class NaiveBayes(object):

    def train(self, documents):
        """
        Given a list of labeled Document objects, compute the class priors and
        word conditional probabilities, following Figure 13.2 of your
        book. Store these as instance variables, to be used by the classify
        method subsequently.
        Params:
          documents...A list of training Documents.
        Returns:
          Nothing.
        """
        # Token counts per class; a Counter returns 0 for unseen terms and
        # avoids the repeated list scans of list.count().
        counts = {'ham': Counter(), 'spam': Counter()}
        n_docs = Counter()  # number of training documents per class
        for doc in documents:
            counts[doc.label].update(doc.tokens)
            n_docs[doc.label] += 1

        # Vocabulary: every unique term seen in any training document.
        self.V = set(counts['ham']) | set(counts['spam'])
        B = len(self.V)
        N = len(documents)
        self.prior = {label: n_docs[label] / N for label in ('ham', 'spam')}

        # condprob[0] is ham, condprob[1] is spam. Every vocabulary term
        # gets an add-one smoothed estimate in BOTH classes, including
        # terms that never occur in one of them.
        self.condprob = []
        for label in ('ham', 'spam'):
            total = sum(counts[label].values())  # all tokens in the class
            self.condprob.append(
                {t: (counts[label][t] + 1) / (total + B) for t in self.V})

    def classify(self, documents):
        """ Return a list of strings, either 'spam' or 'ham', for each document.
        Params:
          documents....A list of Document objects to be classified.
        Returns:
          A list of label strings corresponding to the predictions for each document.
        """
        ans = []
        for doc in documents:
            # Work in log space so that the product of many small
            # probabilities does not underflow.
            hscore = math.log(self.prior['ham'])
            sscore = math.log(self.prior['spam'])
            # Every token occurrence contributes once (multinomial model);
            # terms never seen in training are skipped for both classes.
            for t in doc.tokens:
                if t in self.condprob[0] and t in self.condprob[1]:
                    hscore += math.log(self.condprob[0][t])
                    sscore += math.log(self.condprob[1][t])
            print("THIS IS SSCORE", sscore)
            print("THIS IS HSCORE", hscore)
            # The higher (less negative) score wins; ties go to ham.
            ans.append('spam' if sscore > hscore else 'ham')

        return ans
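
To sanity-check the counting logic independently of the Ling-Spam data, the same add-one estimate can be computed on a toy corpus. This is a minimal sketch: `train_nb`, `classify_nb` and the four tiny documents are hypothetical and not part of the original code.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (label, tokens) pairs. Returns (prior, condprob)."""
    counts = {}
    n_docs = Counter()
    for label, tokens in docs:
        counts.setdefault(label, Counter()).update(tokens)
        n_docs[label] += 1
    vocab = set().union(*counts.values())
    prior = {c: n_docs[c] / len(docs) for c in counts}
    condprob = {}
    for c, tc in counts.items():
        total = sum(tc.values())  # all tokens in class c
        condprob[c] = {t: (tc[t] + 1) / (total + len(vocab)) for t in vocab}
    return prior, condprob

def classify_nb(prior, condprob, tokens):
    """Return the class with the highest log score."""
    scores = {}
    for c in prior:
        scores[c] = math.log(prior[c]) + sum(
            math.log(condprob[c][t]) for t in tokens if t in condprob[c])
    return max(scores, key=scores.get)

# Hypothetical toy corpus.
docs = [
    ('spam', ['buy', 'cheap', 'pills']),
    ('spam', ['cheap', 'pills', 'now']),
    ('ham', ['meeting', 'agenda', 'now']),
    ('ham', ['lunch', 'meeting']),
]
prior, condprob = train_nb(docs)
first = classify_nb(prior, condprob, ['cheap', 'pills'])
second = classify_nb(prior, condprob, ['meeting', 'lunch'])
print(first, second)
```

Because every vocabulary term is smoothed in every class, the conditional probabilities of each class sum to 1, which is a convenient check on a real training run as well.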

-Testing

import glob
import os

if not os.path.exists('train'):  # download data
    from urllib.request import urlretrieve
    import tarfile

    urlretrieve('http://cs.iit.edu/~culotta/cs429/lingspam.tgz', 'lingspam.tgz')
    tar = tarfile.open('lingspam.tgz')
    tar.extractall()
    tar.close()
train_docs = [Document(filename=f) for f in glob.glob("train/*.txt")]
test_docs = [Document(filename=f) for f in glob.glob("test/*.txt")]
test = train_docs

nb = NaiveBayes()
nb.train(train_docs[1500:])
#uncomment when testing classify()
#predictions = nb.classify(test_docs[:200])
#print("PREDICTIONS",predictions)

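The end goal is to classify documents as spam or ham, but I want to sort out the conditional probability problem first.

Question
Are the conditional probability values supposed to be this small? If so, why am I getting strange scores from classify? If not, how do I fix my code so that it gives the correct condprob values?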

Values
The condprob values I currently get look like this:

'传统': 2.461114392596968e-05, '菲尔莫': 2.461114392596968e-05, '796': 2.461114392596968e-05, '赞': 2.461114392596968e-05

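condprob is a list containing two dictionaries, the first for ham and the second for spam. Each dictionary maps a term to its conditional probability. I want "normal" small values such as .00031235 rather than 3.1235e-05. The reason is that when I run the condprob values through the classify method with some test documents, I get scores like

THIS IS HSCORE -2634.5292392650663, THIS IS SSCORE -1707.983339196181

when they should look like

THIS IS HSCORE 327.82, THIS IS SSCORE 758.80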
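
Scientific notation is only a display format, so `3.1235e-05` and `0.000031235` are the same number, while `.00031235` is ten times larger. A small sketch of printing a probability in fixed-point notation, in case that is easier to read:

```python
p = 2.461114392596968e-05

print(p)                  # default repr uses scientific notation here
fixed = f"{p:.10f}"       # fixed-point rendering of the same value
print(fixed)
same = (3.1235e-05 == 0.000031235)
print(same)
```

Only the printed form changes; the stored value, and therefore every score computed from it, stays the same.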

Runtime

~1 min 30 sec

Best answer

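(You appear to be working with log probabilities, which is very sensible. I am going to write most of what follows in terms of raw probabilities, which you can get by taking the exponential of the log probabilities, because it makes the algebra easier, even though in practice you would probably hit numeric underflow if you did not use logs.)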
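
The underflow risk is easy to demonstrate. The sketch below multiplies 500 copies of a probability of the size reported in the question (the document length of 500 is an arbitrary assumption) and compares that with summing logs.

```python
import math

p = 2.461114392596968e-05   # a conditional probability from the question
n_terms = 500               # assumed document length, for illustration

raw = 1.0
for _ in range(n_terms):
    raw *= p                # underflows to 0.0 well before the loop ends

log_total = n_terms * math.log(p)   # stays a finite, negative number

print(raw)
print(log_total)
```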

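As far as I can tell from your code, you start with the prior probabilities p(Ham) and p(Spam) and then use probabilities estimated from previous data to work out p(Ham) * p(Observed data | Ham) and p(Spam) * p(Observed data | Spam).

Bayes' theorem rearranges p(Obs|Spam) = p(Obs & Spam)/p(Spam) = p(Obs) p(Spam|Obs)/p(Spam) to give P(Spam|Obs) = p(Spam) p(Obs|Spam)/p(Obs), and you seem to have calculated p(Spam) p(Obs|Spam) = p(Obs & Spam) but not divided by p(Obs). Since there are only two possibilities, Ham and Spam, the easiest thing to do is probably to note that p(Obs) = p(Obs & Spam) + p(Obs & Ham), and so just divide each of your two calculated values by their sum, essentially scaling the values so that they do indeed sum to 1.0.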
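
In raw-probability terms the normalisation can be sketched as follows. The function name `normalise` and the joint values are hypothetical, chosen for illustration; only the division by the sum comes from the reasoning above.

```python
def normalise(joint_spam, joint_ham):
    """Turn p(Obs & Spam) and p(Obs & Ham) into posteriors."""
    p_obs = joint_spam + joint_ham   # p(Obs), since only two classes exist
    return joint_spam / p_obs, joint_ham / p_obs

# Hypothetical joint probabilities.
p_spam, p_ham = normalise(3.0e-09, 1.0e-09)
print(p_spam, p_ham)
```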

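This scaling is trickier if you start from log probabilities lA and lB. To scale these, I would first bring them into range by shifting both by the same rough value in log space, computing the maximum once and then subtracting it from each:

m = max(lA, lB)

lA = lA - m

lB = lB - m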

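Now at least the larger of the two will not overflow. The smaller one may still underflow, but I would rather deal with underflow than overflow. Now turn them into not-quite-scaled probabilities:

pA = exp(lA)

pB = exp(lB)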

and scale them properly so that they sum to one

truePA = pA/(pA + pB)

truePB = pB/(pA + pB)
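
Putting the steps together, assuming `lA` and `lB` are the two log scores. The function name `posteriors_from_logs` is hypothetical, and the maximum is computed once up front so that both scores are shifted by the same amount. The inputs are the HSCORE and SSCORE values quoted in the question.

```python
import math

def posteriors_from_logs(lA, lB):
    """Convert two log joint scores into probabilities that sum to 1."""
    m = max(lA, lB)          # shift so the larger score becomes 0
    pA = math.exp(lA - m)    # at most 1.0, so no overflow
    pB = math.exp(lB - m)    # may underflow to 0.0, which is harmless
    total = pA + pB          # at least 1.0, so the division is safe
    return pA / total, pB / total

p_ham, p_spam = posteriors_from_logs(-2634.5292392650663, -1707.983339196181)
print(p_ham, p_spam)
```

With a gap of more than 900 between the two log scores, the smaller term underflows and the result is effectively certain for the higher-scoring class.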

Regarding "python - How do I fix my Naive Bayes method returning extremely small conditional probabilities?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37093822/