
algorithm - Basic concepts: Naive Bayes algorithm for classification


I think I more or less understand Naive Bayes, but I have a few questions about its implementation for simple binary text classification.

Suppose each document D_i is some subset of the vocabulary x_1, x_2, ... x_n.

There are two classes c_i that any document can fall into, and for some input document D I want to compute P(c_i|D), which is proportional to P(D|c_i)P(c_i).

I have three questions:

  1. Should P(c_i) be #docs in c_i / #total docs, or #words in c_i / #total words?
  2. Should P(x_j|c_i) be #times x_j appears in D / #times x_j appears in c_i?
  3. Suppose an x_j is not present in the training set; do I give it a probability of 1 so that it doesn't change the calculation?

For example, suppose I have a training set:

training = [("hello world", "good"),
            ("bye world", "bad")]

So the classes would have:

good_class = {"hello": 1, "world": 1}
bad_class = {"bye": 1, "world": 1}
all = {"hello": 1, "world": 2, "bye": 1}

So now, if I want to compute the probability that a test string is good:

test1 = ["hello", "again"]
p_good = sum(good_class.values())/sum(all.values())
p_hello_good = good_class["hello"]/all["hello"]
p_again_good = 1 # because "again" doesn't exist in our training set

p_test1_good = p_good * p_hello_good * p_again_good

Best answer

Since this question is too broad, I can only answer it in a limited way:

1st: Should P(c_i) be #docs in c_i / #total docs, or #words in c_i / #total words?

P(c_i) = #docs in c_i / #total docs
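
As a minimal sketch of that document-count prior, reusing the training set from the question (the prior helper below is illustrative, not from any library):

# Prior from document counts: P(c_i) = #docs in c_i / #total docs
training = [("hello world", "good"), ("bye world", "bad")]

def prior(label, docs):
    return sum(1 for _, c in docs if c == label) / len(docs)

print(prior("good", training))  # 1 good doc out of 2 -> 0.5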

2nd: Should P(x_j|c_i) be #times x_j appears in D / #times x_j appears in c_i?
As @larsmans noted:

It is the number of occurrences of the word across documents of that class,
divided by the total number of words in that class in the whole dataset.
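
A small sketch of that per-class estimate on the question's toy counts (the likelihood helper is illustrative); note it divides by the class word total, not by all["hello"] as in the question:

# P(x_j|c_i) = #times x_j appears in c_i / #total words in c_i
good_class = {"hello": 1, "world": 1}

def likelihood(word, class_counts):
    return class_counts.get(word, 0) / sum(class_counts.values())

print(likelihood("hello", good_class))  # 1/2 = 0.5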

3rd: Suppose an x_j is not present in the training set; do I give it a probability of 1 so that it doesn't change the calculation?

For that we have the Laplace correction, or additive smoothing. It is applied as

p(x_j|c_i) = (#times x_j appears in c_i + 1) / (#total words in c_i + |V|)

where |V| is the vocabulary size, which neutralizes the effect of features that never occur in the training set.
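
Putting the three pieces together, here is a minimal sketch of a Laplace-smoothed scorer for the question's example (the score function and variable names are illustrative, not from any library):

from collections import Counter

training = [("hello world", "good"), ("bye world", "bad")]

# Word counts per class, plus the overall vocabulary V
counts = {}
vocab = set()
for text, label in training:
    words = text.split()
    counts.setdefault(label, Counter()).update(words)
    vocab.update(words)

def score(words, label):
    # Unnormalized P(c_i) * product of P(x_j|c_i), with add-one smoothing
    p = sum(1 for _, c in training if c == label) / len(training)  # prior
    total = sum(counts[label].values())
    for w in words:
        # Laplace correction: (count in class + 1) / (total words in class + |V|)
        p *= (counts[label][w] + 1) / (total + len(vocab))
    return p

test1 = ["hello", "again"]
print(score(test1, "good"))  # 0.5 * (2/5) * (1/5) = 0.04
print(score(test1, "bad"))   # 0.5 * (1/5) * (1/5) = 0.02

With smoothing, the unseen word "again" contributes 1/(2+3) to each class rather than an arbitrary factor of 1, so it no longer needs to be special-cased.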

Regarding algorithm - Basic concepts: Naive Bayes algorithm for classification, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/25959368/
