
python - Understanding TfidfVectorizer output

Reposted. Author: 行者123. Updated: 2023-12-04 08:03:10

I am testing TfidfVectorizer on a simple example, and I can't make sense of the result.

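from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)  # sparse matrix: one row per document

# On scikit-learn >= 1.2 use vect.get_feature_names_out() instead
# (get_feature_names() was removed in that release).
print(vect.get_feature_names())
print(tfidf.shape)
print(tfidf)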
Output:
['apple', 'away', 'blue', 'compare', 'day', 'docs', 'doctor', 'keeps', 'learn', 'like', 'orange', 'prefer', 'scikit']
(5, 13)
(0, 0) 0.5564505207186616
(0, 9) 0.830880748357988
...
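I computed the tf-idf of the first sentence by hand and got different results:
  • The first document ("I'd like an apple") contains only 2 words after stop-word removal (according to the output of vect.get_feature_names(), we keep "like" and "apple").
  • TF("apple", Document_1) = 1/2 = 0.5
  • TF("like", Document_1) = 1/2 = 0.5
  • "apple" appears in 3 documents of the corpus.
  • "like" appears in 1 document of the corpus.
  • IDF("apple") = ln(5/3) = 0.51082
  • IDF("like") = ln(5/1) = 1.60943

So:
  • tfidf("apple") in document 1 = 0.5 * 0.51082 = 0.255 != 0.5564
  • tfidf("like") in document 1 = 0.5 * 1.60943 = 0.804 != 0.8308

What am I missing?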
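
For reference, the hand calculation in the question can be reproduced with a few lines of plain Python. This is only a sketch of the "textbook" convention the question assumes (length-normalized TF, unsmoothed IDF); the names doc1_tokens, doc_freq, and textbook_tfidf are made up for illustration and are not part of scikit-learn.

```python
import math

# Tokens of document 1 that survive stop-word removal (taken from the question).
doc1_tokens = ["like", "apple"]
n_docs = 5
# Number of documents containing each term (taken from the question).
doc_freq = {"apple": 3, "like": 1}

def textbook_tfidf(term):
    """Textbook-style tf-idf: TF divided by document length, IDF = ln(N / df)."""
    tf = doc1_tokens.count(term) / len(doc1_tokens)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

print(round(textbook_tfidf("apple"), 3))  # 0.255
print(round(textbook_tfidf("like"), 3))   # 0.805
```

Neither value matches the 0.5564 and 0.8308 printed by TfidfVectorizer, which is the discrepancy being asked about.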

    Best answer

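    There are several issues with your calculation.
    First, there are multiple conventions for computing TF (see the Wikipedia entry); scikit-learn does not normalize it by document length. From the user guide: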

    [...] the term frequency, the number of times a term occurs in a given document [...]


    So here TF("apple", Document_1) = 1, not 0.5.
    Second, regarding the IDF definition, from the docs:
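
A rough way to see the raw-count convention is to count the terms of the first document directly. The sketch below uses only the standard library: TOKEN_PATTERN is the default token_pattern documented for TfidfVectorizer, while STOP_WORDS is a one-word stand-in (an assumption for illustration) for the much larger "english" stop list.

```python
import re
from collections import Counter

TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # documented default: tokens of 2+ word characters
STOP_WORDS = {"an"}               # assumption: tiny stand-in for stop_words="english"

doc1 = "I'd like an apple"
# Lowercase, tokenize, then drop stop words; "i" and "d" are too short to be tokens.
tokens = [t for t in re.findall(TOKEN_PATTERN, doc1.lower()) if t not in STOP_WORDS]
term_counts = Counter(tokens)

print(term_counts["apple"])  # 1
print(term_counts["like"])   # 1
```

Each surviving term has a term frequency of 1; the count is not divided by the document length of 2.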

    If smooth_idf=True (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.


    So here we have
    IDF("apple") = ln((5+1)/(3+1)) + 1 = 1.4054651081081644
    and therefore
    TF-IDF("apple") = 1 * 1.4054651081081644 = 1.4054651081081644
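
The smoothed IDF formula quoted from the docs can be checked with plain Python. This is a minimal sketch; the helper name smooth_idf is made up here and is not a scikit-learn function, and the document frequencies (3 for "apple", 1 for "like") come from the question.

```python
import math

def smooth_idf(n_docs, df):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, as quoted for smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + df)) + 1

idf_apple = smooth_idf(5, 3)  # "apple" occurs in 3 of the 5 documents
idf_like = smooth_idf(5, 1)   # "like" occurs in 1 of the 5 documents

# With a raw term count of 1, the unnormalized tf-idf equals the idf itself.
print(round(1 * idf_apple, 4))  # 1.4055
print(round(1 * idf_like, 4))   # 2.0986
```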
    Third, with the default setting norm='l2', an extra normalization takes place; again from the docs:

    Normalization is “c” (cosine) when norm='l2', “n” (none) when norm=None.


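    Explicitly removing this extra normalization from your example, i.e.
    vect = TfidfVectorizer(min_df=1, stop_words="english", norm=None)
    gives for 'apple'
    (0, 0)  1.4054651081081644
    i.e. the value already computed by hand.
    For details on how the normalization affects the calculation when norm='l2' (the default setting), see the Tf–idf term weighting section of the user guide; by their own admission: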

    the tf-idfs computed in scikit-learn’s TfidfTransformer and TfidfVectorizer differ slightly from the standard textbook notation
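
To close the loop on the norm='l2' step, the sketch below applies L2 (Euclidean) normalization by hand to the two unnormalized tf-idf values of the first document. It assumes raw term counts of 1 and the smoothed IDF formula quoted earlier; the variable names are illustrative only.

```python
import math

# Unnormalized tf-idf for document 1: tf = 1, idf = ln((1 + 5) / (1 + df)) + 1.
raw = {
    "apple": 1 * (math.log(6 / 4) + 1),  # df("apple") = 3
    "like": 1 * (math.log(6 / 2) + 1),   # df("like") = 1
}

# L2 normalization: divide each entry by the Euclidean length of the row vector.
norm = math.sqrt(sum(v * v for v in raw.values()))
normalized = {term: v / norm for term, v in raw.items()}

print(round(normalized["apple"], 4))  # 0.5565
print(round(normalized["like"], 4))   # 0.8309
```

These agree with the (0, 0) and (0, 9) entries in the output shown in the question.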

    Regarding "python - Understanding TfidfVectorizer output", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66350670/
