
python - Is this the correct tfidf?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 10:53:30

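I am trying to get tfidf values from a set of documents, but I do not think the results are correct, or I may be doing something wrong. Please advise. The code and output are below: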

from sklearn.feature_extraction.text import TfidfVectorizer

books = ["Hello there this is first book to be read by wordcount script.", "This is second book to be read by wordcount script. It has some additionl information.", "just third book."]
vectorizer = TfidfVectorizer()
response = vectorizer.fit_transform(books)
# On scikit-learn 1.2 and later, use get_feature_names_out() instead
feature_names = vectorizer.get_feature_names()
# Columns come from every document, but only row 0 (the first document) is printed
for col in response.nonzero()[1]:
    print(feature_names[col], '-', response[0, col])

Update 1 (as suggested by juanpa.arrivillaga):

vectorizer = TfidfVectorizer(smooth_idf=False)

Output:

script - 0.269290317245
wordcount - 0.269290317245
by - 0.269290317245
read - 0.269290317245
be - 0.269290317245
to - 0.269290317245
book - 0.209127954024
first - 0.354084405732
is - 0.269290317245
this - 0.269290317245
there - 0.354084405732
hello - 0.354084405732
information - 0.0
...

Output after Update 1:

script - 0.256536760895
wordcount - 0.256536760895
by - 0.256536760895
read - 0.256536760895
be - 0.256536760895
to - 0.256536760895
book - 0.182528018244
first - 0.383055542114
is - 0.256536760895
this - 0.256536760895
there - 0.383055542114
hello - 0.383055542114
information - 0.0
...

As I understand it, tfidf = tf * idf. Here is how I calculated it by hand:

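Document 1: "Hello there this is first book to be read by wordcount script."
Document 2: "This is second book to be read by wordcount script. It has some additionl information."
Document 3: "just third book."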

Tfidf for "hello":

tf= 1/12(total terms in document 1)= 0.08333333333
idf= log(3(total documents)/1(no. of document with term in it))= 0.47712125472
0.08333333333*0.47712125472= 0.03976008865

This differs from the code output (hello - 0.354084405732).

Manual calculation after Update 1:

tf = 1
idf= log(nd/df) +1 = log (3/1) +1= 0.47712125472 + 1= 1.47712
tfidf = tf*idf = 1* 1.47712= 1.47712

(This differs from the code output "hello - 0.383055542114" obtained after changing the idf smoothing setting.)

Any help in understanding what is going on would be much appreciated.

Best answer

Here is the output with neither smoothing nor normalization:

In [2]: from sklearn.feature_extraction.text import TfidfVectorizer
   ...: books = ["Hello there this is first book to be read by wordcount script.", "This is second book to be read by wordcount script. It has some additionl information.", "just third book."]
   ...: vectorizer = TfidfVectorizer(smooth_idf=False, norm=None)
   ...: response = vectorizer.fit_transform(books)
   ...: feature_names = vectorizer.get_feature_names()
   ...: for col in response.nonzero()[1]:
   ...:     print(feature_names[col], '-', response[0, col])
   ...:
hello - 2.09861228867
there - 2.09861228867
this - 1.40546510811
is - 1.40546510811
first - 2.09861228867
book - 1.0
to - 1.40546510811
be - 1.40546510811
read - 1.40546510811
by - 1.40546510811
wordcount - 1.40546510811
script - 1.40546510811
this - 1.40546510811
is - 1.40546510811
book - 1.0
to - 1.40546510811
be - 1.40546510811
read - 1.40546510811
by - 1.40546510811
wordcount - 1.40546510811
script - 1.40546510811
second - 0.0
it - 0.0
has - 0.0
some - 0.0
additionl - 0.0
information - 0.0
book - 1.0
just - 0.0
third - 0.0

Consider the result for "hello":

hello - 2.09861228867

Now, by hand:

In [3]: import math

In [4]: tf = 1

In [5]: idf = math.log(3/1) + 1

In [6]: tf*idf
Out[6]: 2.09861228866811

The problem with your manual calculation is that you used log base 10, but you need the natural logarithm.
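To make the difference concrete, here is a small sketch that compares the two logarithm bases for "hello" (3 documents, the term appears in 1). The variable names are illustrative only. It assumes the unsmoothed form idf = ln(n/df) + 1 and a raw term count for tf, which is what the session above suggests.

```python
import math

n_docs = 3   # total number of documents
df = 1       # number of documents containing "hello"
tf = 1       # raw count of "hello" in document 1 (not divided by document length)

idf_log10 = math.log10(n_docs / df)   # base 10, as in the question's manual calculation
idf_ln = math.log(n_docs / df)        # natural logarithm, as used in the answer

# Unsmoothed, unnormalized weight: tf * (ln(n/df) + 1)
tfidf_hello = tf * (idf_ln + 1)

print("log10(3) =", idf_log10)
print("ln(3)    =", idf_ln)
print("tfidf    =", tfidf_hello)
```

The last value should agree with the "hello - 2.09861228867" line printed with smooth_idf=False and norm=None.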

If you still want to work through the smoothing and normalization steps yourself, this should put you on the right track.
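As a rough sketch of those remaining steps, the following pure-Python code tries to reproduce the numbers from the question by hand. It is not the library's implementation. The helpers tokenize and manual_tfidf are hypothetical names, and the code assumes lowercasing, tokens of two or more word characters, smoothed idf = ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector.

```python
import math
import re
from collections import Counter

books = [
    "Hello there this is first book to be read by wordcount script.",
    "This is second book to be read by wordcount script. It has some additionl information.",
    "just third book.",
]

def tokenize(text):
    # Assumption: lowercase, keep tokens with at least two word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def manual_tfidf(docs, doc_index, smooth_idf=True, l2_normalize=True):
    """Hypothetical helper: tf-idf weights for one document, computed by hand."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Term frequency: raw counts within the chosen document
    tf = Counter(tokenized[doc_index])
    weights = {}
    for term, count in tf.items():
        if smooth_idf:
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
        else:
            idf = math.log(n_docs / df[term]) + 1
        weights[term] = count * idf
    if l2_normalize:
        # Divide by the Euclidean length of the document vector
        length = math.sqrt(sum(w * w for w in weights.values()))
        weights = {term: w / length for term, w in weights.items()}
    return weights

smoothed = manual_tfidf(books, 0, smooth_idf=True)
unsmoothed = manual_tfidf(books, 0, smooth_idf=False)
raw = manual_tfidf(books, 0, smooth_idf=False, l2_normalize=False)

for label, result in [("smoothed", smoothed), ("unsmoothed", unsmoothed), ("raw", raw)]:
    print(label, "hello =", result["hello"], "book =", result["book"])
```

Under these assumptions, the smoothed and unsmoothed values for "hello" should line up with the 0.354084405732 and 0.383055542114 reported in the question, and the raw value with the 2.09861228867 shown above.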

Regarding "python - Is this the correct tfidf?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45680421/
