python - Implementing a TF-IDF vectorizer from scratch

Reposted by: 行者123. Updated: 2023-11-30 09:41:46


I am trying to implement a tf-idf vectorizer from scratch in Python. I computed the IDF values, but they do not match the IDF values computed by sklearn's TfidfVectorizer().

What am I doing wrong?

corpus = [
 'this is the first document',
 'this document is the second document',
 'and this is the third one',
 'is this the first document',
]

from collections import Counter
from tqdm import tqdm
from scipy.sparse import csr_matrix
import math
import operator
from sklearn.preprocessing import normalize
import numpy

sentence = []  # one list of tokens per document
for i in range(len(corpus)):
    sentence.append(corpus[i].split())

word_freq = {}   # for each word, the set of indices of the documents that contain it
for i in range(len(sentence)):
    tokens = sentence[i]
    for w in tokens:
        try:
            word_freq[w].add(i)  # word seen before: record this document index
        except KeyError:
            word_freq[w] = {i}  # first occurrence of the word: start a new set

for i in word_freq:
    word_freq[i] = len(word_freq[i])  # document frequency: number of documents that contain the word

def idf():
    idfDict = {}
    for word in word_freq:
        idfDict[word] = math.log(len(sentence) / word_freq[word])
    return idfDict
idfDict = idf()

Expected output (obtained from vectorizer.idf_):

[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073 1.         1.91629073 1.        ]

Actual output (each value is the idf of the corresponding key):

{'and': 1.3862943611198906,
'document': 0.28768207245178085,
'first': 0.6931471805599453,
'is': 0.0,
'one': 1.3862943611198906,
'second': 1.3862943611198906,
'the': 0.0,
'third': 1.3862943611198906,
'this': 0.0
 }

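Best answer

Several default parameters can affect what sklearn computes, but the one that seems to matter most here is:

smooth_idf: bool (default=True). Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.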
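The "extra document" reading of smooth_idf can be checked in plain Python. The sketch below assumes that the unsmoothed weight is ln(N / df) + 1, which is what sklearn documents for smooth_idf=False; the helper name doc_freq is made up for illustration. Appending one virtual document that contains every term raises N and every document frequency by one.

```python
import math

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def doc_freq(docs):
    """Hypothetical helper: number of documents that contain each term."""
    df = {}
    for doc in docs:
        for term in set(doc.split()):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return df

# Virtual document containing every term in the collection exactly once.
vocab = sorted(doc_freq(corpus))
augmented = corpus + [' '.join(vocab)]

df_aug = doc_freq(augmented)
n_aug = len(augmented)  # 5 = 4 real documents + 1 virtual document

# Unsmoothed weight ln(N / df) + 1, evaluated on the augmented corpus.
idf_aug = {t: math.log(n_aug / df_aug[t]) + 1 for t in vocab}
print(idf_aug)
```

Under that assumption, a term that occurs in all four real documents gets ln(5/5) + 1 = 1, and a term that occurs in only one gets ln(5/2) + 1.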

If you subtract 1 from each element and raise e to that power, you get values very close to 5/n for small values of n:

1.91629073 => 5/2
1.22314355 => 5/4
1.51082562 => 5/3
1 => 5/5
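The arithmetic above is easy to reproduce. This is a small check using only the standard library; the reading that 5 is the number of documents plus one and that n is the document frequency plus one is an interpretation consistent with smooth_idf, not something the listed values state by themselves.

```python
import math

# idf_ values reported by sklearn for the example corpus (taken from the question).
expected = [1.91629073, 1.22314355, 1.51082562, 1.0]

# Undo the "+ 1" and the logarithm: exp(idf - 1) recovers the ratio inside the log.
ratios = [math.exp(v - 1) for v in expected]

for v, r in zip(expected, ratios):
    print(f"{v:.8f} -> {r:.4f} = 5/{5 / r:.0f}")
```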

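In any case, there is no single tf-idf implementation; the metric you define is just a heuristic that tries to capture certain properties (for example, "a higher idf should correlate with rarity in the corpus"), so I would not worry too much about reproducing an identical implementation.

sklearn appears to use log((n_documents + 1) / (document_frequency + 1)) + 1 with the natural logarithm, where n_documents is the number of documents in the corpus and document_frequency is the number of documents that contain the word. This is as if one extra document contained every word in the corpus.

Edit: the last paragraph is confirmed by the docstring of TfidfTransformer.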
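A from-scratch version of that formula might look like the sketch below. It is a minimal illustration, not sklearn's code: the function name smoothed_idf is made up here, and tokenization is a plain whitespace split, which is enough for this corpus but differs from sklearn's default tokenizer in general.

```python
import math

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def smoothed_idf(docs):
    """Sketch of idf(t) = ln((1 + N) / (1 + df(t))) + 1, with N = number of documents."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc.split()):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    # Alphabetical order, so the values line up with a sorted vocabulary.
    return {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in sorted(df)}

idf = smoothed_idf(corpus)
for term, value in idf.items():
    print(f"{term:10s}{value:.8f}")
```

The terms are emitted in alphabetical order so they can be compared position by position with vectorizer.idf_. The only change relative to the question's idf() is the added 1 in the numerator, in the denominator, and after the logarithm.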

Regarding "python - Implementing a TF-IDF vectorizer from scratch", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57749696/
