
python - Calculating IDF with TfidfVectorizer from sklearn.feature_extraction.text.TfidfVectorizer

Reposted. Author: 太空狗. Updated: 2023-10-29 21:45:41

I think the function TfidfVectorizer is not calculating the IDF factor correctly. For example, copying the code from "tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer":

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(
    use_idf=True,  # use idf as the weight, i.e. compute tf*idf
    norm=None,  # vector normalization (None disables it)
    smooth_idf=False,  # when True, adds 1 to N and to ni => idf = ln((N+1) / (ni+1))
    sublinear_tf=False,  # when True, tf = 1 + ln(tf)
    binary=False,
    min_df=1, max_df=1.0, max_features=None,
    strip_accents='unicode',  # strip accents
    ngram_range=(1,1), preprocessor=None, stop_words=None, tokenizer=None, vocabulary=None
)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print(dict(zip(vectorizer.get_feature_names(), idf)))

The output is:

{u'is': 1.0,
u'nice': 1.6931471805599454,
u'strange': 1.6931471805599454,
u'this': 1.0,
u'very': 1.0}

But it should be:

{u'is': 0.0,
u'nice': 0.6931471805599454,
u'strange': 0.6931471805599454,
u'this': 0.0,
u'very': 0.0}

Shouldn't it? What am I doing wrong?

The IDF calculation, according to http://www.tfidf.com/, is:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)

So, since the terms "this", "is" and "very" appear in both sentences, IDF = log_e(2/2) = 0.

The terms "strange" and "nice" appear in only one of the two documents, so IDF = log_e(2/1) = 0.69314.
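The hand calculation above can be checked with a few lines of plain Python. This is only a sketch of the tfidf.com definition, not of anything sklearn does: it assumes lowercased, whitespace-split tokens, and the helper name `textbook_idf` is made up for illustration.

```python
import math
from collections import Counter

corpus = ["This is very strange", "This is very nice"]


def textbook_idf(docs):
    """IDF(t) = ln(N / df(t)), with no smoothing and no +1 offset."""
    n_docs = len(docs)
    # Count each term once per document to get its document frequency.
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    return {term: math.log(n_docs / count) for term, count in df.items()}


idf = textbook_idf(corpus)
for term in sorted(idf):
    print(term, idf[term])
```

Terms that occur in both documents come out as 0, and terms that occur in only one come out as ln(2).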

Best answer

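Two things are happening in the sklearn implementation that you may not expect:

  1. TfidfTransformer has smooth_idf=True as a default parameter
  2. It always adds 1 to the weight

So by default it is using:

idf = log((1 + samples) / (1 + documents)) + 1

With smooth_idf=False, as in the question, this reduces to idf = log(samples / documents) + 1, which gives exactly the 1.0 and 1.6931 values shown in the output.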
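To make the two effects concrete, here is a small plain-Python sketch. It assumes the behaviour is ln((1 + N) / (1 + df)) + 1 with smoothing and ln(N / df) + 1 without, which is how the source excerpt quoted in this answer computes it; the function name `sklearn_style_idf` is hypothetical and not part of sklearn.

```python
import math


def sklearn_style_idf(n_docs, df, smooth_idf=True):
    """Optional +1 smoothing of both counts, then ln(N / df) + 1."""
    if smooth_idf:
        n_docs += 1
        df += 1
    return math.log(n_docs / df) + 1.0


# The question uses smooth_idf=False with 2 documents.
print(sklearn_style_idf(2, 2, smooth_idf=False))  # "this", "is", "very"
print(sklearn_style_idf(2, 1, smooth_idf=False))  # "strange", "nice"

# With the default smoothing, rare terms get a different weight.
print(sklearn_style_idf(2, 1, smooth_idf=True))
```

In both variants the trailing + 1 is what keeps a term that occurs in every document from ending up with a weight of 0.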

Here is the source:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L987-L992

Edit: You can subclass the standard TfidfTransformer class (the component TfidfVectorizer uses internally to compute the IDF weights) like this:

import scipy.sparse as sp
import numpy as np
from sklearn.feature_extraction.text import (TfidfTransformer,
                                             _document_frequency)


# fit() below mirrors TfidfTransformer.fit and relies on private internals
# (_document_frequency, _idf_diag) that can change between scikit-learn versions.
class PriscillasTfidfTransformer(TfidfTransformer):

    def fit(self, X, y=None):
        """Learn the idf vector (global term weights)

        Parameters
        ----------
        X : sparse matrix, [n_samples, n_features]
            a matrix of term/token counts
        """
        if not sp.issparse(X):
            X = sp.csc_matrix(X)
        if self.use_idf:
            n_samples, n_features = X.shape
            df = _document_frequency(X)

            # perform idf smoothing if required
            df += int(self.smooth_idf)
            n_samples += int(self.smooth_idf)

            # log+1 instead of log makes sure terms with zero idf don't get
            # suppressed entirely.
            ####### + 1 is commented out ##########################
            idf = np.log(float(n_samples) / df)  # + 1.0
            #######################################################
            self._idf_diag = sp.spdiags(idf,
                                        diags=0, m=n_features, n=n_features)

        return self
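If only the reported IDF values matter, a lighter alternative that avoids private internals is to fit with smooth_idf=False and subtract the constant 1 afterwards. The sketch below uses only the public `idf_` and `vocabulary_` attributes; the variable name `textbook_idf` is just illustrative, and this is not the approach the answer above takes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange", "This is very nice"]

# With smooth_idf=False, idf_ holds ln(N / df) + 1, so only the +1 remains.
vectorizer = TfidfVectorizer(use_idf=True, norm=None, smooth_idf=False)
vectorizer.fit(corpus)

# vocabulary_ maps each term to its column index in idf_.
textbook_idf = {term: float(vectorizer.idf_[index]) - 1.0
                for term, index in vectorizer.vocabulary_.items()}

for term in sorted(textbook_idf):
    print(term, textbook_idf[term])
```

Note that this only adjusts the reported weights; the matrix returned by transform is still computed with the + 1 version.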

Regarding "python - Calculating IDF with TfidfVectorizer from sklearn.feature_extraction.text.TfidfVectorizer", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36756335/
