
python - TfidfVectorizer: removing features with a tf-idf score of zero

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:36:13

I want to cluster documents using Python. First, I build the document x term matrix of tf-idf scores as follows:

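from sklearn.feature_extraction.text import TfidfVectorizer

# tokenize_and_stem is the asker's own tokenizer (not shown in the question)
tfidf_vectorizer_desc = TfidfVectorizer(min_df=1, max_df=0.9, use_idf=True, tokenizer=tokenize_and_stem)
%time tfidf_matrix_desc = tfidf_vectorizer_desc.fit_transform(descriptions)  # fit the vectorizer to the text
desc_feature_names = tfidf_vectorizer_desc.get_feature_names_out()  # get_feature_names() on scikit-learn older than 1.0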

The matrix has shape (1510, 6862).

The score of each term for the first document:

dense = tfidf_matrix_desc.todense()
print(len(dense[0].tolist()[0]))
dataset0 = dense[0].tolist()[0]
phrase_scores = [pair for pair in zip(range(0, len(dataset0)), dataset0) if pair[1] > 0]
print(len(phrase_scores))

Output:

  • print(len(dense[0].tolist()[0])) -> 6862
  • print(len(phrase_scores)) -> 48 (the first document has only 48 terms with a score greater than 0.0)
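The same non-zero entries can also be read straight from the sparse matrix, without calling todense(). The following is only a minimal sketch, assuming the matrix is a SciPy CSR matrix (which is what fit_transform typically returns); the small matrix X is a made-up stand-in for tfidf_matrix_desc, not the asker's data:

```python
import numpy as np
from scipy import sparse

# Made-up stand-in for tfidf_matrix_desc (2 documents x 4 terms).
X = sparse.csr_matrix(np.array([[0.5, 0.0, 0.2, 0.0],
                                [0.0, 0.3, 0.0, 0.0]]))

row = X[0]  # first document as a 1 x n_terms sparse row
# row.indices holds the column ids of the stored entries, row.data their scores.
phrase_scores = [(int(col), float(score))
                 for col, score in zip(row.indices, row.data)
                 if score > 0]

print(X.shape[1])          # total number of terms
print(len(phrase_scores))  # terms scoring above 0 in the first document
```

On a 1510 x 6862 matrix this avoids materialising roughly ten million dense cells just to inspect one row.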

Now I want to identify, from the matrix, all features (terms) whose tf-idf score is 0 for the given dataset. How can I achieve this?

for col in tfidf_matrix_desc.nonzero()[1]:
    print(desc_feature_names[col], ' - ', tfidf_matrix_desc[0, col])
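One way to phrase "score of 0 for the whole dataset" is a column with no non-zero entry in any document. Below is a rough sketch of that check on a made-up matrix; X and feature_names are illustrative stand-ins, not the asker's variables. Note that when the vectorizer is fitted and applied to the same corpus, every vocabulary term occurs in at least one document, so one would generally expect this list to come back empty unless a threshold is applied first:

```python
import numpy as np
from scipy import sparse

# Made-up stand-ins for the tf-idf matrix and its feature names.
X = sparse.csr_matrix(np.array([[0.5, 0.0, 0.2, 0.0],
                                [0.0, 0.0, 0.3, 0.0]]))
feature_names = np.array(["alpha", "beta", "gamma", "delta"])

Xc = X.tocsc()         # column-oriented copy; X itself is left unchanged
Xc.eliminate_zeros()   # make sure explicitly stored zeros are not counted
entries_per_col = np.diff(Xc.indptr)  # number of non-zero entries per column

zero_cols = np.flatnonzero(entries_per_col == 0)
print(feature_names[zero_cols])  # terms that are zero in every document
```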

Best answer

In case anyone needs something similar, this is what I use:

import numpy as np

# Xtr is the output sparse matrix from TfidfVectorizer
# min_tfidf is a threshold for defining the "new" 0
def remove_zero_tf_idf(Xtr, min_tfidf=0.04):
    D = Xtr.toarray()  # convert to dense if you want
    D[D < min_tfidf] = 0  # treat scores below the threshold as zero
    tfidf_means = np.mean(D, axis=0)  # a column mean of 0 marks a feature that is 0 in all documents
    D = np.delete(D, np.where(tfidf_means == 0)[0], axis=1)  # delete them from the matrix
    return D
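For larger matrices the same idea can be kept sparse. This is only a sketch of a possible variant, not part of the original answer: the function name drop_low_tfidf_columns and the toy data are hypothetical, and it additionally returns the kept column indices so that feature names can be filtered to match.

```python
import numpy as np
from scipy import sparse

def drop_low_tfidf_columns(Xtr, min_tfidf=0.04):
    """Zero out scores below min_tfidf, then drop columns left empty.

    Returns the reduced sparse matrix and the indices of the kept columns.
    """
    X = sparse.csc_matrix(Xtr, copy=True)  # copy so the caller's matrix is untouched
    X.data[X.data < min_tfidf] = 0.0       # apply the threshold to the stored scores
    X.eliminate_zeros()                    # drop the entries just set to zero
    keep = np.flatnonzero(np.diff(X.indptr) > 0)  # columns with an entry left
    return X[:, keep].tocsr(), keep

# Toy data (hypothetical): the middle term never reaches the threshold.
X_toy = sparse.csr_matrix(np.array([[0.5, 0.01, 0.0],
                                    [0.0, 0.02, 0.3]]))
names = np.array(["alpha", "beta", "gamma"])

reduced, keep = drop_low_tfidf_columns(X_toy)
kept_names = names[keep]
print(reduced.shape, kept_names)
```

Indexing the feature names with the returned indices keeps them aligned with the reduced matrix, which np.delete on the dense array alone does not give you.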

Regarding python - TfidfVectorizer: removing features with a tf-idf score of zero, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37176378/
