
python - How to get the average TF-IDF value of a word across a corpus?


I am trying to get the average TF-IDF value of a word across the entire corpus. Say the word "stack" appears in 4 documents of our corpus (a few hundred documents), with TF-IDF values of 0.34, 0.45, 0.68 and 0.78 in those 4 documents. Its average TF-IDF value over the whole corpus is therefore 0.5625. How can I compute this for every word in the documents?
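
As a quick check of that definition, the average is taken only over the documents in which the word actually appears, not over all of the documents in the corpus:

import numpy as np

# Average TF-IDF of "stack" over the 4 documents it appears in
np.mean([0.34, 0.45, 0.68, 0.78])   # 0.5625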

I am using scikit-learn's TF-IDF implementation. Here is the code I use to get the TF-IDF values for each document:

for i in docs_test:
    feature_names = cv.get_feature_names()

    doc = docs_test[itr]
    itr += 1
    tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))

    sorted_items = sort_coo(tf_idf_vector.tocoo())

    # Extract the top 81 keywords along with their TF-IDF scores
    keywords = extract_topn_from_vector(feature_names, sorted_items, 81)

Each iteration prints a dictionary of 81 words with their TF-IDF scores for that document: {'kerry': 0.396, 'paris': 0.278, 'france': 0.252, ...}

Since I only output the top 81 words, I know this does not cover every word in each document. So what I want is the average TF-IDF value for each of those top-81 words across the documents (the words repeat across documents).
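
For reference, one way to get a single average per word over the whole corpus is to transform all of the documents at once and aggregate column-wise. This is only a minimal sketch; it assumes cv (the CountVectorizer) and tfidf_transformer (the TfidfTransformer) from the snippet above are already fitted on the corpus, and that docs_test is the list of document strings:

import numpy as np

# Sketch: cv and tfidf_transformer are assumed to be fitted already (see above)
feature_names = cv.get_feature_names()   # get_feature_names_out() on scikit-learn >= 1.0
tfidf_matrix = tfidf_transformer.transform(cv.transform(docs_test))   # sparse, shape (n_docs, n_words)

# Sum of TF-IDF scores per word, and the number of documents each word appears in
sums = np.asarray(tfidf_matrix.sum(axis=0)).ravel()
doc_counts = np.asarray((tfidf_matrix > 0).sum(axis=0)).ravel()

# Average TF-IDF of each word over the documents it actually appears in
avg_tfidf = {word: s / c for word, s, c in zip(feature_names, sums, doc_counts) if c > 0}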

Edit: I tried @mijjiga's solution. Here are the results:

{'the': 0.51203095036175, 'to': 0.36268858983957286, 'of': 0.3200193439760937, 'in': 0.256015475180875, 'he': 0.2133462293173958}
{'the': 0.5076730825668095, 'to': 0.3299875036684262, 'in': 0.3299875036684262, 'and': 0.30460384954008574, 'trump': 0.17768557889838335}
{'the': 0.5257856140532874, 'children': 0.292103118918493, 'to': 0.2336824951347944, 'winton': 0.2336824951347944, 'of': 0.2336824951347944}
{'the': 0.6082672845890075, 'to': 0.3146210092701763, 'trump': 0.2936462753188312, 'that': 0.23911196704533397, 'of': 0.21394228630371986}
{'the': 0.6285692218670833, 'to': 0.3610929572427925, 'of': 0.2139810116994326, 'that': 0.20060719846821806, 'iran': 0.18723338523700353}
{'the': 0.5730922466510651, 'clinton': 0.29578954665861423, 'of': 0.24032900666012408, 'in': 0.2218421599939607, 'that': 0.2218421599939607}
{'the': 0.7509270472649924, 'to': 0.34926839407674065, 'trump': 0.17463419703837033, 'of': 0.17463419703837033, 'delegates': 0.1571707773345333}
{'on': 0.4, 'administration': 0.2, 'through': 0.2, 'the': 0.2, 'tax': 0.2}
{'the': 0.5885277950982455, 'in': 0.3184973949943446, 'of': 0.3046496821685035, 'to': 0.29080196934266245, 'women': 0.2769542565168214}

As you can see, the word 'the' has multiple values. I apologize if my question did not make this clear, but I want a single value per word: the average TF-IDF score of that word over the corpus of documents. Any help getting this to work? Thanks!

Here is the code I used:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

itr = 0
for i in range(1, 10):
    docs = docs_test[itr]
    docs = [docs]
    itr += 1
    tfidf_vectorizer = TfidfVectorizer(use_idf=True)
    tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(docs)

    tfidf = tfidf_vectorizer_vectors.todense()
    # TF-IDF of words not in the doc will be 0, so replace them with nan
    tfidf[tfidf == 0] = np.nan
    # Use numpy's nanmean, which ignores nan while computing the mean
    means = np.nanmean(tfidf, axis=0)
    # Convert it into a dictionary for later lookup
    means = dict(zip(tfidf_vectorizer.get_feature_names(), means.tolist()[0]))

    tfidf = tfidf_vectorizer_vectors.todense()
    # Argsort the full TF-IDF dense matrix (descending)
    ordered = np.argsort(tfidf * -1)
    words = tfidf_vectorizer.get_feature_names()

    top_k = 5
    for i, doc in enumerate(docs):
        result = {}
        # Pick the top_k entries from each argsorted row
        for t in range(top_k):
            # Look up the word's average TF-IDF from the precomputed dictionary
            result[words[ordered[i, t]]] = means[words[ordered[i, t]]]
        print(result)

Best answer

The documentation is inline in the code.

from sklearn.feature_extraction.text import TfidfVectorizer 
import numpy as np

docs = ["the house had a tiny little mouse",
        "the cat saw the mouse",
        "the mouse ran away from the house",
        "the cat finally ate the mouse",
        "the end of the mouse story"
        ]

tfidf_vectorizer=TfidfVectorizer(use_idf=True)
tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(docs)

tfidf = tfidf_vectorizer_vectors.todense()
# TFIDF of words not in the doc will be 0, so replace them with nan
tfidf[tfidf == 0] = np.nan
# Use nanmean of numpy which will ignore nan while calculating the mean
means = np.nanmean(tfidf, axis=0)
# convert it into a dictionary for later lookup
means = dict(zip(tfidf_vectorizer.get_feature_names(), means.tolist()[0]))

tfidf = tfidf_vectorizer_vectors.todense()
# Argsort the full TFIDF dense vector
ordered = np.argsort(tfidf*-1)
words = tfidf_vectorizer.get_feature_names()

top_k = 5
for i, doc in enumerate(docs):
    result = {}
    # Pick top_k from each argsorted row for each doc
    for t in range(top_k):
        # Pick the top-k word and look up its average TF-IDF
        # from the precomputed means dictionary
        result[words[ordered[i, t]]] = means[words[ordered[i, t]]]
    print(result)

Output

{'had': 0.4935620852501244, 'little': 0.4935620852501244, 'tiny': 0.4935620852501244, 'house': 0.38349121689490395, 'mouse': 0.24353457958557367}
{'saw': 0.5990921556092994, 'the': 0.4400321635416817, 'cat': 0.44898681252620987, 'mouse': 0.24353457958557367, 'ate': 0.5139230069660121}
{'away': 0.4570928721125019, 'from': 0.4570928721125019, 'ran': 0.4570928721125019, 'the': 0.4400321635416817, 'house': 0.38349121689490395}
{'ate': 0.5139230069660121, 'finally': 0.5139230069660121, 'the': 0.4400321635416817, 'cat': 0.44898681252620987, 'mouse': 0.24353457958557367}
{'end': 0.4917531872315962, 'of': 0.4917531872315962, 'story': 0.4917531872315962, 'the': 0.4400321635416817, 'mouse': 0.24353457958557367}
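
Because the vectorizer is fitted once on the whole docs list here, means contains exactly one average TF-IDF value per vocabulary word, which is what the question asks for; the repeated values in the output above (e.g. for 'mouse') are that single average being looked up from several documents. A quick usage check on this toy corpus:

print(means['mouse'])   # 0.2435... - a single value, averaged over the documents containing "mouse"
print(means['the'])     # 0.4400... - likewise for "the"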

Let's decode result[words[ordered[i, t]]] = means[words[ordered[i, t]]]:

If the vocabulary size is v and the number of documents is n:

  • ordered is a matrix of size n x v. Its entries are indices into the vocabulary, and each row is sorted by that document's TF-IDF scores.
  • words is the list of vocabulary words, of size v. Think of it as an id-to-word mapper.
  • means is a dictionary of size v; each value is a word's average TF-IDF (the sketch after this list applies the same construction to the asker's own corpus).
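
The loop in the question's edit refits a new TfidfVectorizer on one document at a time, so every document acts as its own corpus and a word like 'the' gets a different value per document. A minimal sketch of the fix, fitting once on the whole corpus (docs_test is assumed to be the asker's list of document strings):

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Fit once on the entire corpus instead of refitting inside a per-document loop
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tfidf = tfidf_vectorizer.fit_transform(docs_test).todense()
# Ignore zero entries (words absent from a document) when averaging
tfidf[tfidf == 0] = np.nan
means = dict(zip(tfidf_vectorizer.get_feature_names(),   # get_feature_names_out() on scikit-learn >= 1.0
                 np.nanmean(tfidf, axis=0).tolist()[0]))
# means now maps every word in the corpus to a single average TF-IDF value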

For python - How to get the average TF-IDF value of a word across a corpus?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57769123/
