
python - TfIdf Vectorizer weights


Hello, I have a lemmatized text in the format shown in lemma below. I want to get the TfIdf score for each word; this is the function I wrote:

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

lemma=["'Ah", 'yes', u'say', 'softly', 'Harry',
'Potter', 'Our', 'new', 'celebrity', 'You',
'learn', 'subtle', 'science', 'exact', 'art',
'potion-making', u'begin', 'He', u'speak', 'barely',
'whisper', 'caught', 'every', 'word', 'like',
'Professor', 'McGonagall', 'Snape', 'gift',
u'keep', 'class', 'silent', 'without', 'effort',
'As', 'little', 'foolish', 'wand-waving', 'many',
'hardly', 'believe', 'magic', 'I', 'dont', 'expect', 'really',
'understand', 'beauty']

def Tfidf_Vectorize(lemmas_name):

    vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
    vect_transform = vect.fit_transform(lemmas_name)

    # First approach of creating a dataframe of weight & feature names
    vect_score = np.asarray(vect_transform.mean(axis=0)).ravel().tolist()
    vect_array = pd.DataFrame({'term': vect.get_feature_names(), 'weight': vect_score})
    vect_array.sort_values(by='weight', ascending=False, inplace=True)

    # Second approach of getting the feature names
    vect_fn = np.array(vect.get_feature_names())
    sorted_tfidf_index = vect_transform.max(0).toarray()[0].argsort()

    print('Largest Tfidf:\n{}\n'.format(vect_fn[sorted_tfidf_index[:-11:-1]]))

    return vect_array

tf_dataframe=Tfidf_Vectorize(lemma)
print(tf_dataframe.iloc[:5,:])

The output I get for

print('Largest Tfidf:\n{}\n'.format(vect_fn[sorted_tfidf_index[:-11:-1]]))

is:

Largest Tfidf:
[u'yes' u'fools' u'fury' u'gale' u'ghosts' u'gift' u'glory' u'glow' u'good'
u'granger']

The result of tf_dataframe:

           term    weight
261       snape  0.027875
238         say  0.022648
211      potter  0.013937
181        mind  0.010453
123       harry  0.010453
60         dark  0.006969
75   dumbledore  0.006969
311       voice  0.005226
125        head  0.005226
231         ron  0.005226

Shouldn't both approaches produce the same top features? I just want to compute the tf-idf scores and get the top 5 features/weights. What am I doing wrong?

Best Answer

I'm not quite sure what I'm looking at here, but I have the feeling you're using TfidfVectorizer incorrectly. However, please correct me if I've got the wrong idea about what you're trying to do.

So, what you need is a list of documents to pass to fit_transform(). From that you can build a matrix in which, for example, each column represents a document and each row represents a word. One cell of that matrix is the tf-idf score of word i in document j.

Here is an example:

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
"This is a document.",
"This is another document with slightly more text.",
"Whereas this is yet another document with even more text than the other ones.",
"This document is awesome and also rather long.",
"The car he drove was red."
]

document_names = ['Doc {:d}'.format(i) for i in range(len(documents))]

def get_tfidf(docs, ngram_range=(1, 1), index=None):
    vect = TfidfVectorizer(stop_words='english', ngram_range=ngram_range)
    tfidf = vect.fit_transform(docs).todense()
    return pd.DataFrame(tfidf, columns=vect.get_feature_names(), index=index).T

print(get_tfidf(documents, ngram_range=(1,2), index=document_names))

This gives you:

                      Doc 0     Doc 1     Doc 2     Doc 3     Doc 4
awesome                 0.0  0.000000  0.000000  0.481270  0.000000
awesome long            0.0  0.000000  0.000000  0.481270  0.000000
car                     0.0  0.000000  0.000000  0.000000  0.447214
car drove               0.0  0.000000  0.000000  0.000000  0.447214
document                1.0  0.282814  0.282814  0.271139  0.000000
document awesome        0.0  0.000000  0.000000  0.481270  0.000000
document slightly       0.0  0.501992  0.000000  0.000000  0.000000
document text           0.0  0.000000  0.501992  0.000000  0.000000
drove                   0.0  0.000000  0.000000  0.000000  0.447214
drove red               0.0  0.000000  0.000000  0.000000  0.447214
long                    0.0  0.000000  0.000000  0.481270  0.000000
ones                    0.0  0.000000  0.501992  0.000000  0.000000
red                     0.0  0.000000  0.000000  0.000000  0.447214
slightly                0.0  0.501992  0.000000  0.000000  0.000000
slightly text           0.0  0.501992  0.000000  0.000000  0.000000
text                    0.0  0.405004  0.405004  0.000000  0.000000
text ones               0.0  0.000000  0.501992  0.000000  0.000000

The two approaches you showed for getting the words and their respective scores compute, respectively, the mean over all documents and the maximum score of each word.

So let's do that and compare the two approaches:

df = get_tfidf(documents, ngram_range=(1, 2), index=document_names)

print(pd.DataFrame([df.mean(1), df.max(1)], index=['score_mean', 'score_max']).T)

We can see that the scores are, of course, different:

                   score_mean  score_max
awesome              0.096254   0.481270
awesome long         0.096254   0.481270
car                  0.089443   0.447214
car drove            0.089443   0.447214
document             0.367353   1.000000
document awesome     0.096254   0.481270
document slightly    0.100398   0.501992
document text        0.100398   0.501992
drove                0.089443   0.447214
drove red            0.089443   0.447214
long                 0.096254   0.481270
ones                 0.100398   0.501992
red                  0.089443   0.447214
slightly             0.100398   0.501992
slightly text        0.100398   0.501992
text                 0.162002   0.405004
text ones            0.100398   0.501992
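
If all you want is the top 5 features/weights from the question, a minimal sketch (assuming the df returned by get_tfidf above) is to sort either aggregate and take the first rows; note that the two rankings need not agree:

top5_by_mean = df.mean(1).sort_values(ascending=False).head(5)  # top 5 terms by mean tf-idf
top5_by_max = df.max(1).sort_values(ascending=False).head(5)    # top 5 terms by maximum tf-idf

print(top5_by_mean)
print(top5_by_max)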

Note:

You can convince yourself that this is the same thing as calling mean/max directly on the matrix returned by the TfidfVectorizer:

vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
tfidf = vect.fit_transform(documents)
print(tfidf.max(0))   # maximum tf-idf of each term across all documents
print(tfidf.mean(0))  # mean tf-idf of each term across all documents
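
As a follow-up, here is a sketch of how those maximum scores can be paired with the feature names to reproduce a "Largest Tfidf" listing like the one in the question (this assumes the vect and tfidf objects above; in recent scikit-learn versions get_feature_names() is replaced by get_feature_names_out()):

# Sketch: pair each feature with its maximum tf-idf score and print the ten largest.
feature_names = np.array(vect.get_feature_names())
max_scores = tfidf.max(0).toarray().ravel()
top = max_scores.argsort()[::-1][:10]
for term, score in zip(feature_names[top], max_scores[top]):
    print('{:20s} {:.6f}'.format(term, score))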

Regarding python - TfIdf Vectorizer weights, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47932178/
