Hi, I have a lemmatized text in the format shown in lemma below. I want to get the TfIdf score of each word; this is the function I wrote:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
lemma=["'Ah", 'yes', u'say', 'softly', 'Harry',
'Potter', 'Our', 'new', 'celebrity', 'You',
'learn', 'subtle', 'science', 'exact', 'art',
'potion-making', u'begin', 'He', u'speak', 'barely',
'whisper', 'caught', 'every', 'word', 'like',
'Professor', 'McGonagall', 'Snape', 'gift',
u'keep', 'class', 'silent', 'without', 'effort',
'As', 'little', 'foolish', 'wand-waving', 'many',
'hardly', 'believe', 'magic', 'I', 'dont', 'expect', 'really',
'understand', 'beauty']
def Tfidf_Vectorize(lemmas_name):
    vect = TfidfVectorizer(stop_words='english',ngram_range=(1,2))
    vect_transform = vect.fit_transform(lemmas_name)
    # First approach of creating a dataframe of weight & feature names
    vect_score = np.asarray(vect_transform.mean(axis=0)).ravel().tolist()
    vect_array = pd.DataFrame({'term': vect.get_feature_names(), 'weight': vect_score})
    vect_array.sort_values(by='weight',ascending=False,inplace=True)
    # Second approach of getting the feature names
    vect_fn = np.array(vect.get_feature_names())
    sorted_tfidf_index = vect_transform.max(0).toarray()[0].argsort()
    print('Largest Tfidf:\n{}\n'.format(vect_fn[sorted_tfidf_index[:-11:-1]]))
    return vect_array
tf_dataframe=Tfidf_Vectorize(lemma)
print(tf_dataframe.iloc[:5,:])
The output I get from
print('Largest Tfidf:\n{}\n'.format(vect_fn[sorted_tfidf_index[:-11:-1]]))
is
Largest Tfidf:
[u'yes' u'fools' u'fury' u'gale' u'ghosts' u'gift' u'glory' u'glow' u'good'
u'granger']
and tf_dataframe gives:
term weight
261 snape 0.027875
238 say 0.022648
211 potter 0.013937
181 mind 0.010453
123 harry 0.010453
60 dark 0.006969
75 dumbledore 0.006969
311 voice 0.005226
125 head 0.005226
231 ron 0.005226
Shouldn't the two approaches give the same top features? I just want to compute the tfidf scores and get the top 5 features/weights. What am I doing wrong?
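I'm not sure what I'm looking at here, but I have the feeling you are using TfidfVectorizer incorrectly. Please correct me if I have misunderstood what you are trying to do.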
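To make that concrete, here is a minimal sketch of what seems to happen when a list of single lemmas is passed in. The short `lemmas` list is just a hand-picked subset of the question's data (with 'say' repeated), so treat it as an illustration only. Each string in the list is treated as its own document, so every lemma becomes a one-word document and no bigram can ever be formed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A few lemmas from the question, passed the same way: one string per list item.
lemmas = ['yes', 'say', 'softly', 'Harry', 'Potter', 'Snape', 'say']

vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
as_words = vect.fit_transform(lemmas)
# One row per list item: each lemma is its own one-word "document".
print(as_words.shape)
# No vocabulary entry contains a space, i.e. no bigrams were found.
print(sorted(vect.vocabulary_))

# Joining the lemmas gives a single document that does contain bigrams.
vect_joined = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
as_document = vect_joined.fit_transform([' '.join(lemmas)])
print(as_document.shape)
print(sorted(vect_joined.vocabulary_))
```

With only one document the idf part carries no information, so in practice you would probably want several documents (sentences, paragraphs, chapters) rather than one joined string.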
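So.. what you need is a list of documents to pass to fit_transform(). From that you can build a matrix where, for example, each column represents a document and each row a word. A cell in that matrix is then the tf-idf score of word i in document j.
Here is an example: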
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
"This is a document.",
"This is another document with slightly more text.",
"Whereas this is yet another document with even more text than the other ones.",
"This document is awesome and also rather long.",
"The car he drove was red."
]
document_names = ['Doc {:d}'.format(i) for i in range(len(documents))]
def get_tfidf(docs, ngram_range=(1,1), index=None):
    vect = TfidfVectorizer(stop_words='english', ngram_range=ngram_range)
    # Fit on the docs passed in and convert the sparse result to a dense array.
    tfidf = vect.fit_transform(docs).toarray()
    # On scikit-learn older than 1.0, use vect.get_feature_names() instead.
    return pd.DataFrame(tfidf, columns=vect.get_feature_names_out(), index=index).T
print(get_tfidf(documents, ngram_range=(1,2), index=document_names))
This gives you:
                   Doc 0     Doc 1     Doc 2     Doc 3     Doc 4
awesome              0.0  0.000000  0.000000  0.481270  0.000000
awesome long         0.0  0.000000  0.000000  0.481270  0.000000
car                  0.0  0.000000  0.000000  0.000000  0.447214
car drove            0.0  0.000000  0.000000  0.000000  0.447214
document             1.0  0.282814  0.282814  0.271139  0.000000
document awesome     0.0  0.000000  0.000000  0.481270  0.000000
document slightly    0.0  0.501992  0.000000  0.000000  0.000000
document text        0.0  0.000000  0.501992  0.000000  0.000000
drove                0.0  0.000000  0.000000  0.000000  0.447214
drove red            0.0  0.000000  0.000000  0.000000  0.447214
long                 0.0  0.000000  0.000000  0.481270  0.000000
ones                 0.0  0.000000  0.501992  0.000000  0.000000
red                  0.0  0.000000  0.000000  0.000000  0.447214
slightly             0.0  0.501992  0.000000  0.000000  0.000000
slightly text        0.0  0.501992  0.000000  0.000000  0.000000
text                 0.0  0.405004  0.405004  0.000000  0.000000
text ones            0.0  0.000000  0.501992  0.000000  0.000000
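If you want to see where a single cell comes from, the 'Doc 1' column can be recomputed by hand. This is only a sketch and assumes the vectorizer's default settings (smooth_idf=True, sublinear_tf=False, norm='l2'), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each document vector is divided by its Euclidean length. The names `manual`, `doc_terms` and `analyze` are mine:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is a document.",
    "This is another document with slightly more text.",
    "Whereas this is yet another document with even more text than the other ones.",
    "This document is awesome and also rather long.",
    "The car he drove was red."
]

vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
tfidf = vect.fit_transform(documents).toarray()

# The analyzer lowercases, tokenizes, drops stop words and builds the n-grams.
analyze = vect.build_analyzer()
terms = analyze(documents[1])
doc_terms = [set(analyze(d)) for d in documents]
n_docs = len(documents)

manual = {}
for t in set(terms):
    tf = terms.count(t)                          # raw term count in Doc 1
    df = sum(t in s for s in doc_terms)          # documents containing the term
    idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed idf (assumed default)
    manual[t] = tf * idf

# L2-normalize the document vector (assumed default norm='l2').
norm = np.sqrt(sum(v ** 2 for v in manual.values()))
manual = {t: v / norm for t, v in manual.items()}

for t, v in sorted(manual.items()):
    print('{:<20s} manual={:.6f} sklearn={:.6f}'.format(
        t, v, tfidf[1, vect.vocabulary_[t]]))
```

The two columns printed should agree, and 'document' should come out lower than 'slightly' because it appears in four of the five documents.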
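The two ways you showed for getting the words and their scores do different things: the first computes the mean over all documents, the second takes the max score of each word.
So let's do that and compare the two methods: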
df = get_tfidf(documents, ngram_range=(1,2), index=document_names)
print(pd.DataFrame([df.mean(1), df.max(1)], index=['score_mean', 'score_max']).T)
We can see that the scores are, of course, different:
                   score_mean  score_max
awesome              0.096254   0.481270
awesome long         0.096254   0.481270
car                  0.089443   0.447214
car drove            0.089443   0.447214
document             0.367353   1.000000
document awesome     0.096254   0.481270
document slightly    0.100398   0.501992
document text        0.100398   0.501992
drove                0.089443   0.447214
drove red            0.089443   0.447214
long                 0.096254   0.481270
ones                 0.100398   0.501992
red                  0.089443   0.447214
slightly             0.100398   0.501992
slightly text        0.100398   0.501992
text                 0.162002   0.405004
text ones            0.100398   0.501992
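Since the question asks for the top 5 features, here is one way that could look with pandas. It is a sketch rather than the only way to do it: I build the frame with documents as rows, and recover the term names from `vocabulary_` so it does not depend on which feature-name method your scikit-learn version has. `top_mean` and `top_max` are names I made up:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is a document.",
    "This is another document with slightly more text.",
    "Whereas this is yet another document with even more text than the other ones.",
    "This document is awesome and also rather long.",
    "The car he drove was red."
]

vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
tfidf = vect.fit_transform(documents).toarray()

# vocabulary_ maps term -> column index; sorting by index gives column order.
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
df = pd.DataFrame(tfidf, columns=terms)   # rows are documents, columns are terms

top_mean = df.mean(axis=0).nlargest(5)    # ranking by average score
top_max = df.max(axis=0).nlargest(5)      # ranking by best single-document score
print(top_mean)
print(top_max)
```

On this toy corpus 'text' makes the top 5 by mean but not by max, which is the same kind of disagreement you saw between your two approaches. Several terms tie, so which of the tied terms shows up depends on how nlargest breaks ties.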
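Note:
You can convince yourself that this does the same as calling max/mean directly on the sparse matrix returned by
TfidfVectorizer's fit_transform():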
vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
tfidf = vect.fit_transform(documents)
print(tfidf.max(0))
print(tfidf.mean(0))
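The two print calls above show a sparse matrix and a 1 x n matrix, which are a bit awkward to compare by eye. A small sketch of how they could be flattened into plain 1-D arrays and checked against the dense computation (the variable names `col_max`, `col_mean` and `dense` are just for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is a document.",
    "This is another document with slightly more text.",
    "Whereas this is yet another document with even more text than the other ones.",
    "This document is awesome and also rather long.",
    "The car he drove was red."
]

vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
tfidf = vect.fit_transform(documents)      # sparse, documents x terms

# Column-wise max comes back sparse; column-wise mean comes back dense but 2-D.
col_max = tfidf.max(axis=0).toarray().ravel()
col_mean = np.asarray(tfidf.mean(axis=0)).ravel()

# Same statistics computed on the dense array for comparison.
dense = tfidf.toarray()
print(np.allclose(col_max, dense.max(axis=0)))
print(np.allclose(col_mean, dense.mean(axis=0)))
```

Both checks should print True. From here, something like col_mean.argsort()[::-1][:5] would give the column indices of the five highest mean scores.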