
python - Creating a matrix of tf-idf values

Reposted. Author: 行者123. Updated: 2023-11-30 23:23:23

I have a set of documents, for example:

D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."

and a set of words, for example:

"sky","land","sea","water","sun","moon"

I want to create a matrix like this:

  x       D1        D2        D3
sky     tf-idf    0         tf-idf
land    0         0         0
sea     0         0         0
water   0         0         0
sun     0         tf-idf    tf-idf
moon    0         0         0

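It is similar to the example table given here: http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html . The linked example uses the words that occur in the documents themselves, but I need to use the set of words I listed above.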

If a particular word is present in a document, I put its tf-idf value in the matrix; otherwise I put 0.

Any idea how to build such a matrix? Python would be best, but R is also welcome.

I am using the following code, but I am not sure whether I am doing the right thing. My code is:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords


train_set = "The sky is blue.", "The sun is bright.", "The sun in the sky is bright." #Documents
test_set = ["sky","land","sea","water","sun","moon"] #Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
#print 'Fit Vectorizer to train set', trainVectorizerArray
#print 'Transform Vectorizer to test set', testVectorizerArray

transformer.fit(trainVectorizerArray)
#print
#print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)
#print
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())

I get rather absurd results like this (the values are only 0 and 1, whereas I expected values between 0 and 1):

[[ 0.  0.  1.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  0.  0.]
 [ 1.  0.  0.  0.]]
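
A possible explanation for the 0/1 pattern, offered as a sketch rather than a confirmed diagnosis: `vectorizer.transform(test_set)` treats each query word as its own one-word document, and `TfidfTransformer` L2-normalizes every row by default. A row with a single non-zero weight is therefore scaled to 1, whatever that weight was, and a row with no known word stays all zeros. The plain-Python snippet below illustrates only that normalization step; `l2_normalize` and the weight `1.69` are hypothetical stand-ins, not scikit-learn code or values.

```python
import math


def l2_normalize(row):
    """Scale a row to unit Euclidean length; leave all-zero rows untouched."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm == 0.0:
        return list(row)
    return [v / norm for v in row]


# A one-word "document": only one vocabulary column is non-zero.
# The weight 1.69 is an arbitrary illustrative value.
single_word_row = [0.0, 0.0, 1.69, 0.0]

# A query word that is not in the fitted vocabulary gives an all-zero row.
unknown_word_row = [0.0, 0.0, 0.0, 0.0]

print(l2_normalize(single_word_row))   # the non-zero entry becomes (about) 1
print(l2_normalize(unknown_word_row))  # stays all zeros
```

Under this reading, the rows of the printed matrix correspond to the query words and the columns to the vocabulary learned from `train_set`, which is the transpose of the layout asked for above.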

I am also open to other libraries for computing tf-idf. I just want a correct matrix like the one described above.

Best answer

An R solution could look like this:

library(tm)
docs <- c(D1 = "The sky is blue.",
          D2 = "The sun is bright.",
          D3 = "The sun in the sky is bright.")
dict <- c("sky","land","sea","water","sun","moon")
mat <- TermDocumentMatrix(Corpus(VectorSource(docs)),
                          control=list(weighting = weightTfIdf,
                                       dictionary = dict))
as.matrix(mat)[dict, ]
#        Docs
# Terms          D1        D2        D3
#   sky   0.5849625 0.0000000 0.2924813
#   land  0.0000000 0.0000000 0.0000000
#   sea   0.0000000 0.0000000 0.0000000
#   water 0.0000000 0.0000000 0.0000000
#   sun   0.0000000 0.5849625 0.2924813
#   moon  0.0000000 0.0000000 0.0000000
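
Since the question prefers Python, here is a minimal library-free sketch that reproduces the numbers in the output above. It rests on an assumption inferred from those numbers, not taken from the tm documentation: tf is a term's count divided by the number of dictionary terms in the document, and idf is log2(N / df). The helper names `tokenize` and `tfidf_matrix` are hypothetical.

```python
import math
import re

docs = {
    "D1": "The sky is blue.",
    "D2": "The sun is bright.",
    "D3": "The sun in the sky is bright.",
}
vocab = ["sky", "land", "sea", "water", "sun", "moon"]


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_matrix(docs, vocab):
    """Return {term: [weight per document]} restricted to a fixed vocabulary."""
    # Count only the dictionary terms in each document.
    counts = {}
    for name, text in docs.items():
        tokens = tokenize(text)
        counts[name] = {term: tokens.count(term) for term in vocab}

    n_docs = len(docs)
    matrix = {}
    for term in vocab:
        # Document frequency: number of documents containing the term.
        df = sum(1 for name in docs if counts[name][term] > 0)
        # Assumed idf: log2(N / df); terms that never occur get weight 0.
        idf = math.log2(n_docs / df) if df else 0.0
        row = []
        for name in docs:
            total = sum(counts[name].values())
            # Assumed tf: count normalized by dictionary terms in the document.
            tf = counts[name][term] / total if total else 0.0
            row.append(tf * idf)
        matrix[term] = row
    return matrix


matrix = tfidf_matrix(docs, vocab)
print("term   " + " ".join(f"{name:>9}" for name in docs))
for term in vocab:
    print(f"{term:<6} " + " ".join(f"{v:9.7f}" for v in matrix[term]))
```

Other libraries use different tf-idf variants (for example smoothing the idf or normalizing each document vector), so their values for the same documents will generally differ from these.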

Regarding "python - Creating a matrix of tf-idf values", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/23999170/
