python - How does vectorizer fit_transform work in sklearn?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:49:10
I am trying to understand the following code:

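from sklearn.feature_extraction.text import CountVectorizer

# Build a bag-of-words vectorizer with default settings
vectorizer = CountVectorizer()
corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']
# Learn the vocabulary and return the document-term count matrix
X = vectorizer.fit_transform(corpus)
print(X)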

When I print X to see what is returned, I get this result:

(0, 1) 1
(0, 2) 1
(0, 6) 1
(0, 3) 1
(0, 8) 1
(1, 5) 2
(1, 1) 1
(1, 6) 1
(1, 3) 1
(1, 8) 1
(2, 4) 1
(2, 7) 1
(2, 0) 1
(2, 6) 1
(3, 1) 1
(3, 2) 1
(3, 6) 1
(3, 3) 1
(3, 8) 1

However, I don't understand what this result means.

Best answer

As @Himanshu wrote, each line has the form "(sentence_index, feature_index) count".

Here, the count is the number of times a word appears in that document.

For example,

(0, 1) 1
(0, 2) 1
(0, 6) 1
(0, 3) 1
(0, 8) 1
(1, 5) 2   <- for this example only, the count "2" shows that the word "second" appears twice in this document/sentence
(1, 1) 1
(1, 6) 1
(1, 3) 1
(1, 8) 1
(2, 4) 1
(2, 7) 1
(2, 0) 1
(2, 6) 1
(3, 1) 1
(3, 2) 1
(3, 6) 1
(3, 3) 1
(3, 8) 1
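
To see which word each feature_index stands for, the entries can be decoded back into words. The following is only a minimal sketch, assuming the default CountVectorizer settings (lowercasing, tokens of two or more characters) and the original corpus from the question; the names index_to_word and decoded are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps each word to its column (feature) index;
# invert it so a feature index can be looked up as a word.
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

# Walk the non-zero entries of the sparse matrix in coordinate form.
coo = X.tocoo()
decoded = [
    (int(row), index_to_word[int(col)], int(count))
    for row, col, count in zip(coo.row, coo.col, coo.data)
]

for doc, word, count in sorted(decoded):
    print(doc, word, count)
```

With these settings the vocabulary is sorted alphabetically, so index 0 is "and" and index 5 is "second", which is why the entry (1, 5) 2 refers to "second" in the second sentence.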

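Let's change the corpus in the code. Basically, I added the word "second" two more times to the second sentence in the corpus list, so it now appears four times.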

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# Same corpus, but "second" now appears four times in the second sentence
corpus = ['This is the first document.',
          'This is the second second second second document.',
          'And the third one.',
          'Is this the first document?']
X = vectorizer.fit_transform(corpus)
print(X)

(0, 1) 1
(0, 2) 1
(0, 6) 1
(0, 3) 1
(0, 8) 1
(1, 5) 4   <- for the modified corpus, the count "4" shows that the word "second" appears four times in this document/sentence
(1, 1) 1
(1, 6) 1
(1, 3) 1
(1, 8) 1
(2, 4) 1
(2, 7) 1
(2, 0) 1
(2, 6) 1
(3, 1) 1
(3, 2) 1
(3, 6) 1
(3, 3) 1
(3, 8) 1
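
The same counts may be easier to read as a dense table, with one row per sentence and one column per word. This is a rough sketch using the modified corpus above; it assumes default CountVectorizer settings, and the variable names feature_names and dense are chosen here for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Order the words by their column index so they line up with the matrix columns.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Convert the sparse matrix to a regular 2-D array (fine for a tiny corpus).
dense = X.toarray()

print(feature_names)
for row in dense:
    print(row.tolist())
```

Each non-zero cell dense[i][j] corresponds to one "(i, j) count" line in the printed sparse output; the zeros are simply the entries the sparse format leaves out.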

Regarding "python - How does vectorizer fit_transform work in sklearn?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47898326/
