
python - TfidfVectorizer: printing results for all words

Reposted. Author: 行者123. Last updated: 2023-12-01 03:08:12

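The two sentences contain six distinct words, but the result only has five columns. How can I get a result based on all the words (a 6-column vector)?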

from sklearn.feature_extraction.text import TfidfVectorizer
sent=["This is a sample", "This is another example"]
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 0)
tfidf_matrix = tf.fit_transform(sent)
print(tfidf_matrix.toarray())

[[ 0.          0.          0.50154891  0.70490949  0.50154891]
 [ 0.57615236  0.57615236  0.40993715  0.          0.40993715]]

Also, how can I print the column details (the features, i.e. the words) and the rows (the documents)?

Best answer

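You are using the default token_pattern, which only selects tokens of 2 or more characters, so the single-character word "a" is dropped from the vocabulary.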

token_pattern :

Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
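To see the effect of that default outside the vectorizer, here is a minimal sketch using only the standard re module. It assumes the default pattern is r"(?u)\b\w\w+\b" and that the text is lowercased before tokenizing, as the vectorizer does by default; the variable names are illustrative only.

```python
import re

# Assumed default token_pattern of scikit-learn's text vectorizers:
# a word boundary, then at least two word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sentences = ["This is a sample", "This is another example"]

# Lowercase first (the vectorizer's default), then extract tokens.
tokens = [re.findall(DEFAULT_TOKEN_PATTERN, s.lower()) for s in sentences]

# The vocabulary is the sorted set of distinct tokens.
vocabulary = sorted({tok for doc in tokens for tok in doc})

print(tokens)
print(vocabulary)
```

The one-letter word "a" never matches the pattern, which would explain why only five columns appear in the matrix.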

If you define a new token_pattern that also accepts single-character tokens, you will get the "a" token as well, for example:

from sklearn.feature_extraction.text import TfidfVectorizer
sent=["This is a sample", "This is another example"]
tf = TfidfVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
tfidf_matrix = tf.fit_transform(sent)
print(tfidf_matrix.toarray())
print(tf.vocabulary_)

[[ 0.57615236  0.          0.          0.40993715  0.57615236  0.40993715]
 [ 0.          0.57615236  0.57615236  0.40993715  0.          0.40993715]]

tf.vocabulary_

{u'a': 0, u'sample': 4, u'another': 1, u'this': 5, u'is': 3, u'example': 2}

Regarding "python - TfidfVectorizer: printing results for all words", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43136202/
