
python - Get the full text from TfidfVectorizer

Reposted. Author: 行者123. Updated: 2023-12-01 03:07:07

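I am plotting a set of text documents in two dimensions and noticed a few outliers. I would like to find out which documents those outliers are. I start from the raw text and vectorize it with scikit-learn's built-in TfidfVectorizer.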

vectorizer = TfidfVectorizer(max_df=0.5, max_features=None,
                             min_df=2, stop_words='english',
                             use_idf=True, lowercase=True)

corpus = make_corpus(root)
X = vectorizer.fit_transform(corpus)

To reduce the data to two dimensions, I use TruncatedSVD.

reduced_data = TruncatedSVD(n_components=2).fit_transform(X)

If I want to find which text document has the highest value on the second principal component (the y-axis), how would I do that?

Best answer

As I understand it, you want to know which document maximizes a particular principal component. Here is a toy example I came up with:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np

corpus = [
    'this is my first corpus',
    'this is my second corpus which is longer than the first',
    'here is yet another one, but it is brief',
    'and watch out for number four chuggin along',
    'blah blah blah my final sentence yada yada yada'
]

vectorizer = TfidfVectorizer(stop_words='english',
                             use_idf=True, lowercase=True)

# first get TFIDF matrix
X = vectorizer.fit_transform(corpus)

# second compress to two dimensions
svd = TruncatedSVD(n_components=2).fit(X)
reduced = svd.transform(X)

# rows of `reduced` are in corpus order, so the argmax over column 1
# is the index of the document with the highest second component
corpus[np.argmax(reduced[:, 1])]

which yields:

'and watch out for number four chuggin along'
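
The same index alignment can be used to list several outliers at once instead of only the single maximum. The following is a minimal sketch, not part of the original answer: `top_documents` is a hypothetical helper, and the `docs` and `coords` values are made-up stand-ins for `corpus` and `reduced`, not real SVD output.

```python
import numpy as np


def top_documents(corpus, reduced, component=1, k=3):
    """Return (index, score, text) for the k highest scores on one component."""
    scores = np.asarray(reduced)[:, component]
    # argsort sorts ascending, so reverse it to rank from highest to lowest
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i]), corpus[i]) for i in order]


# Illustrative stand-in values (assumed, not produced by TruncatedSVD)
docs = ["doc a", "doc b", "doc c", "doc d"]
coords = np.array([[0.9, -0.2],
                   [0.8, 0.7],
                   [0.1, 0.05],
                   [0.3, 0.4]])

for idx, score, text in top_documents(docs, coords, component=1, k=2):
    print(idx, round(score, 2), text)
```

With the toy example above, the call would be `top_documents(corpus, reduced, component=1, k=3)`. Because the sign of an SVD component is arbitrary, ranking by `np.abs(scores)` instead may be a better fit when outliers in either direction are of interest.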

Regarding "python - Get the full text from TfidfVectorizer", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43263837/
