python - Why does TfidfVectorizer.fit_transform() change the number of samples and labels of my text data?

Reposted by: 行者123. Updated: 2023-12-01 07:22:56

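I have a dataset with 3 columns and 310 rows. All of the columns are text. The first column is the text a user typed into a query form, and the second column is a label (one of six) stating which query category the input belongs to.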

>>> data.shape
(310, 3)

Before running the data through the KMeans algorithm from sklearn.cluster, I preprocess it as follows:

v = TfidfVectorizer()
vectorized = v.fit_transform(data)

Now,

>>> vectorized.shape
(3, 4)

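From what I can see, I seem to have lost data. I no longer have 310 samples. I believe the shape of vectorized refers to [n_samples, n_features].

Why did the numbers of samples and features change? I expected 310 samples and 6 features (the number of unique groups in the label data).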

Best answer

The problem is that TfidfVectorizer() cannot be applied to three columns at once.

According to the documentation:

fit_transform(self, raw_documents, y=None)

Learn vocabulary and idf, return term-document matrix.

This is equivalent to fit followed by transform, but more efficiently implemented.

Parameters: raw_documents : iterable
an iterable which yields either str, unicode or file objects

Returns: X : sparse matrix, [n_samples, n_features]
Tf-idf-weighted document-term matrix.

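So it only works on a single column of text data. In your code, iterating over the DataFrame yielded its column names, so the vectorizer treated the three column names as three documents and built the transform from those.

An example to see what is happening: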

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

data = pd.DataFrame({'col1':['this is first sentence','this one is the second sentence'],
                    'col2':['this is first sentence','this one is the second sentence'],
                    'col3':['this is first sentence','this one is the second sentence'] })
vec = TfidfVectorizer()
vec.fit_transform(data).todense()

# 
# matrix([[1., 0., 0.],
#         [0., 1., 0.],
#         [0., 0., 1.]])

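# get_feature_names() was removed in scikit-learn 1.2; on current versions use get_feature_names_out()
vec.get_feature_names_out().tolist()

# ['col1', 'col2', 'col3']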
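
The result above follows from how pandas iterates. As a minimal sketch using the same toy DataFrame, the snippet below shows what the vectorizer actually receives when it loops over the whole frame versus a single column:

```python
import pandas as pd

data = pd.DataFrame({
    'col1': ['this is first sentence', 'this one is the second sentence'],
    'col2': ['this is first sentence', 'this one is the second sentence'],
    'col3': ['this is first sentence', 'this one is the second sentence'],
})

# Iterating over a DataFrame yields its column labels, not its rows,
# so these three strings are the "documents" the vectorizer sees.
docs_from_frame = list(data)
print(docs_from_frame)

# Iterating over a single column (a Series) yields the cell values,
# which is what fit_transform() expects as raw_documents.
docs_from_column = list(data['col1'])
print(docs_from_column)
```

This is why the number of rows in the output matched the number of columns in the input.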

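The solution is either to merge all three columns into one, or to apply a vectorizer to each column separately and concatenate the results at the end.

Method 1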

data.loc[:,'full_text'] = data.apply(lambda x: ' '.join(x), axis=1)
vec = TfidfVectorizer()
X = vec.fit_transform(data['full_text']).todense()
print(X.shape)
# (2, 7)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vec.get_feature_names_out().tolist())
# ['first', 'is', 'one', 'second', 'sentence', 'the', 'this']
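
If only one column holds the actual text (as in the question, where another column holds the labels), no merging is needed: pass just that column. The sketch below is an illustration under assumptions, not the asker's data. The column names `text` and `label` and the six rows are made up. It also shows that the number of features equals the vocabulary size, not the number of labels; the label count would instead be a natural choice for the number of KMeans clusters.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the question's data: one text column, one label column.
df = pd.DataFrame({
    'text': [
        'how do I reset my password',
        'password reset link not working',
        'what are your opening hours',
        'are you open on sunday',
        'where is my order',
        'my order has not arrived',
    ],
    'label': ['account', 'account', 'hours', 'hours', 'orders', 'orders'],
})

vec = TfidfVectorizer()
X = vec.fit_transform(df['text'])  # only the text column goes in
print(X.shape)                     # (number of rows, vocabulary size)

# The number of distinct labels is used as the cluster count, not as n_features.
n_clusters = df['label'].nunique()
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
clusters = km.fit_predict(X)
print(clusters)
```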

Method 2

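from scipy.sparse import hstack

vec = {}
X = []  # one sparse tf-idf matrix per column
for col in ['col1', 'col2', 'col3']:
    # fit a separate vectorizer per column so each keeps its own vocabulary
    vec[col] = TfidfVectorizer()
    X.append(vec[col].fit_transform(data[col]))

# stack the per-column matrices side by side (same rows, more features)
stacked_X = hstack(X).todense()
stacked_X.shape
# (2, 21)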

for col, v in vec.items():
    print(col)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    print(v.get_feature_names_out().tolist())

# col1
# ['first', 'is', 'one', 'second', 'sentence', 'the', 'this']
# col2
# ['first', 'is', 'one', 'second', 'sentence', 'the', 'this']
# col3
# ['first', 'is', 'one', 'second', 'sentence', 'the', 'this']
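
The per-column approach can also be written with scikit-learn's ColumnTransformer, which is a different tool from the manual loop in the answer but produces the same kind of side-by-side feature matrix. This is only a sketch on the same toy DataFrame. Note that each column is selected with a plain string rather than a list, so the vectorizer receives one-dimensional text data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

data = pd.DataFrame({
    'col1': ['this is first sentence', 'this one is the second sentence'],
    'col2': ['this is first sentence', 'this one is the second sentence'],
    'col3': ['this is first sentence', 'this one is the second sentence'],
})

columns = ['col1', 'col2', 'col3']

# One (name, transformer, column) triple per text column.
# A string selector hands the column to TfidfVectorizer as a 1-D sequence of documents.
ct = ColumnTransformer([(col, TfidfVectorizer(), col) for col in columns])

X = ct.fit_transform(data)
print(X.shape)  # (2, 21): 7 vocabulary terms per column, 3 columns
```

The fitted vectorizer for each column remains reachable through `ct.named_transformers_`, for example `ct.named_transformers_['col1']`.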

Regarding python - Why does TfidfVectorizer.fit_transform() change the number of samples and labels of my text data?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57599174/
