
python - Why is vectorizer.fit_transform(x).astype('bool') different from vectorizer.set_params(binary=True).fit_transform(x)?

Reposted. Author: 行者123. Last updated: 2023-12-05 00:46:54

Here is a minimal example of what I'm talking about:

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

data = fetch_20newsgroups()
x = data.data

vec = TfidfVectorizer(min_df=0.01, max_df=0.5)
mat = vec.fit_transform(x).astype('bool')

vec.set_params(binary=True)
print(np.array_equal(mat, vec.fit_transform(x)))

This prints False. What is the fundamental difference between setting binary=True and setting all nonzero values to True?

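Edit: As @juanpa.arrivillaga's answer explains, TfidfVectorizer(binary=True) still performs the inverse document frequency computation. However, I also noticed that CountVectorizer(binary=True) does not produce the same output as .astype('bool') either. Here is an example: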
In [1]: import numpy as np
...: from sklearn.datasets import fetch_20newsgroups
...: from sklearn.feature_extraction.text import CountVectorizer
...:
...: data = fetch_20newsgroups()
...: x = data.data
...:
...: vec = CountVectorizer(min_df=0.01, max_df=0.5)
...: a = vec.fit_transform(x).astype('bool')
...:
...: vec.set_params(binary=True)
...: b = vec.fit_transform(x).astype('bool')
...: print(np.array_equal(a, b))
...:
False

In [2]: a
Out[2]:
<11314x2141 sparse matrix of type '<class 'numpy.bool_'>'
with 950068 stored elements in Compressed Sparse Row format>

In [3]: b
Out[3]:
<11314x2141 sparse matrix of type '<class 'numpy.bool_'>'
with 950068 stored elements in Compressed Sparse Row format>

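The dimensions and dtype are the same, which leads me to believe that the contents of these matrices differ. Yet judging only from the output of print(a) and print(b), they look the same.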
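One thing worth checking in the CountVectorizer case, offered as an assumption rather than something the original post confirms: np.array_equal is written for dense array-like inputs, and np.asarray does not turn a SciPy sparse matrix into a dense array, so the False above may come from the comparison itself rather than from the data. A minimal sketch that compares the two results in ways that work for sparse matrices, using a tiny made-up corpus (docs is illustrative, not the 20 newsgroups data):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus (an assumption; the question uses 20 newsgroups).
docs = [
    "the quick brown fox jumped over the lazy dog",
    "how much wood could a woodchuck chuck if a woodchuck could chuck wood",
]

# Raw counts cast to bool vs. binary counts cast to bool.
a = CountVectorizer().fit_transform(docs).astype(bool)
b = CountVectorizer(binary=True).fit_transform(docs).astype(bool)

# Sparse-aware comparison: count the positions where the matrices differ.
n_diff = (a != b).nnz
same_sparse = n_diff == 0

# Dense comparison: fine for small matrices, memory-hungry for large ones.
same_dense = np.array_equal(a.toarray(), b.toarray())

print(same_sparse, same_dense)
```

If both checks report equality on the real data as well, the two CountVectorizer outputs hold the same values and only the comparison method was at fault.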

Best answer

You are fundamentally confusing two things.

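One is converting to the numpy bool dtype, which is equivalent to the Python data type that takes the two values True and False, except that it is represented as a single byte in the underlying raw array.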

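The other is passing the binary argument to TfidfVectorizer, which changes the way the data is modeled. In short, if you use binary=True, the term counts become binary, i.e. seen or not seen. Then the usual tf-idf transformation is applied. From the docs: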

If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set idf and normalization to False to get 0/1 outputs.)



So you don't even get a bool output.
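The parenthetical in the quoted docs can be tried out directly. The sketch below assumes that "idf and normalization" correspond to the use_idf and norm parameters of TfidfVectorizer, and it uses a small made-up corpus. With both switched off, the values should be only 0 and 1, although the dtype is still floating point rather than bool:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Small illustrative corpus (an assumption, not the question's data).
data = [
    "The quick brown fox jumped over the lazy dog",
    "how much wood could a woodchuck chuck if a woodchuck could chuck wood",
]

# binary=True makes the tf term 0/1; disabling idf weighting and
# normalization leaves those 0/1 values untouched.
vec = TfidfVectorizer(binary=True, use_idf=False, norm=None)
out = vec.fit_transform(data)
dense = out.toarray()

# Mask of non-zero entries from a default tf-idf run, for comparison.
mask = TfidfVectorizer().fit_transform(data).astype(bool).toarray()

only_zero_one = bool(np.isin(dense, [0.0, 1.0]).all())
matches_mask = np.array_equal(dense.astype(bool), mask)

print(out.dtype, only_zero_one, matches_mask)
```

Even in this configuration the result is a float matrix, so an explicit .astype(bool) is still needed if a boolean dtype is wanted.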

So consider:
In [10]: import numpy as np
...: from sklearn.feature_extraction.text import TfidfVectorizer
...:

In [11]: data = [
...: 'The quick brown fox jumped over the lazy dog',
...: 'how much wood could a woodchuck chuck if a woodchuck could chuck wood'
...: ]

In [12]: TfidfVectorizer().fit_transform(data).todense()
Out[12]:
matrix([[ 0.30151134, 0. , 0. , 0.30151134, 0.30151134,
0. , 0. , 0.30151134, 0.30151134, 0. ,
0.30151134, 0.30151134, 0.60302269, 0. , 0. ],
[ 0. , 0.45883147, 0.45883147, 0. , 0. ,
0.22941573, 0.22941573, 0. , 0. , 0.22941573,
0. , 0. , 0. , 0.45883147, 0.45883147]])

In [13]: TfidfVectorizer().fit_transform(data).todense().astype('bool')
Out[13]:
matrix([[ True, False, False, True, True, False, False, True, True,
False, True, True, True, False, False],
[False, True, True, False, False, True, True, False, False,
True, False, False, False, True, True]], dtype=bool)

Now note that using binary still returns a floating-point type:
In [14]: TfidfVectorizer(binary=True).fit_transform(data).todense()
Out[14]:
matrix([[ 0.35355339, 0. , 0. , 0.35355339, 0.35355339,
0. , 0. , 0.35355339, 0.35355339, 0. ,
0.35355339, 0.35355339, 0.35355339, 0. , 0. ],
[ 0. , 0.37796447, 0.37796447, 0. , 0. ,
0.37796447, 0.37796447, 0. , 0. , 0.37796447,
0. , 0. , 0. , 0.37796447, 0.37796447]])

It only changes the values in the result.
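The specific numbers in the binary=True output can be reproduced by hand for this corpus. The two sentences share no terms, so every term gets the same idf weight, and with the default L2 normalization (an assumption about the defaults, which the answer does not spell out) each non-zero entry should be 1 / sqrt(number of distinct terms in the document). A rough check:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Same two sentences as in the answer's example.
data = [
    "The quick brown fox jumped over the lazy dog",
    "how much wood could a woodchuck chuck if a woodchuck could chuck wood",
]

out = TfidfVectorizer(binary=True).fit_transform(data).toarray()

# Number of distinct vocabulary terms present in each document.
n_terms = (out > 0).sum(axis=1)

# With uniform idf and L2 normalization, each non-zero value is 1/sqrt(n).
expected = 1.0 / np.sqrt(n_terms)

rows_match = [
    bool(np.allclose(row[row > 0], value))
    for row, value in zip(out, expected)
]
print(n_terms, expected, rows_match)
```

This is why the first row shows 0.35355339 (eight distinct terms) and the second 0.37796447 (seven distinct terms, since the default tokenizer drops the one-letter word "a"). Without binary=True, repeated words such as "the" keep a count of 2, which is where the 0.60302269 entry in the earlier output comes from.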

Regarding "python - Why is vectorizer.fit_transform(x).astype('bool') different from vectorizer.set_params(binary=True).fit_transform(x)?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53145395/
