python - Why is vectorizer.fit_transform(x).astype('bool') different from vectorizer.set_params(binary=True).fit_transform(x)?

Reposted by: 行者123 | Updated: 2023-12-05 00:46:54

Here is a minimal example of what I'm talking about:

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

data = fetch_20newsgroups()
x = data.data

vec = TfidfVectorizer(min_df=0.01, max_df=0.5)
mat = vec.fit_transform(x).astype('bool')

vec.set_params(binary=True)
print(np.array_equal(mat, vec.fit_transform(x)))

This prints False. What is the fundamental difference between setting binary=True and setting all nonzero values to True?

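Edit: As @juanpa.arrivillaga's answer explains, TfidfVectorizer(binary=True) still performs the inverse document frequency calculation. However, I also noticed that CountVectorizer(binary=True) does not produce the same output as .astype('bool') either. Here is an example: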

In [1]: import numpy as np
   ...: from sklearn.datasets import fetch_20newsgroups
   ...: from sklearn.feature_extraction.text import CountVectorizer
   ...:
   ...: data = fetch_20newsgroups()
   ...: x = data.data
   ...:
   ...: vec = CountVectorizer(min_df=0.01, max_df=0.5)
   ...: a = vec.fit_transform(x).astype('bool')
   ...:
   ...: vec.set_params(binary=True)
   ...: b = vec.fit_transform(x).astype('bool')
   ...: print(np.array_equal(a, b))
   ...:
False

In [2]: a
Out[2]:
<11314x2141 sparse matrix of type '<class 'numpy.bool_'>'
        with 950068 stored elements in Compressed Sparse Row format>

In [3]: b
Out[3]:
<11314x2141 sparse matrix of type '<class 'numpy.bool_'>'
        with 950068 stored elements in Compressed Sparse Row format>

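The dimensions and dtype are the same, which leads me to believe that the contents of these matrices differ. Yet just from looking at the output of print(a) and print(b), they appear identical.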
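A side note on the check itself: np.array_equal is written for dense arrays, and both a and b here are SciPy sparse matrices, so a False from it may not mean the contents differ. The sketch below is one way to compare two sparse matrices directly. It assumes a tiny made-up corpus (docs) instead of the 20 newsgroups data, and the variable names same_shape and n_differences are introduced here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus, used only to keep the example small.
docs = [
    "the quick brown fox jumped over the lazy dog",
    "how much wood could a woodchuck chuck if a woodchuck could chuck wood",
    "the dog could chuck wood",
]

a = CountVectorizer().fit_transform(docs).astype(bool)
b = CountVectorizer(binary=True).fit_transform(docs).astype(bool)

same_shape = a.shape == b.shape
# (a != b) is itself a sparse matrix that stores an entry only where a and b differ,
# so zero stored entries means the two matrices hold the same values.
n_differences = (a != b).nnz if same_shape else None
print(same_shape, n_differences)

# For small matrices, converting to dense arrays first also makes np.array_equal usable.
print(np.array_equal(a.toarray(), b.toarray()))
```

On the full data set the dense conversion may be too memory-hungry, which is why the sparse comparison is shown first.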

Best answer

You are fundamentally confusing two things.

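One is casting to the numpy bool dtype, which is equivalent to the Python data type that takes the two values True and False, except that it is represented as a single byte in the underlying primitive array.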

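The other is passing the binary argument to TfidfVectorizer, which changes the way the data is modeled. In short, if you use binary=True, the term counts become binary, i.e. seen or not seen. The usual tf-idf transformation is then applied on top of that. From the docs: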

If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set idf and normalization to False to get 0/1 outputs.)

So you don't even get a bool output.
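To make the quoted docs concrete, here is a minimal sketch of what "set idf and normalization to False" could look like in practice. It assumes the relevant TfidfVectorizer parameters are use_idf=False and norm=None, and the names weighted and indicators are introduced here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

data = [
    "The quick brown fox jumped over the lazy dog",
    "how much wood could a woodchuck chuck if a woodchuck could chuck wood",
]

# binary=True alone: the tf term is 0/1, but idf weighting and L2 normalization still apply.
weighted = TfidfVectorizer(binary=True).fit_transform(data).toarray()

# Also switching off idf and normalization leaves plain 0/1 values (still a float dtype).
indicators = TfidfVectorizer(
    binary=True, use_idf=False, norm=None
).fit_transform(data).toarray()

print(np.unique(weighted))
print(np.unique(indicators), indicators.dtype)

# Both matrices are nonzero in exactly the same positions.
print(np.array_equal(weighted.astype(bool), indicators.astype(bool)))
```

Even in the second case the result is a float matrix of zeros and ones, not a bool matrix, so an explicit .astype(bool) is still needed if a boolean dtype is what you are after.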

So consider:

In [10]: import numpy as np
    ...: from sklearn.feature_extraction.text import TfidfVectorizer
    ...:

In [11]: data = [
    ...:     'The quick brown fox jumped over the lazy dog',
    ...:     'how much wood could a woodchuck chuck if a woodchuck could chuck wood'
    ...: ]

In [12]: TfidfVectorizer().fit_transform(data).todense()
Out[12]:
matrix([[ 0.30151134,  0.        ,  0.        ,  0.30151134,  0.30151134,
          0.        ,  0.        ,  0.30151134,  0.30151134,  0.        ,
          0.30151134,  0.30151134,  0.60302269,  0.        ,  0.        ],
        [ 0.        ,  0.45883147,  0.45883147,  0.        ,  0.        ,
          0.22941573,  0.22941573,  0.        ,  0.        ,  0.22941573,
          0.        ,  0.        ,  0.        ,  0.45883147,  0.45883147]])

In [13]: TfidfVectorizer().fit_transform(data).todense().astype('bool')
Out[13]:
matrix([[ True, False, False,  True,  True, False, False,  True,  True,
         False,  True,  True,  True, False, False],
        [False,  True,  True, False, False,  True,  True, False, False,
          True, False, False, False,  True,  True]], dtype=bool)

Now note that using binary still returns a floating-point type:

In [14]: TfidfVectorizer(binary=True).fit_transform(data).todense()
Out[14]:
matrix([[ 0.35355339,  0.        ,  0.        ,  0.35355339,  0.35355339,
          0.        ,  0.        ,  0.35355339,  0.35355339,  0.        ,
          0.35355339,  0.35355339,  0.35355339,  0.        ,  0.        ],
        [ 0.        ,  0.37796447,  0.37796447,  0.        ,  0.        ,
          0.37796447,  0.37796447,  0.        ,  0.        ,  0.37796447,
          0.        ,  0.        ,  0.        ,  0.37796447,  0.37796447]])

It only changes the resulting values.
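The numbers in the two outputs above can be reproduced by hand, which may make the difference more concrete. The sketch below assumes scikit-learn's default L2 row normalization and relies on the fact that the two example sentences share no terms, so every term ends up with the same idf and that factor cancels when each row is normalized. The count matrix is typed in manually, and l2_normalize is a helper introduced here for illustration.

```python
import numpy as np

# Raw term counts per document, columns in alphabetical vocabulary order:
# brown, chuck, could, dog, fox, how, if, jumped, lazy, much, over, quick,
# the, wood, woodchuck
# (the single-letter token "a" is dropped by the default tokenizer pattern)
counts = np.array([
    [1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 2, 0, 0],
    [0, 2, 2, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 2, 2],
], dtype=float)


def l2_normalize(rows):
    """Scale each row to unit Euclidean length."""
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)


# Default behaviour: repeated terms such as "the" or "wood" weigh more.
tfidf_like = l2_normalize(counts)

# binary=True: every present term counts once before normalization,
# so all nonzero entries in a row become equal.
binary_tfidf_like = l2_normalize((counts > 0).astype(float))

print(np.round(tfidf_like, 8))
print(np.round(binary_tfidf_like, 8))
```

In the first document "the" occurs twice, so without binary it gets 2/sqrt(11) while the other seven terms get 1/sqrt(11). With binary every one of the eight terms gets 1/sqrt(8). Casting either matrix to bool discards exactly this weighting information.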

This question, "python - Why is vectorizer.fit_transform(x).astype('bool') different from vectorizer.set_params(binary=True).fit_transform(x)?", comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/53145395/
