python - Using CountVectorizer in Python to count occurrences of terms from my own vocabulary


Reposted by: 行者123 · Updated: 2023-12-04 01:57:27

Doc1: ['And that was the fallacy. Once I was free to talk with staff members']

Doc2: ['In the new, stripped-down, every-job-counts business climate, these human']

Doc3: ['Another reality makes emotional intelligence ever more crucial']

Doc4: ['The globalization of the workforce puts a particular premium on emotional']

Doc5: ['As business changes, so do the traits needed to excel. Data tracking']

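Here is a sample of my vocabulary:

my_vocabulary = ['was the fallacy', 'free to', 'stripped-down', 'ever more', 'of the workforce', 'the traits needed']

The point is that every term in my vocabulary is a bigram or a trigram. My vocabulary contains all possible bigrams and trigrams in my document set; I have only given you a sample here. This is how the vocabulary has to look for my application. I am trying to use CountVectorizer as follows: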

from sklearn.feature_extraction.text import CountVectorizer
doc_set = [Doc1, Doc2, Doc3, Doc4, Doc5]
vectorizer = CountVectorizer( vocabulary=my_vocabulary)
tf = vectorizer.fit_transform(doc_set)

I expected to get something like this:

print(tf)
(0, 126)    1
(0, 6804)   1
(0, 5619)   1
(0, 5019)   2
(0, 5012)   1
(0, 999)    1
(0, 996)    1
(0, 4756)   4

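where the first column is the document ID, the second column is the ID of the term in the vocabulary, and the third column is the number of times that term occurs in that document. But tf is empty. I know that, at the end of the day, I could write code that loops over every term in the vocabulary, counts the occurrences, and builds the matrix, but can I use CountVectorizer with the input I have and save time? Am I doing something wrong here? If CountVectorizer is not the right approach, any suggestion would be appreciated.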

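Best answer

You can build a vocabulary of all possible bigrams and trigrams by specifying the ngram_range parameter of CountVectorizer. After fit_transform, you can inspect the vocabulary and the frequencies with the get_feature_names_out() and toarray() methods (get_feature_names() in scikit-learn releases before 1.0). The latter returns a frequency matrix with one row per document. More information: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction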

from sklearn.feature_extraction.text import CountVectorizer

Doc1 = 'And that was the fallacy. Once I was free to talk with staff members'
Doc2 = 'In the new, stripped-down, every-job-counts business climate, these human'
Doc3 = 'Another reality makes emotional intelligence ever more crucial'
Doc4 = 'The globalization of the workforce puts a particular premium on emotional'
Doc5 = 'As business changes, so do the traits needed to excel. Data tracking'
doc_set = [Doc1, Doc2, Doc3, Doc4, Doc5]

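# ngram_range=(2, 3) extracts every bigram and trigram from the documents
vectorizer = CountVectorizer(ngram_range=(2, 3))
tf = vectorizer.fit_transform(doc_set)
vectorizer.vocabulary_               # dict mapping each n-gram to its column index
vectorizer.get_feature_names_out()   # n-grams ordered by column index (get_feature_names() before scikit-learn 1.0)
tf.toarray()                         # dense count matrix, one row per document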

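As for what you were trying to do, it works if you fit the CountVectorizer on the vocabulary and then transform the documents. Note that with ngram_range=(2, 3), fitting on a trigram also adds the bigrams it contains, which is why 'was the' and 'the fallacy' appear in the vocabulary below.

my_vocabulary = ['was the fallacy', 'more crucial', 'particular premium', 'to excel', 'data tracking', 'another reality']

vectorizer = CountVectorizer(ngram_range=(2, 3))
# Learn the n-grams from the vocabulary entries, then count them in the documents
vectorizer.fit(my_vocabulary)
tf = vectorizer.transform(doc_set)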

vectorizer.vocabulary_
Out[26]: 
{'another reality': 0,
 'data tracking': 1,
 'more crucial': 2,
 'particular premium': 3,
 'the fallacy': 4,
 'to excel': 5,
 'was the': 6,
 'was the fallacy': 7}

tf.toarray()
Out[25]: 
array([[0, 0, 0, 0, 1, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0]], dtype=int64)
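
An alternative that keeps exactly the terms listed, with no extra sub-n-grams, is to pass both vocabulary and ngram_range to CountVectorizer. This is a sketch, not part of the original answer, and it rests on two assumptions about the default settings: documents are lowercased, and the default token pattern splits on punctuation, so the question's 'stripped-down' entry is written here as 'stripped down'. The original attempt most likely returned all zeros because the default ngram_range=(1, 1) produces only single words, which can never equal a multi-word vocabulary entry.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc_set = [
    'And that was the fallacy. Once I was free to talk with staff members',
    'In the new, stripped-down, every-job-counts business climate, these human',
    'Another reality makes emotional intelligence ever more crucial',
    'The globalization of the workforce puts a particular premium on emotional',
    'As business changes, so do the traits needed to excel. Data tracking',
]

# Entries are lowercase and use spaces where the tokenizer would split
my_vocabulary = [
    'was the fallacy', 'free to', 'stripped down',
    'ever more', 'of the workforce', 'the traits needed',
]

# ngram_range must cover the lengths of the vocabulary entries (2 and 3 words)
vectorizer = CountVectorizer(vocabulary=my_vocabulary, ngram_range=(2, 3))
tf = vectorizer.fit_transform(doc_set)

# Columns follow the order of my_vocabulary
print(tf.toarray())
```

With a fixed vocabulary the column order follows the list order, so column 0 is 'was the fallacy', column 1 is 'free to', and so on.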

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/49618950/
