python-2.7 - Reading documents from files when using sklearn.feature_extraction.text CountVectorizer


Reposted by: 行者123 Updated: 2023-12-02 04:47:04


I can use the code from the documentation example, where the input to the fit_transform() function is a list of sentences, i.e.:

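from sklearn.feature_extraction.text import CountVectorizer

# Default settings: each item of the input is treated as raw text.
vectorizer = CountVectorizer()

corpus = [
    'this is the first document',
    'this is the second second document',
    'and the third one',
    'is this the first document?'
]

X = vectorizer.fit_transform(corpus)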

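and I get the expected data. But when I try to replace the corpus with a list of files or file objects, which the documentation suggests is allowed: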

"fit(raw_documents, y=None)

Learn a vocabulary dictionary of all tokens in the raw documents.
Parameters:
raw_documents : iterable
    An iterable which yields either str, unicode or file objects.
Returns:
self
"

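... so I think I am missing something in my understanding of the pipeline. Given a directory of files that I want to CountVectorize, how do I do that? If I try to supply a list of file objects, as in [open(file,'r')], the error message I get says that the file object has no lower function.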

Best answer

Set the vectorizer's input constructor parameter to filename or file. Its default value is content, which assumes you have already read the files into memory.
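
As a minimal sketch of what the answer describes, the snippet below vectorizes a directory of text files in both modes. The temporary directory, the file names doc0.txt and doc1.txt, and their contents are hypothetical and exist only to make the example self-contained; in practice you would point it at your own directory.

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample directory, created only so the sketch runs on its own.
tmpdir = tempfile.mkdtemp()
sample_texts = ["this is the first document", "and the third one"]
for i, text in enumerate(sample_texts):
    with open(os.path.join(tmpdir, "doc%d.txt" % i), "w") as fh:
        fh.write(text)

# Collect the paths of every file in the directory, in a stable order.
paths = sorted(os.path.join(tmpdir, name) for name in os.listdir(tmpdir))

# input='filename': pass paths; the vectorizer opens and reads each file.
vec_by_name = CountVectorizer(input="filename")
X_by_name = vec_by_name.fit_transform(paths)

# input='file': pass open file objects; the vectorizer calls read() on each.
handles = [open(p, "rb") for p in paths]
try:
    vec_by_file = CountVectorizer(input="file")
    X_by_file = vec_by_file.fit_transform(handles)
finally:
    for fh in handles:
        fh.close()

print(X_by_name.shape)
```

With input='filename' the vectorizer handles opening and closing the files itself, which is usually the simpler choice for a whole directory. With input='file' the caller stays responsible for closing the handles, as the try/finally above does.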

Regarding python-2.7 - Reading documents from files when using sklearn.feature_extraction.text CountVectorizer, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/19592892/
