python - How do I load files with scikit-learn and process .txt files?

Reposted by: 行者123 · Updated: 2023-11-30 09:56:18

Suppose I have a folder on my desktop containing several .txt files. They look like this.

File_1:

('this', 'is'), ('a', 'very'),....., ('large', '.txt'), ('file', 'with'), ('lots', 'of'), ('words', 'like'), ('this', 'i'), ('would', 'like'), ('to', 'create'), ('a', 'matrix'),'LABEL_1'

...

File_N:

('this', 'is'), ('a', 'another'),....., ('large', '.txt'), ('file', 'with'), ('lots', 'of'), ('words', 'like'), ('this', 'i'), ('would', 'like'), ('to', 'create'), ('a', 'matrix'),'LABEL_N'

According to the documentation, scikit-learn provides load_files, and I can vectorize with the hashing trick like this:

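from sklearn.feature_extraction import FeatureHasher
from sklearn.svm import SVC

# Each sample is a list of word pairs followed by its label.
training_data = [[('string1', 'string2'), ('string3', 'string4'),
                  ('string5', 'string6'), 'POS'],
                 [('string1', 'string2'), ('string3', 'string4'), 'NEG']]

feature_hasher_vect = FeatureHasher(input_type='string')

# Join the elements of each sample with spaces and hash the resulting tokens.
# Note: the trailing label string is joined as well, e.g. 'POS' becomes 'P O S'.
X = feature_hasher_vect.transform((' '.join(x) for x in sample)
                                  for sample in training_data)

print(X.toarray())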

Output:

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
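The array prints as all zeros only because FeatureHasher hashes into 2**20 columns by default, so almost every entry is zero and the few non-zero ones are hidden behind the ellipsis. A minimal sketch for checking that the features are present, reusing the training_data above; the `tokens` and `hasher` names are illustrative, and dropping the trailing label before hashing is an assumption about what is wanted:

```python
from sklearn.feature_extraction import FeatureHasher

training_data = [[('string1', 'string2'), ('string3', 'string4'),
                  ('string5', 'string6'), 'POS'],
                 [('string1', 'string2'), ('string3', 'string4'), 'NEG']]

# Drop the trailing label and join each pair into one token string.
tokens = [[' '.join(pair) for pair in sample[:-1]]
          for sample in training_data]

hasher = FeatureHasher(input_type='string')
X = hasher.transform(tokens)  # SciPy sparse matrix

print(X.shape)  # one row per sample, 2**20 hashed columns
print(X.nnz)    # number of stored entries in the sparse matrix
```

Inspecting `X.shape` and `X.nnz` avoids materialising the dense array that `toarray()` would build.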

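How can I use load_files(), or any other method, to vectorize an entire folder of .txt files (applying the same process as above)?

Best answer

I'm not familiar with scikit-learn, and it may offer something better, but if the files are in the format shown, you can do what you describe with something relatively simple, such as the following function: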

import ast
import glob
import os

def my_load_files(folder, pattern):
    """Yield the parsed contents of every file in folder matching pattern."""
    pathname = os.path.join(folder, pattern)
    for filename in glob.glob(pathname):
        with open(filename) as file:
            # Each file holds a Python literal: a tuple of pairs plus a label.
            yield ast.literal_eval(file.read())

text_folder = 'C:/Users/username/Desktop/Samples'
print([[' '.join(x) for x in sample]
       for sample in my_load_files(text_folder, 'File_*')])

Note: since each file (and your training_data) ends with a label, you may want to use the following instead, which excludes the label from what is passed to the feature_hasher_vect.transform() method:

print([[' '.join(x) for x in sample[:-1]]
       for sample in my_load_files(text_folder, 'File_*')])
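Rather than discarding the label, it can be collected separately, since a classifier needs it as the target. A minimal sketch, assuming every file holds one Python-literal tuple whose last element is the label; `load_samples`, the temporary folder, and the two sample files are hypothetical and exist only to make the example self-contained:

```python
import ast
import glob
import os
import tempfile


def load_samples(folder, pattern):
    """Return (token_lists, labels) parsed from the files matching pattern."""
    token_lists, labels = [], []
    # sorted() makes the file order, and so the row order, reproducible.
    for filename in sorted(glob.glob(os.path.join(folder, pattern))):
        with open(filename) as f:
            sample = ast.literal_eval(f.read())
        *pairs, label = sample  # the last element is the label
        token_lists.append([' '.join(pair) for pair in pairs])
        labels.append(label)
    return token_lists, labels


# Hypothetical sample files written to a temporary folder for the demo.
with tempfile.TemporaryDirectory() as folder:
    contents = {
        'File_1': "('this', 'is'), ('a', 'very'), 'LABEL_1'",
        'File_2': "('this', 'is'), ('a', 'another'), 'LABEL_2'",
    }
    for name, text in contents.items():
        with open(os.path.join(folder, name), 'w') as f:
            f.write(text)

    samples, labels = load_samples(folder, 'File_*')

print(samples)  # [['this is', 'a very'], ['this is', 'a another']]
print(labels)   # ['LABEL_1', 'LABEL_2']
```

Here `samples` is in the shape that feature_hasher_vect.transform() accepts, and `labels` lines up with its rows for use as the training target.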

Regarding "python - How do I load files with scikit-learn and process .txt files?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27576462/
