
python - TfidfVectorizer on a large corpus with a generator

Reposted. Author: 行者123. Updated: 2023-12-03 21:22:24

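I have a large corpus split into 5K files, and I am trying to build an IDF-based vocabulary using a TF-IDF transform.

Here is the code. Basically, I have an iterator that loops over a directory of .tsv files, reads each file, and yields its contents.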

import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import pandas as pd
import numpy as np
import os
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def make_corpus():
    inputFeatureFiles = [x for x in os.listdir('C:\Folder') if x.endswith("*.tsv")]
    for file in inputFeatureFiles:
        filePath = 'C:\\' + os.path.splitext(file)[0] + ".tsv"
        with open(filePath, 'rb') as infile:
            content = infile.read()
            yield content

corpus = make_corpus()
vectorizer = TfidfVectorizer(stop_words='english', use_idf=True, max_df=0.7, smooth_idf=True)

vectorizer.fit_transform(corpus)

This produces the following error:
c:\python27\lib\site-packages\sklearn\feature_extraction\text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
809 vocabulary = dict(vocabulary)
810 if not vocabulary:
--> 811 raise ValueError("empty vocabulary; perhaps the documents only"
812 " contain stop words")
813

ValueError: empty vocabulary; perhaps the documents only contain stop words

I also tried this:
corpusGenerator = [open(os.path.join('C:\\CorpusFiles', f)) for f in os.listdir('C:\\CorpusFiles')]
vectorizer = TfidfVectorizer(stop_words='english', use_idf=True, smooth_idf=True, sublinear_tf=True, input="file", min_df=1)
feat = vectorizer.fit_transform(corpusGenerator)

and got the following error:
[Errno 24] Too many open files: 'C:\CorpusFiles\file1.tsv'

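What is the best way to use TfidfVectorizer on a large corpus? I also tried appending a constant string to each yielded string to avoid the first error, but that did not fix it either. Any help is appreciated!

Best answer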

Hi, I have been working on the same problem recently. In my experience, you could try the following demo code:

import glob
from sklearn.feature_extraction.text import TfidfVectorizer

# glob.glob expects a pattern, e.g. a directory path followed by /*.tsv
all_files_path = glob.glob(path_to_the_dir_of_your_data_files)

def fit_iterator():
    for file_path in all_files_path:
        with open(file_path, "r", encoding="utf-8") as file:
            for line in file:
                # Make sure each line is an instance of str
                # representing a single sample.
                yield line

corpus = fit_iterator()
tfidf = TfidfVectorizer()
tfidf.fit(corpus)

Good luck!
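
A possible reading of the two errors in the question, offered as a sketch rather than a confirmed diagnosis: `str.endswith("*.tsv")` compares literal characters and does not expand the wildcard, so the file list in the question is most likely empty. The generator then yields nothing, which would explain the "empty vocabulary" message. The second attempt opens every file before fitting, which is consistent with "Too many open files". The example below builds a small throwaway corpus (the file names, contents, and the `iter_documents` helper are made up for illustration) and shows two ways to keep only one file open at a time: a generator that yields one file's text per document, and scikit-learn's `input="filename"` option, which takes paths and reads each file itself.

```python
import glob
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer


def iter_documents(paths):
    """Yield the text of one file at a time, so only one file is open at once."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as handle:
            yield handle.read()


# Hypothetical demo corpus: three tiny .tsv files in a temporary directory.
demo_dir = tempfile.mkdtemp()
samples = {
    "a.tsv": "id\ttext\n1\tlarge corpus vectorizer\n",
    "b.tsv": "id\ttext\n2\tgenerator streams documents\n",
    "c.tsv": "id\ttext\n3\tcorpus of documents\n",
}
for name, text in samples.items():
    with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
        handle.write(text)

# endswith() treats "*" as a literal character, so this list stays empty.
literal_matches = [n for n in os.listdir(demo_dir) if n.endswith("*.tsv")]

# glob expands the wildcard; n.endswith(".tsv") would also work.
paths = sorted(glob.glob(os.path.join(demo_dir, "*.tsv")))

# Option 1: stream file contents through a generator (one document per file).
streamed = TfidfVectorizer(stop_words="english")
streamed_matrix = streamed.fit_transform(iter_documents(paths))

# Option 2: pass the paths and let scikit-learn open and read each file.
by_name = TfidfVectorizer(stop_words="english", input="filename")
by_name_matrix = by_name.fit_transform(paths)

print(literal_matches)
print(streamed_matrix.shape, by_name_matrix.shape)
```

Note that a generator can be consumed only once. If you call `fit` and later `transform` separately, create a fresh generator for the second pass, or use `fit_transform` to do both in a single pass. Whether a document should be a whole file (as here) or a single line (as in the answer above) depends on how your .tsv files are laid out.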

About "python - TfidfVectorizer on a large corpus with a generator": a similar question was found on Stack Overflow: https://stackoverflow.com/questions/50750052/
