
python - sklearn : How to speed up a vectorizer (eg Tfidfvectorizer)

Reposted · Author: 太空狗 · Updated: 2023-10-29 20:42:13

After thoroughly profiling my program, I was able to determine that it is being slowed down by the vectorizer.
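For context, this kind of hot-spot hunting can be done with the standard-library cProfile module; a minimal sketch, where the hypothetical slow_step function stands in for the expensive vectorization call:

```python
import cProfile
import io
import pstats

def slow_step():
    # Hypothetical stand-in for the expensive tfidf.fit_transform call.
    return sum(i * i for i in range(200000))

pr = cProfile.Profile()
pr.enable()
slow_step()
pr.disable()

# Print the five most expensive entries sorted by cumulative time;
# the function dominating the report is the one to optimize.
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```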

I am working with text data, and two lines of simple tfidf unigram vectorization account for 99.2% of the total execution time.

Here is a runnable example (it downloads a 3 MB training file to your disk; omit the urllib part to run it on your own sample):

#####################################
# Loading Data
#####################################
import urllib
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk.stem
raw = urllib.urlopen("https://s3.amazonaws.com/hr-testcases/597/assets/trainingdata.txt").read()
open("to_delete.txt", "w").write(raw)
###
def extract_training():
    f = open("to_delete.txt")
    N = int(f.readline())
    X = []
    y = []
    for i in xrange(N):
        line = f.readline()
        label, text = int(line[0]), line[2:]
        X.append(text)
        y.append(label)
    return X, y

X_train, y_train = extract_training()
#############################################
# Extending Tfidf to have only stemmed features
#############################################
english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
#############################################
# Line below takes 6-7 seconds on my machine
#############################################
Xv = tfidf.fit_transform(X_train)

I tried converting the list X_train to an np.array, but it made no difference in performance.

Best Answer

Unsurprisingly, it is NLTK that is slow:

>>> tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 4.89 s per loop
>>> tfidf = TfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 415 ms per loop
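Outside an IPython session, the same best-of-N comparison can be reproduced with the standard-library timeit module; a sketch, with a hypothetical vectorize_stub in place of tfidf.fit_transform(X_train):

```python
import timeit

def vectorize_stub():
    # Hypothetical stand-in for tfidf.fit_transform(X_train).
    return sum(range(10000))

# repeat=3 mirrors %timeit's "best of 3"; take the fastest run,
# since it is least disturbed by other processes.
best = min(timeit.repeat(vectorize_stub, number=100, repeat=3))
print("best of 3: %.4f s per 100 loops" % best)
```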

You can speed this up by using a smarter implementation of the Snowball stemmer, for example PyStemmer:

>>> import Stemmer
>>> english_stemmer = Stemmer.Stemmer('en')
>>> class StemmedTfidfVectorizer(TfidfVectorizer):
...     def build_analyzer(self):
...         analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
...         return lambda doc: english_stemmer.stemWords(analyzer(doc))
...
>>> tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 650 ms per loop

NLTK is a teaching toolkit. It is slow by design, because it is optimized for readability.
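A further speedup, not part of the accepted answer, is to memoize the stemmer: real corpora repeat words heavily, so caching stems skips most of the stemming calls. A sketch with a hypothetical slow_stem stand-in (NLTK's or PyStemmer's stemmer would take its place):

```python
def slow_stem(word):
    # Hypothetical stand-in for english_stemmer.stem(word);
    # a trivial suffix-stripper keeps the sketch dependency-free.
    return word[:-1] if word.endswith("s") else word

stem_cache = {}

def cached_stem(word):
    # Each distinct word is stemmed only once; repeats hit the cache.
    if word not in stem_cache:
        stem_cache[word] = slow_stem(word)
    return stem_cache[word]

tokens = ["cats", "cats", "dogs", "cat"]
stems = [cached_stem(t) for t in tokens]
```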

Regarding python - sklearn: How to speed up a vectorizer (eg Tfidfvectorizer), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/26195699/
