
python - How to pickle objects larger than 2 GiB by splitting them into smaller pieces


I have a classifier object larger than 2 GiB that I want to pickle, but I get this:

cPickle.dump(clf, fo, protocol = cPickle.HIGHEST_PROTOCOL)

OverflowError: cannot serialize a string larger than 2 GiB

I found this question describing the same problem, where the suggestions were:

  1. Use Python 3 pickle protocol 4 - not acceptable, because I need to use Python 2
  2. Use from pyocser import ocdumps, ocloads - not acceptable, because I can't use other (non-trivial) modules
  3. Split the object into bytes and pickle each piece

Is there a way to do this with my classifier? i.e. convert it to bytes, split it, pickle the pieces, then later unpickle them, concatenate the bytes, and use the classifier?
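A minimal sketch of that byte-splitting idea (the helper names and chunk size are hypothetical). Note the caveat: with Python 2's cPickle, dumps itself can raise the same OverflowError for this object, so this only illustrates the split/reassemble mechanics once a serialized byte string can be produced at all:

import cPickle

CHUNK_SIZE = 512 * 1024 * 1024  # 512 MiB per piece (arbitrary choice)

def dump_in_pieces(obj, path_prefix):
    # cPickle.dumps may hit the same 2 GiB limit for this object;
    # the chunking below only applies once the byte string exists
    blob = cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL)
    n_pieces = 0
    for i in xrange(0, len(blob), CHUNK_SIZE):
        with open('{}.part{}'.format(path_prefix, n_pieces), 'wb') as fo:
            fo.write(blob[i:i + CHUNK_SIZE])
        n_pieces += 1
    return n_pieces

def load_from_pieces(path_prefix, n_pieces):
    # read the pieces back in order, join them, and unpickle
    parts = []
    for k in range(n_pieces):
        with open('{}.part{}'.format(path_prefix, k), 'rb') as fi:
            parts.append(fi.read())
    return cPickle.loads(''.join(parts))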


My code:

import time
import cPickle
from sklearn.svm import SVC

def train_clf(X, y, clf_name):
    start_time = time.time()
    # after many tests, this was found to be the best classifier
    clf = SVC(C=0.01, kernel='poly')
    clf.fit(X, y)
    print 'fit done... {} seconds'.format(time.time() - start_time)
    with open(clf_name, "wb") as fo:
        cPickle.dump(clf, fo, protocol=cPickle.HIGHEST_PROTOCOL)
        # cPickle.HIGHEST_PROTOCOL == 2
        # the error occurs inside the dump method
    return time.time() - start_time

After that, I want to unpickle it and use it:

with open(clf_name, 'rb') as fo:
    clf, load_time = cPickle.load(fo), time.time()

Best Answer

If the model is large, you can use sklearn.externals.joblib, which automatically splits the model file into pickled numpy array files:

from sklearn.externals import joblib
joblib.dump(clf, 'filename.pkl')

Update: sklearn will show

DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.

So use this instead:

import joblib
joblib.dump(clf, 'filename.pkl')

It can be unpickled later with:

clf = joblib.load('filename.pkl') 
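A short usage sketch, assuming the standalone joblib package: dump accepts an optional compress level and returns the list of file names it wrote (recent joblib versions keep everything in a single file by default), and load can memory-map the stored numpy arrays of an uncompressed dump instead of reading them fully into RAM:

import joblib

# compress trades file size for dump/load speed; dump returns the list
# of files it wrote
files = joblib.dump(clf, 'filename.pkl', compress=3)
print(files)

# for an uncompressed dump, the stored numpy arrays can be memory-mapped
# on load instead of being read fully into RAM
joblib.dump(clf, 'filename_raw.pkl')
clf = joblib.load('filename_raw.pkl', mmap_mode='r')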

Regarding "python - How to pickle objects larger than 2 GiB by splitting them into smaller pieces", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48074419/
