
python - scikit-learn joblib bug: multiprocessing pool self.value out of range for 'i' format code, only with large numpy arrays

Reposted · Author: 太空狗 · Updated: 2023-10-30 02:20:16

My code runs fine on smaller test samples, e.g. 10,000 rows of data in X_train and y_train. When I call it for millions of rows, I get the error below. Is this a bug in one of the packages, or is there something I can do differently? I am using Python 2.7.7 from Anaconda 2.0.1, and I have put the pool.py from Anaconda's multiprocessing package and the parallel.py from scikit-learn's externals package on my Dropbox.

The test script is:

import numpy as np
import sklearn
from sklearn.linear_model import SGDClassifier
from sklearn import grid_search
import multiprocessing as mp


def main():
    print("Started.")

    print("numpy:", np.__version__)
    print("sklearn:", sklearn.__version__)

    n_samples = 1000000
    n_features = 1000

    X_train = np.random.randn(n_samples, n_features)
    y_train = np.random.randint(0, 2, size=n_samples)

    print("input data size: %.3fMB" % (X_train.nbytes / 1e6))

    model = SGDClassifier(penalty='elasticnet', n_iter=10, shuffle=True)
    param_grid = [{
        'alpha': 10.0 ** -np.arange(1, 7),
        'l1_ratio': [.05, .15, .5, .7, .9, .95, .99, 1],
    }]
    gs = grid_search.GridSearchCV(model, param_grid, n_jobs=8, verbose=100)
    gs.fit(X_train, y_train)
    print(gs.grid_scores_)

if __name__ == '__main__':
    mp.freeze_support()
    main()

This results in the output:

Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Started.
('numpy:', '1.8.1')
('sklearn:', '0.15.0b1')
input data size: 8000.000MB
Fitting 3 folds for each of 48 candidates, totalling 144 fits
Memmaping (shape=(1000000L, 1000L), dtype=float64) to new file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
Failed to save <type 'numpy.ndarray'> to .npy file:
Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 240, in save
    obj, filename = self._write_array(obj, filename)
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 203, in _write_array
    self.np.save(filename, array)
  File "C:\Anaconda\lib\site-packages\numpy\lib\npyio.py", line 453, in save
    format.write_array(fid, arr)
  File "C:\Anaconda\lib\site-packages\numpy\lib\format.py", line 406, in write_array
    array.tofile(fp)
ValueError: 1000000000 requested and 268435456 written

Memmaping (shape=(1000000L, 1000L), dtype=float64) to old file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
[the MKL trial-mode banner above is repeated by each of the 8 worker processes]
Traceback (most recent call last):
  File "S:\laszlo\gridsearch_largearray.py", line 33, in <module>
    main()
  File "S:\laszlo\gridsearch_largearray.py", line 28, in main
    gs.fit(X_train, y_train)
  File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 597, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 379, in _fit
    for parameters in parameter_iterable
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 651, in __call__
    self.retrieve()
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 503, in retrieve
    self._output.append(job.get())
  File "C:\Anaconda\lib\multiprocessing\pool.py", line 558, in get
    raise self._value
struct.error: integer out of range for 'i' format code
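The failing struct call can be reproduced in isolation. A minimal sketch (added here for illustration, not part of the original post): the 'i' format code packs a 32-bit signed integer, so any payload size of 2**31 bytes or more cannot be encoded:

```python
import struct

# 'i' is a 32-bit signed int: the largest packable value is 2**31 - 1.
struct.pack('i', 2**31 - 1)  # works

try:
    struct.pack('i', 2**31)  # one past the limit, e.g. a >2 GiB payload size
except struct.error as err:
    print(err)
```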

EDIT: ogrisel's answer does indeed work with manual memory mapping on scikit-learn 0.15.0b1. Do not forget to run only one script at a time, otherwise you can still run out of memory and have too many threads. (My run takes ~60 GB on data that is ~12.5 GB as CSV, with 8 threads.)

Best Answer

As a workaround, you can try to memory map your data explicitly and manually, as explained in the joblib documentation.

EDIT #1: Here is the important part:

from sklearn.externals import joblib

joblib.dump(X_train, some_filename)
X_train = joblib.load(some_filename, mmap_mode='r+')

Then pass this memmapped data to GridSearchCV under scikit-learn 0.15+.
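A self-contained sketch of that workaround on a small array (the filename and sizes are illustrative; the standalone joblib package exposes the same dump/load API that sklearn.externals.joblib did in 0.15):

```python
import os
import tempfile

import numpy as np
import joblib  # the answer uses sklearn.externals.joblib; same API

X_train = np.random.randn(1000, 10)

# Dump once, then reload as a read-write memory map backed by the file.
path = os.path.join(tempfile.mkdtemp(), 'X_train.pkl')
joblib.dump(X_train, path)
X_mmap = joblib.load(path, mmap_mode='r+')

# X_mmap is a numpy.memmap: workers spawned by GridSearchCV can reuse
# the on-disk file instead of receiving a pickled copy of the array.
print(type(X_mmap))
```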

EDIT #2: Furthermore: if you use a 32-bit version of Anaconda, each python process will be limited to 2 GB, which can also limit the memory.
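A quick way to check which build you are running (a small sketch added here; the size of a C pointer distinguishes 32-bit from 64-bit interpreters):

```python
import struct
import sys

bits = struct.calcsize("P") * 8  # size of a C pointer, in bits
print("%d-bit Python, sys.maxsize = %d" % (bits, sys.maxsize))
# On a 32-bit build this prints 32, and each process is capped near 2 GB.
```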

I just found a bug in numpy.save under Python 3.4, but even when that is fixed, a subsequent call to mmap will fail with:

OSError: [WinError 8] Not enough storage is available to process this command

So please use a 64-bit version of Python (with Anaconda, as AFAIK there are currently no other 64-bit packages for numpy / scipy / scikit-learn==0.15.0b1).

EDIT #3: I found another issue that can cause excessive memory usage under Windows: currently joblib.Parallel memory maps input data with mmap_mode='c' by default: this copy-on-write setting seems to cause Windows to exhaust the paging file and sometimes triggers "[error 1455] The paging file is too small for this operation to complete" errors. Setting mmap_mode='r' or mmap_mode='r+' does not trigger that problem. I will run tests to see if I can change the default mode in the next version of joblib.
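joblib's Parallel exposes both the auto-memmapping threshold and the mmap mode, so the non-default mode can be sketched like this (parameter names per standalone joblib; the tiny max_nbytes is deliberate so that even this toy array gets memmapped):

```python
import numpy as np
from joblib import Parallel, delayed

X = np.random.randn(2000, 50)

# Arrays larger than max_nbytes are memmapped before being handed to
# workers; mmap_mode='r' avoids the copy-on-write 'c' behaviour
# discussed above.
results = Parallel(n_jobs=2, max_nbytes='1K', mmap_mode='r')(
    delayed(np.mean)(X[i::4]) for i in range(4)
)
print(results)
```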

Regarding "python - scikit-learn joblib bug: multiprocessing pool self.value out of range for 'i' format code, only with large numpy arrays", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/24406937/
