python - How to implement caching with an sklearn Pipeline

Reposted. Author: 行者123. Updated: 2023-12-05 05:43:29

I have looked at the following: "Using scikit Pipeline for testing models but preprocessing data only once", but it does not work for me. I am using scikit-learn 1.0.2.

Example:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from tempfile import mkdtemp
from joblib import Memory
import time
from shutil import rmtree

class Test(BaseEstimator, TransformerMixin):
    def __init__(self, col):
        self.col = col

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        for t in range(5):
            # just to slow it down / check caching.
            print(".")
            time.sleep(1)
        print(self.col)

cachedir = mkdtemp()
memory = Memory(location=cachedir, verbose=10)


pipline = Pipeline(
    [
        ("test", Test(col="this_column")),
    ],
    memory=memory,
)

pipline.fit_transform(None)

This prints:

.
.
.
.
.
this_column

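When I call it a second time, I expect the result to be cached, so the five "." lines should not be printed before this_column.

But that does not happen: I get the full output of the for loop with time.sleep again.

Why is that?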

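Best answer

The last step of a pipeline is not cached: memory only caches the fitted transformers that come before the final step. Here is a slightly modified version of your script, with a second step added so that the first one becomes an intermediate step.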

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
import time

class Test(BaseEstimator, TransformerMixin):
    def __init__(self, col):
        self.col = col

    def fit(self, X, y=None):
        print(self.col)
        return self

    def transform(self, X, y=None):
        for t in range(5):
            # just to slow it down / check caching.
            print(".")
            time.sleep(1)
        #print(self.col)
        return X

pipline = Pipeline(
    [
        ("test", Test(col="this_column")),
        ("test2", Test(col="that_column"))
    ],
    memory="tmp/cache",
)

pipline.fit(None)
pipline.fit(None)
pipline.fit(None)

#this_column
#.
#.
#.
#.
#.
#that_column
#that_column
#that_column
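
If the expensive transformer really is the only step, one possible workaround (not part of the original answer) is to append a "passthrough" final step so that the transformer becomes an intermediate step and is therefore eligible for caching. The sketch below is a minimal illustration under that assumption: SlowStep, CALLS and the step names are hypothetical, and a call counter replaces time.sleep so the effect can be checked quickly.

```python
from shutil import rmtree
from tempfile import mkdtemp

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

CALLS = []  # records every real (non-cached) call to transform


class SlowStep(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for an expensive transformer."""

    def __init__(self, col="this_column"):
        self.col = col

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        CALLS.append(self.col)  # only runs when the cache is missed
        return X


cachedir = mkdtemp()
pipe = Pipeline(
    [
        ("slow", SlowStep(col="this_column")),
        # Assumption: a trailing "passthrough" makes "slow" an intermediate step.
        ("final", "passthrough"),
    ],
    memory=cachedir,
)

X = np.arange(6).reshape(3, 2)
first = pipe.fit_transform(X)   # cache miss: SlowStep.transform runs
second = pipe.fit_transform(X)  # same step parameters and data: served from cache
n_calls = len(CALLS)
print(n_calls)  # 1 if the second call hit the cache

rmtree(cachedir)  # clean up the temporary cache directory
```

Note that the cache only covers the fitting path (fit and fit_transform). Calling pipe.transform(X) on a fitted pipeline still runs each step's transform every time.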

For "python - How to implement caching with an sklearn Pipeline", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/71812869/
