
python - How do I get GridSearchCV to work with a custom transformer in my pipeline?

Reposted. Author: 太空狗. Updated: 2023-10-30 01:37:52

If I leave out my custom transformer, GridSearchCV runs fine; with it in the pipeline, it raises an error. Here is a fake dataset:

import pandas
import numpy
from sklearn_pandas import DataFrameMapper
from sklearn_pandas import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import RandomForestClassifier
import sklearn_pandas
from sklearn.preprocessing import MinMaxScaler

df = pandas.DataFrame({"Letter":["a","b","c","d","a","b","c","d","a","b","c","d","a","b","c","d"],
                       "Number":[1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4],
                       "Label":["G","G","B","B","G","G","B","B","G","G","B","B","G","G","B","B"]})

class MyTransformer(TransformerMixin):

    def transform(self, x, **transform_args):
        x["Number"] = x["Number"].apply(lambda row: row*2)
        return x

    def fit(self, x, y=None, **fit_args):
        return self

x_train = df
y_train = x_train.pop("Label")

mapper = DataFrameMapper([
    ("Number", MinMaxScaler()),
    ("Letter", LabelBinarizer()),
])

pipe = Pipeline([
    ("custom", MyTransformer()),
    ("mapper", mapper),
    ("classifier", RandomForestClassifier()),
])


param_grid = {"classifier__min_samples_split":[10,20], "classifier__n_estimators":[2,3,4]}

model_grid = sklearn_pandas.GridSearchCV(pipe, param_grid, verbose=2, scoring="accuracy")

model_grid.fit(x_train, y_train)

The error is

list indices must be integers, not str

How do I get GridSearchCV to work when there is a custom transformer in my pipeline?

Best answer

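I know this answer comes rather late, but I've run into the same behavior with sklearn and the classes derived from BaseSearchCV. The issue seems to stem from the _PartitionIterator class in the sklearn cross_validation module: it assumes that everything emitted by each TransformerMixin class in the pipeline will be array-like, so it generates index arrays that are then used to index the incoming X args positionally, as if they were arrays. Here's the __iter__ method: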

def __iter__(self):
    ind = np.arange(self.n)
    for test_index in self._iter_test_masks():
        train_index = np.logical_not(test_index)
        train_index = ind[train_index]
        test_index = ind[test_index]
        yield train_index, test_index
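To see what that loop yields, here is a standalone sketch of the same mask-to-index logic. The function `iter_train_test` and the hand-written fold masks are hypothetical stand-ins for `_PartitionIterator` and `_iter_test_masks`; this is not sklearn's code, only an illustration under those assumptions.

```python
import numpy as np


def iter_train_test(n, test_masks):
    """Hypothetical stand-in: turn boolean test masks into index arrays."""
    ind = np.arange(n)
    for test_mask in test_masks:
        train_mask = np.logical_not(test_mask)
        # Boolean masks become arrays of integer row positions
        yield ind[train_mask], ind[test_mask]


# Three made-up folds over six samples
masks = [
    np.array([True, True, False, False, False, False]),
    np.array([False, False, True, True, False, False]),
    np.array([False, False, False, False, True, True]),
]

splits = [(train.tolist(), test.tolist())
          for train, test in iter_train_test(6, masks)]
```

Each split is a pair of plain integer position arrays, which only makes sense for containers that support positional row indexing.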

The BaseSearchCV grid search base class calls cross_validation's _fit_and_score function, which in turn uses a helper called _safe_split. Here's the relevant line:

X_subset = [X[idx] for idx in indices]

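If X is a pandas DataFrame, which is what you are emitting from your transform function, this will definitely produce unexpected results, because X[idx] on a DataFrame looks up a column rather than a row.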
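A small sketch of the mismatch, using a made-up four-row frame (the names `frame`, `indices` and the result variables are illustrative only). The same list comprehension selects rows from a NumPy array but attempts a column lookup on a DataFrame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"Number": [1, 2, 3, 4],
                      "Letter": ["a", "b", "c", "d"]})
indices = np.array([0, 2])

# On an array, X[idx] selects the row at position idx
array_rows = [frame.values[idx] for idx in indices]

# On a DataFrame, X[idx] is a column lookup, and there is no column named 0
try:
    frame_rows = [frame[idx] for idx in indices]
    frame_error = None
except KeyError as exc:
    frame_error = exc

# Positional row selection on a DataFrame has to go through .iloc
iloc_rows = frame.iloc[indices]
```

Here the DataFrame case raises a `KeyError`. The exact exception seen in practice depends on what container reaches the indexing line.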

I found two ways to work around this:

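  1. Make sure your transformer returns an array. The original answer used x.as_matrix(), which was removed in pandas 1.0; x.values works on both old and new versions:

    return x.values
  2. This one is a hack. If your pipeline requires the input to the next transformer to be a DataFrame, as it did in my case, you can write a utility script that is essentially the same as the sklearn grid_search module but adds a few validation helpers, which are called in the _fit method of the BaseSearchCV class: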

    import numpy as np
    import pandas as pd

    def _validate_X(X):
        """Returns X if X isn't a pandas frame, otherwise
        the underlying matrix in the frame. """
        # .values stands in for as_matrix(), which was removed in pandas 1.0
        return X if not isinstance(X, pd.DataFrame) else X.values

    def _validate_y(y):
        """Returns y if y isn't a series, otherwise the array"""
        if y is None:
            return y

        # if it's a series
        elif isinstance(y, pd.Series):
            return np.array(y.tolist())

        # if it's a dataframe:
        elif isinstance(y, pd.DataFrame):
            # check its dimensions: only a single column is accepted
            if y.shape[1] > 1:
                raise ValueError('matrix provided as y')
            return y[y.columns[0]].tolist()

        # bail and let the sklearn function handle validation
        return y

For example, here's my "custom grid_search module".
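For the first workaround, here is a minimal sketch on a current scikit-learn, where `GridSearchCV` is imported from `sklearn.model_selection` rather than `sklearn.grid_search`. The `DoubleNumber` class, the toy data and the parameter grid are illustrative assumptions, not the original poster's code. The sketch drops the `DataFrameMapper` step and keeps only the "Number" column. The transformer also inherits `BaseEstimator` so the pipeline can be cloned during the search, and it builds a new frame instead of mutating its input.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class DoubleNumber(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: doubles "Number" and returns a NumPy array."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        doubled = X[["Number"]] * 2  # new frame, the caller's data is untouched
        return doubled.values        # hand an array to the next step


# Toy data in the spirit of the question (assumed, not the original frame)
X = pd.DataFrame({"Number": [1, 2, 3, 4] * 4})
y = pd.Series(["G", "G", "B", "B"] * 4)

pipe = Pipeline([
    ("custom", DoubleNumber()),
    ("classifier", RandomForestClassifier(random_state=0)),
])
param_grid = {"classifier__n_estimators": [2, 3, 4]}

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=2)
search.fit(X, y)
best_params = search.best_params_
```

Because the transformer returns an array, every later step receives positionally indexable data. A step that needs column names, such as a DataFrameMapper, would not work after it, which is the situation the second workaround addresses.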

Regarding "python - How do I get GridSearchCV to work with a custom transformer in my pipeline?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/30989036/
