python - Does the imblearn pipeline turn off sampling for testing?


Reposted by: 行者123 Updated: 2023-12-03 14:06:36


Let's assume the following code (from the imblearn example on pipelines):

...    
# Instantiate a PCA object for the sake of easy visualisation
pca = PCA(n_components=2)

# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()

# Create the classifier
knn = KNN(1)

# Make the splits
X_train, X_test, y_train, y_test = tts(X, y, random_state=42)

# Add one transformer and two samplers in the pipeline object
pipeline = make_pipeline(pca, enn, renn, knn)

pipeline.fit(X_train, y_train)
y_hat = pipeline.predict(X_test)

I want to make sure that when pipeline.predict(X_test) is executed, the sampling procedures enn and renn are not executed (while pca, of course, must be executed).

First, it is clear to me that over-, under-, and mixed-sampling are procedures to be applied to the training set, not to the test/validation set. Please correct me here if I am wrong.

I browsed through the imblearn Pipeline code but I could not find the predict method there.

I also would like to be sure that this correct behavior works when the pipeline is inside a GridSearchCV.

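I just need to be sure that this is what happens with imblearn.Pipeline.
Edit: 2020-08-28
@wundermahn's answer is all I needed.
This edit is just to add that this is the reason to use imblearn.Pipeline, rather than sklearn.Pipeline, for imbalanced pre-processing. Nowhere in the imblearn documentation have I found an explanation of why imblearn.Pipeline is needed when sklearn.Pipeline already exists.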

Best answer

Great question(s). To go through them in the order you posted:

First, it is clear to me that over-, under-, and mixed-sampling are procedures to be applied to the training set, not to the test/validation set. Please correct me here if I am wrong.

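That is correct. You certainly do not want to test (whether on your test or your validation data) on data that is not representative of the actual, live, "production" dataset. You should really only apply sampling to the training data. Note that if you are using techniques such as k-fold cross-validation, you should apply the sampling to each fold individually, as indicated by this IEEE paper.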

I browsed through the imblearn Pipeline code but I could not find the predict method there.

I assume you found the imblearn.pipeline source code, so if you did, what you want to do is take a look at the fit_predict method:

    @if_delegate_has_method(delegate="_final_estimator")
    def fit_predict(self, X, y=None, **fit_params):
        """Apply `fit_predict` of last step in pipeline after transforms.
        Applies fit_transforms of a pipeline to the data, followed by the
        fit_predict method of the final estimator in the pipeline. Valid
        only if the final estimator implements fit_predict.
        Parameters
        ----------
        X : iterable
            Training data. Must fulfill input requirements of first step of
            the pipeline.
        y : iterable, default=None
            Training targets. Must fulfill label requirements for all steps
            of the pipeline.
        **fit_params : dict of string -> object
            Parameters passed to the ``fit`` method of each step, where
            each parameter name is prefixed such that parameter ``p`` for step
            ``s`` has key ``s__p``.
        Returns
        -------
        y_pred : ndarray of shape (n_samples,)
            The predicted target.
        """
        Xt, yt, fit_params = self._fit(X, y, **fit_params)
        with _print_elapsed_time('Pipeline',
                                 self._log_message(len(self.steps) - 1)):
            y_pred = self.steps[-1][-1].fit_predict(Xt, yt, **fit_params)
        return y_pred

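Here we can see that the pipeline delegates to the final estimator in the pipeline (self.steps[-1][-1]), which in the example you posted is scikit-learn's knn, whose predict method is: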

    def predict(self, X):
        """Predict the class labels for the provided data.
        Parameters
        ----------
        X : array-like of shape (n_queries, n_features), \
                or (n_queries, n_indexed) if metric == 'precomputed'
            Test samples.
        Returns
        -------
        y : ndarray of shape (n_queries,) or (n_queries, n_outputs)
            Class labels for each data sample.
        """
        X = check_array(X, accept_sparse='csr')

        neigh_dist, neigh_ind = self.kneighbors(X)
        classes_ = self.classes_
        _y = self._y
        if not self.outputs_2d_:
            _y = self._y.reshape((-1, 1))
            classes_ = [self.classes_]

        n_outputs = len(classes_)
        n_queries = _num_samples(X)
        weights = _get_weights(neigh_dist, self.weights)

        y_pred = np.empty((n_queries, n_outputs), dtype=classes_[0].dtype)
        for k, classes_k in enumerate(classes_):
            if weights is None:
                mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
            else:
                mode, _ = weighted_mode(_y[neigh_ind, k], weights, axis=1)

            mode = np.asarray(mode.ravel(), dtype=np.intp)
            y_pred[:, k] = classes_k.take(mode)

        if not self.outputs_2d_:
            y_pred = y_pred.ravel()

        return y_pred

I also would like to be sure that this correct behaviour works when the pipeline is inside a GridSearchCV

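This sort of assumes the two points above hold, and I take it to mean that you want a complete, minimal, reproducible example of this working in a GridSearchCV. There is extensive documentation from scikit-learn on this, but an example I created using knn is below: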

import numpy as np

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split

param_grid = [
    {
        'classification__n_neighbors': [1,3,5,7,10],
    }
]

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.20)

pipe = Pipeline([
    ('sampling', SMOTE()),
    ('classification', KNeighborsClassifier())
])

grid = GridSearchCV(pipe, param_grid=param_grid)
grid.fit(X_train, y_train)
mean_scores = np.array(grid.cv_results_['mean_test_score'])
print(mean_scores)

# Example output (varies between runs, since no random_state is set):
# [0.98051926 0.98121129 0.97981998 0.98050474 0.97494193]

Your intuition was spot on, good job :)

Regarding "python - Does the imblearn pipeline turn off sampling for testing?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/63520908/
