I have a stacking workflow similar to the following:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import make_pipeline
import xgboost as xgb

# Random features, binary labels, and per-sample weights
X = np.random.random(size=(1000, 5))
y = np.random.choice([0, 1], 1000)
w = np.random.random(size=(1000,))

scaler = StandardScaler()
log_reg = LogisticRegression()

params = {
    'n_estimators': 10,
    'max_depth': 3,
    'learning_rate': 0.1
}

# Base learner: scaling followed by logistic regression
log_reg_pipe = make_pipeline(
    scaler,
    log_reg
)

# Stacked model with xgboost as the final estimator
stack_pipe = make_pipeline(
    StackingClassifier(
        estimators=[('lr', log_reg_pipe)],
        final_estimator=xgb.XGBClassifier(**params),
        passthrough=True,
        cv=2
    )
)
I would like to pass sample weights to xgboost. My question is: how do I set sample weights on the final estimator?
I tried
stack_pipe.fit(X, y, sample_weights=w)
which raises
ValueError: Pipeline.fit does not accept the sample_weights parameter. You can pass parameters to specific steps of your pipeline using the stepname__parameter format, e.g. `Pipeline.fit(X, y, logisticregression__sample_weight=sample_weight)`
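The `stepname__parameter` format mentioned in the error message can be sketched on a standalone pipeline. This is a minimal illustration, assuming synthetic data and the step names that `make_pipeline` generates automatically; it shows how a plain `Pipeline` routes weights to its last step, and it likely does not by itself resolve the stacked case, because the base pipeline sits inside the `StackingClassifier`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.random(size=(200, 5))
y = rng.integers(0, 2, size=200)
w = rng.random(size=200)

# make_pipeline names each step after its lowercased class name,
# so the final step here is called "logisticregression".
pipe = make_pipeline(StandardScaler(), LogisticRegression())
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Route the weights to the final step only; the keyword is the
# singular "sample_weight", prefixed with the step name.
pipe.fit(X, y, logisticregression__sample_weight=w)
proba = pipe.predict_proba(X)
```

Note that the estimator keyword is `sample_weight`, not `sample_weights` as in the failing call.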
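Best answer
I also realized recently that the stacking estimators cannot handle sample-weighted pipelines. I worked around this by subclassing scikit-learn's StackingRegressor and StackingClassifier classes and overriding their fit() methods so that they handle pipelines better. See the following:

"""Implement StackingClassifier that can handle sample-weighted Pipelines."""
from sklearn.ensemble import StackingRegressor, StackingClassifier
from copy import deepcopy
import numpy as np
from joblib import Parallel
from sklearn.base import clone
from sklearn.base import is_classifier, is_regressor
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import check_cv
from sklearn.utils import Bunch
# Location of `delayed` depends on the scikit-learn version; newer releases
# expose it from sklearn.utils.parallel instead.
from sklearn.utils.fixes import delayed
from sklearn.pipeline import Pipeline

# Step name that the final estimator must have in every base pipeline
ESTIMATOR_NAME_IN_PIPELINE = 'estimator'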


def new_fit_single_estimator(estimator, X, y, sample_weight=None,
                             message_clsname=None, message=None):
    """Private function used to fit an estimator within a job."""
    if sample_weight is not None:
        try:
            if isinstance(estimator, Pipeline):
                # determine name of final estimator
                estimator_name = estimator.steps[-1][0]
                kwargs = {estimator_name + '__sample_weight': sample_weight}
                estimator.fit(X, y, **kwargs)
            else:
                estimator.fit(X, y, sample_weight=sample_weight)
        except TypeError as exc:
            if "unexpected keyword argument 'sample_weight'" in str(exc):
                raise TypeError(
                    "Underlying estimator {} does not support sample weights."
                    .format(estimator.__class__.__name__)
                ) from exc
            raise
    else:
        estimator.fit(X, y)
    return estimator


class FlexibleStackingClassifier(StackingClassifier):

    def __init__(self, estimators, final_estimator=None, *, cv=None,
                 n_jobs=None, passthrough=False, verbose=0):
        super().__init__(
            estimators=estimators,
            final_estimator=final_estimator,
            cv=cv,
            n_jobs=n_jobs,
            passthrough=passthrough,
            verbose=verbose
        )

    def fit(self, X, y, sample_weight=None):
        """Fit the estimators.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            Training vectors, where `n_samples` is the number of samples and
            `n_features` is the number of features.
        y : array-like of shape (n_samples,)
            Target values.
        sample_weight : array-like of shape (n_samples,) or default=None
            Sample weights. If None, then samples are equally weighted.
            Note that this is supported only if all underlying estimators
            support sample weights.

            .. versionchanged:: 0.23
               when not None, `sample_weight` is passed to all underlying
               estimators

        Returns
        -------
        self : object
        """
        # Note: the upstream StackingClassifier.fit also label-encodes y and
        # sets classes_ before delegating to the base class; this override
        # skips that step, so check predict() on your scikit-learn version.

        # all_estimators contains all estimators, the one to be fitted and the
        # 'drop' string.
        names, all_estimators = self._validate_estimators()
        self._validate_final_estimator()

        stack_method = [self.stack_method] * len(all_estimators)

        # Fit the base estimators on the whole training data. Those
        # base estimators will be used in transform, predict, and
        # predict_proba. They are exposed publicly.
        self.estimators_ = Parallel(n_jobs=self.n_jobs)(
            delayed(new_fit_single_estimator)(clone(est), X, y, sample_weight)
            for est in all_estimators if est != 'drop'
        )

        self.named_estimators_ = Bunch()
        est_fitted_idx = 0
        for name_est, org_est in zip(names, all_estimators):
            if org_est != 'drop':
                self.named_estimators_[name_est] = self.estimators_[
                    est_fitted_idx]
                est_fitted_idx += 1
            else:
                self.named_estimators_[name_est] = 'drop'

        # To train the meta-classifier using the most data as possible, we use
        # a cross-validation to obtain the output of the stacked estimators.

        # To ensure that the data provided to each estimator are the same, we
        # need to set the random state of the cv if there is one and we need to
        # take a copy.
        cv = check_cv(self.cv, y=y, classifier=is_classifier(self))
        if hasattr(cv, 'random_state') and cv.random_state is None:
            cv.random_state = np.random.RandomState()

        self.stack_method_ = [
            self._method_name(name, est, meth)
            for name, est, meth in zip(names, all_estimators, stack_method)
        ]
        # Address the weights to the pipeline step named
        # ESTIMATOR_NAME_IN_PIPELINE during cross-validation.
        fit_params = ({f"{ESTIMATOR_NAME_IN_PIPELINE}__sample_weight": sample_weight}
                      if sample_weight is not None
                      else None)
        predictions = Parallel(n_jobs=self.n_jobs)(
            delayed(cross_val_predict)(clone(est), X, y, cv=deepcopy(cv),
                                       method=meth, n_jobs=self.n_jobs,
                                       fit_params=fit_params,
                                       verbose=self.verbose)
            for est, meth in zip(all_estimators, self.stack_method_)
            if est != 'drop'
        )

        # Only not None or not 'drop' estimators will be used in transform.
        # Remove the None from the method as well.
        self.stack_method_ = [
            meth for (meth, est) in zip(self.stack_method_, all_estimators)
            if est != 'drop'
        ]

        X_meta = self._concatenate_predictions(X, predictions)
        new_fit_single_estimator(self.final_estimator_, X_meta, y,
                                 sample_weight=sample_weight)

        return self


class FlexibleStackingRegressor(StackingRegressor):

    def __init__(self, estimators, final_estimator=None, *, cv=None,
                 n_jobs=None, passthrough=False, verbose=0):
        super().__init__(
            estimators=estimators,
            final_estimator=final_estimator,
            cv=cv,
            n_jobs=n_jobs,
            passthrough=passthrough,
            verbose=verbose
        )

    def fit(self, X, y, sample_weight=None):
        """Fit the estimators.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            Training vectors, where `n_samples` is the number of samples and
            `n_features` is the number of features.
        y : array-like of shape (n_samples,)
            Target values.
        sample_weight : array-like of shape (n_samples,) or default=None
            Sample weights. If None, then samples are equally weighted.
            Note that this is supported only if all underlying estimators
            support sample weights.

            .. versionchanged:: 0.23
               when not None, `sample_weight` is passed to all underlying
               estimators

        Returns
        -------
        self : object
        """
        # all_estimators contains all estimators, the one to be fitted and the
        # 'drop' string.
        names, all_estimators = self._validate_estimators()
        self._validate_final_estimator()

        stack_method = [self.stack_method] * len(all_estimators)

        # Fit the base estimators on the whole training data. Those
        # base estimators will be used in transform, predict, and
        # predict_proba. They are exposed publicly.
        self.estimators_ = Parallel(n_jobs=self.n_jobs)(
            delayed(new_fit_single_estimator)(clone(est), X, y, sample_weight)
            for est in all_estimators if est != 'drop'
        )

        self.named_estimators_ = Bunch()
        est_fitted_idx = 0
        for name_est, org_est in zip(names, all_estimators):
            if org_est != 'drop':
                self.named_estimators_[name_est] = self.estimators_[
                    est_fitted_idx]
                est_fitted_idx += 1
            else:
                self.named_estimators_[name_est] = 'drop'

        # To train the meta-classifier using the most data as possible, we use
        # a cross-validation to obtain the output of the stacked estimators.

        # To ensure that the data provided to each estimator are the same, we
        # need to set the random state of the cv if there is one and we need to
        # take a copy.
        cv = check_cv(self.cv, y=y, classifier=is_classifier(self))
        if hasattr(cv, 'random_state') and cv.random_state is None:
            cv.random_state = np.random.RandomState()

        self.stack_method_ = [
            self._method_name(name, est, meth)
            for name, est, meth in zip(names, all_estimators, stack_method)
        ]
        # Address the weights to the pipeline step named
        # ESTIMATOR_NAME_IN_PIPELINE during cross-validation.
        fit_params = ({f"{ESTIMATOR_NAME_IN_PIPELINE}__sample_weight": sample_weight}
                      if sample_weight is not None
                      else None)
        predictions = Parallel(n_jobs=self.n_jobs)(
            delayed(cross_val_predict)(clone(est), X, y, cv=deepcopy(cv),
                                       method=meth, n_jobs=self.n_jobs,
                                       fit_params=fit_params,
                                       verbose=self.verbose)
            for est, meth in zip(all_estimators, self.stack_method_)
            if est != 'drop'
        )

        # Only not None or not 'drop' estimators will be used in transform.
        # Remove the None from the method as well.
        self.stack_method_ = [
            meth for (meth, est) in zip(self.stack_method_, all_estimators)
            if est != 'drop'
        ]

        X_meta = self._concatenate_predictions(X, predictions)
        new_fit_single_estimator(self.final_estimator_, X_meta, y,
                                 sample_weight=sample_weight)

        return self

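I have included both the regressor and the classifier versions, although it seems you only need the classifier subclass.
Note, however, that the final estimator must have the same step name in every pipeline, and that name must match the ESTIMATOR_NAME_IN_PIPELINE constant defined above; otherwise the code will not run. For example, here is a properly defined Pipeline instance that uses the same name as the class definition script shown above: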

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import TweedieRegressor
from sklearn.feature_selection import VarianceThreshold

# The last step is named 'estimator' to match ESTIMATOR_NAME_IN_PIPELINE
validly_named_pipeline = Pipeline([
    ('variance_threshold', VarianceThreshold()),
    ('scaler', StandardScaler()),
    ('estimator', TweedieRegressor())
])
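
The fixed name matters for the cross-validation step, where a single fit_params dictionary is built from ESTIMATOR_NAME_IN_PIPELINE, whereas new_fit_single_estimator reads the name of the last step from the pipeline itself. That lookup can be sketched as a standalone helper. The name `final_step_weight_kwargs` is hypothetical; it is not part of scikit-learn or of the classes above.

```python
from sklearn.linear_model import LogisticRegression, TweedieRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler


def final_step_weight_kwargs(pipeline, sample_weight):
    """Hypothetical helper: fit kwargs addressed to the pipeline's last step."""
    final_step_name = pipeline.steps[-1][0]
    return {f"{final_step_name}__sample_weight": sample_weight}


# Explicitly named pipeline, following the 'estimator' convention
named = Pipeline([("scaler", StandardScaler()), ("estimator", TweedieRegressor())])
# Automatically named pipeline, as produced by make_pipeline
auto_named = make_pipeline(StandardScaler(), LogisticRegression())

weights = [1.0, 2.0, 3.0]
named_kwargs = final_step_weight_kwargs(named, weights)
auto_kwargs = final_step_weight_kwargs(auto_named, weights)
print(list(named_kwargs))
print(list(auto_kwargs))
```

Building the keyword per pipeline in this way would, in principle, let each base pipeline keep its own step names, at the cost of constructing one dictionary per estimator.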
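This is not ideal, but it is what I have for now, and it should work regardless.
Edit: To be clear, when I overrode the fit() method I simply copied the code from the scikit-learn repository and made the necessary changes, which amount to only a few lines. Much of the pasted code is therefore not my original work but that of the scikit-learn developers.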
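
What the overridden fit() does can also be sketched by hand, without subclassing. The following is a rough illustration under stated assumptions, not the scikit-learn implementation: the data are synthetic, there is a single base pipeline whose last step is named 'estimator', and GradientBoostingClassifier stands in for xgb.XGBClassifier so that the example depends only on scikit-learn.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.random(size=(300, 5))
y = rng.integers(0, 2, size=300)
w = rng.random(size=300)

base = Pipeline([("scaler", StandardScaler()), ("estimator", LogisticRegression())])

# Out-of-fold probabilities from the base pipeline; weights are sliced per fold.
oof = np.zeros(len(y))
for train_idx, test_idx in StratifiedKFold(n_splits=2).split(X, y):
    fold_model = clone(base)
    fold_model.fit(X[train_idx], y[train_idx],
                   estimator__sample_weight=w[train_idx])
    oof[test_idx] = fold_model.predict_proba(X[test_idx])[:, 1]

# Equivalent of passthrough=True: base predictions followed by the raw features.
X_meta = np.column_stack([oof, X])

# The final estimator is not a pipeline, so it takes the weights directly.
final = GradientBoostingClassifier(n_estimators=10, max_depth=3, learning_rate=0.1)
final.fit(X_meta, y, sample_weight=w)

# Refit the base pipeline on all data to build meta-features at predict time.
base_fitted = clone(base).fit(X, y, estimator__sample_weight=w)
X_new_meta = np.column_stack([base_fitted.predict_proba(X)[:, 1], X])
pred = final.predict(X_new_meta)
```

The weights therefore enter twice: once when each base pipeline is fitted, and once when the final estimator is trained on the stacked features.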
Regarding python - sklearn StackingClassifier and sample weights, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/65850996/