
python-3.x - Problem using an imbalanced dataset with logloss and RFECV

Reposted. Author: 行者123. Updated: 2023-12-04 08:02:19

I am using an imbalanced dataset (54:38:7%) with RFECV for feature selection, like this:

from lightgbm import LGBMClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import log_loss, make_scorer

# Build a multiclass log-loss scorer (lower is better, so the sign is flipped)
log_loss_rfe = make_scorer(score_func=log_loss, greater_is_better=False)

# Set up the LightGBM classifier
lgb_rfe = LGBMClassifier(objective='multiclass', learning_rate=0.01, verbose=0, force_col_wise=True,
                         random_state=100, n_estimators=5_000, n_jobs=7)

# Set up RFECV with 3-fold cross-validation and the custom scorer
rfe = RFECV(estimator=lgb_rfe, min_features_to_select=2, verbose=3, n_jobs=2, cv=3, scoring=log_loss_rfe)
# Fit it on the training data
rfe.fit(X=X_train, y=y_train)
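I get an error, presumably because the subsamples that sklearn's RFECV creates do not contain all of the classes in my data. Fitting exactly the same data outside of RFECV works without problems.
Here is the full error: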
---------------------------------------------------------------------------

_RemoteTraceback Traceback (most recent call last)

_RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 431, in _process_worker
r = call_item()
File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 285, in __call__
return self.fn(*self.args, **self.kwargs)
File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
return self.func(*args, **kwargs)
File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
return [func(*args, **kwargs)
File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
return [func(*args, **kwargs)
File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/utils/fixes.py", line 222, in __call__
return self.function(*args, **kwargs)
File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/feature_selection/_rfe.py", line 37, in _rfe_single_fit
return rfe._fit(
File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/feature_selection/_rfe.py", line 259, in _fit
self.scores_.append(step_score(estimator, features))
File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/feature_selection/_rfe.py", line 39, in <lambda>
lambda estimator, features: _score(
File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 674, in _score
scores = scorer(estimator, X_test, y_test)
File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 199, in __call__
return self._score(partial(_cached_call, None), estimator, X, y_true,
File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 242, in _score
return self._sign * self._score_func(y_true, y_pred,
File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "/home/ubuntu/ds_jup_venv/lib/python3.8/site-packages/sklearn/metrics/_classification.py", line 2265, in log_loss
raise ValueError("y_true and y_pred contain different number of "
ValueError: y_true and y_pred contain different number of classes 3, 2. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [0 1 2]
"""


The above exception was the direct cause of the following exception:

ValueError Traceback (most recent call last)

<ipython-input-9-5feb62a6f457> in <module>
1 rfe = RFECV(estimator=lgb_rfe, min_features_to_select=2, verbose=3, n_jobs=2, cv=3, scoring=log_loss_rfe)
----> 2 rfe.fit(X=X_train, y=y_train)

~/ds_jup_venv/lib/python3.8/site-packages/sklearn/feature_selection/_rfe.py in fit(self, X, y, groups)
603 func = delayed(_rfe_single_fit)
604
--> 605 scores = parallel(
606 func(rfe, self.estimator, X, y, train, test, scorer)
607 for train, test in cv.split(X, y, groups))

~/ds_jup_venv/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
1052
1053 with self._backend.retrieval_context():
-> 1054 self.retrieve()
1055 # Make sure that we get a last message telling us we are done
1056 elapsed_time = time.time() - self._start_time

~/ds_jup_venv/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
931 try:
932 if getattr(self._backend, 'supports_timeout', False):
--> 933 self._output.extend(job.get(timeout=self.timeout))
934 else:
935 self._output.extend(job.get())

~/ds_jup_venv/lib/python3.8/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
540 AsyncResults.get from multiprocessing."""
541 try:
--> 542 return future.result(timeout=timeout)
543 except CfTimeoutError as e:
544 raise TimeoutError from e

1 frames

/usr/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
386 def __get_result(self):
387 if self._exception:
--> 388 raise self._exception
389 else:
390 return self._result

ValueError: y_true and y_pred contain different number of classes 3, 2. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [0 1 2]
How can I fix this so that I can select features recursively?

Best answer

Log loss requires probability predictions, not class predictions, so you should add needs_proba=True to the scorer:

log_loss_rfe = make_scorer(score_func=log_loss, needs_proba=True, greater_is_better=False)
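The error occurs because, without that argument, the y_pred passed to log_loss is one-dimensional (the predicted classes 0, 1, 2). sklearn then assumes it is a binary classification problem and that these predictions are the probabilities of the positive class. To handle that case it adds a column with the probability of the negative class, which gives only two columns against your three classes. The missing classes in the cross-validation subsamples are therefore not the cause.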
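The explanation above can be checked on a small scale without RFECV or LightGBM. The following is a minimal sketch, not the asker's setup: the synthetic data from make_classification and the LogisticRegression model are stand-ins chosen for illustration, with class weights that only roughly mimic the 54:38:7 split. It passes hard class labels and then class probabilities to log_loss, and also uses the built-in "neg_log_loss" scorer, which calls predict_proba for you.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer, log_loss

# Synthetic 3-class, imbalanced data standing in for X_train / y_train
# (an assumption for illustration, not the asker's dataset).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, weights=[0.54, 0.38, 0.08], random_state=0,
)

# Any classifier with predict_proba works here; LogisticRegression is a stand-in.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Case 1: hard class labels (1-D), which is what a scorer built without
# needs_proba=True hands to log_loss.
labels_1d = clf.predict(X)
try:
    log_loss(y, labels_1d)
    rejected = False
except ValueError as exc:
    # The exact wording of the message depends on the scikit-learn version.
    rejected = True
    print("log_loss rejected 1-D class labels:", exc)

# Case 2: class probabilities, one column per class.
proba = clf.predict_proba(X)
loss = log_loss(y, proba)
print("probability matrix shape:", proba.shape)
print("log loss:", loss)

# The built-in scorer uses predict_proba and returns the negated loss,
# so that greater is better.
neg_loss = get_scorer("neg_log_loss")(clf, X, y)
print("neg_log_loss scorer:", neg_loss)
```

If this behaves as expected, passing scoring="neg_log_loss" to RFECV should be an alternative to building the scorer by hand. Note also that newer scikit-learn releases replace the needs_proba argument of make_scorer with response_method="predict_proba", so the string scorer may be the more portable choice.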

Regarding "python-3.x - Problem using an imbalanced dataset with logloss and RFECV", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66396659/
