
python - scikit-learn: ColumnTransformer and OneHotEncoder – how to raise an error for all new categorical levels across all fields?

Reposted. Author: 行者123. Updated: 2023-12-01 08:26:31

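I am trying to use scikit-learn's ColumnTransformer class both as an actual DataFrame transformer and as a "watchdog" transformer, that is, as an object that detects when new classes appear in the categorical features of my dataset.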

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Original DataFrame off of which transformers are fit
orig_df = pd.DataFrame(
    {
        'a': [np.nan, 'a', 'b', 'b', 'a'],
        'b': ([np.nan] * 3) + ['a', 'a'],
        'c': np.random.randn(5)
    }
)

# New DataFrame that will be transformed using already fitted transformer
new_df = pd.DataFrame(
    {
        'a': [np.nan, 'a', 'b', 'b', 'c'],
        'b': ([np.nan] * 4) + ['b'],
        'c': np.random.randn(5)
    }
)

# Cast NaNs to str to play nicely with OneHotEncoder
for col in ('a', 'b'):
    orig_df[col] = orig_df[col].astype(str)
    new_df[col] = new_df[col].astype(str)

# Create master transformer for each of the three columns a, b, and c
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; the old name was later removed
transformer_config = [
    ('a', OneHotEncoder(sparse=False, handle_unknown='error'), ['a']),
    ('b', OneHotEncoder(sparse=False, handle_unknown='error'), ['b']),
    ('c', 'passthrough', ['c']),
]

transformer = ColumnTransformer(transformer_config)

# Fit to original dataset
transformer.fit(orig_df)

# Transform new dataset
transformer.transform(new_df)

This produces:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 495, in transform
    Xs = self._fit_transform(X, None, _transform_one, fitted=True)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 393, in _fit_transform
    fitted=fitted, replace_strings=True))
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 983, in __call__
    if self.dispatch_one_batch(iterator):
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 545, in __init__
    self.results = batch()
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 261, in __call__
    for func, args, kwargs in self.items]
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 261, in <listcomp>
    for func, args, kwargs in self.items]
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/pipeline.py", line 605, in _transform_one
    res = transformer.transform(X)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 591, in transform
    return self._transform_new(X)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 553, in _transform_new
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 109, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['c'] in column 0 during transform

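This raises the error I generally want, but only for one column. As you can see in new_df, column b also has a new level ('b'). Is there a straightforward way to report all new levels across all fields that use this OneHotEncoder class, rather than only the first one that fails?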

My first idea was to iterate over each field individually and catch each ValueError with try/except, but that does not work well with ColumnTransformer:

>>> transformer.transform(new_df[['b']])
KeyError: "None of [['a']] are in the [columns]"

Best answer

Here is a suggested solution for your example:

from sklearn.base import BaseEstimator

for _, t_inst, t_col in transformer.transformers_:
    try:
        if isinstance(t_inst, BaseEstimator):
            t_inst.transform(new_df[t_col])
        else:
            pass

    except Exception as e:
        print('During transformation of column {} the following error occurred: {}'.format(t_col, e))

Output

During transformation of column ['a'] the following error occurred: Found unknown categories ['c'] in column 0 during transform
During transformation of column ['b'] the following error occurred: Found unknown categories ['b'] in column 0 during transform

It simply tries to apply the transformers one at a time.
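If a single report is preferable to printed messages, the same loop could be wrapped in a function that gathers the errors and returns them together. The following is only a sketch: the helper name `collect_transform_errors` and the small example frames are assumptions made for illustration, and it assumes each transformer's columns are given as a list of names. The `sparse` argument is omitted so the snippet does not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


def collect_transform_errors(fitted_ct, df):
    """Hypothetical helper: run each fitted sub-transformer on its own columns
    and return a dict mapping the column names to the ValueError message."""
    errors = {}
    for _, t_inst, t_cols in fitted_ct.transformers_:
        # Entries such as 'passthrough' or 'drop' may be plain strings; skip them.
        if not isinstance(t_inst, BaseEstimator):
            continue
        try:
            t_inst.transform(df[t_cols])
        except ValueError as exc:
            errors[", ".join(map(str, t_cols))] = str(exc)
    return errors


# Small illustrative frames (not the data from the question)
orig_df = pd.DataFrame({'a': ['a', 'b', 'b'], 'b': ['a', 'a', 'a'], 'c': [0.1, 0.2, 0.3]})
new_df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['a', 'b', 'b'], 'c': [0.4, 0.5, 0.6]})

transformer = ColumnTransformer([
    ('a', OneHotEncoder(handle_unknown='error'), ['a']),
    ('b', OneHotEncoder(handle_unknown='error'), ['b']),
    ('c', 'passthrough', ['c']),
])
transformer.fit(orig_df)

errors = collect_transform_errors(transformer, new_df)
for cols, message in errors.items():
    print('{}: {}'.format(cols, message))
```

A caller could then raise one ValueError built from the whole dictionary whenever it is non-empty, so that every offending column is reported in a single failure.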

Note that the .transformers_ attribute is only available after fitting.
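Since the fitted encoders are reachable through `.transformers_`, another option might be to skip the try/except entirely and compare the new data against each encoder's fitted `categories_` attribute. This is a sketch of a different technique from the one in the answer above: the helper name `find_new_levels` and the example frames are hypothetical, and it assumes string-valued columns listed by name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


def find_new_levels(fitted_ct, df):
    """Hypothetical helper: map each one-hot-encoded column to the values
    in `df` that the fitted encoder did not see during fit."""
    new_levels = {}
    for _, t_inst, t_cols in fitted_ct.transformers_:
        if not isinstance(t_inst, OneHotEncoder):
            continue
        # categories_ holds one array of known levels per encoded column.
        for col, known in zip(t_cols, t_inst.categories_):
            unseen = set(df[col].unique()) - set(known)
            if unseen:
                new_levels[col] = sorted(unseen)
    return new_levels


# Small illustrative frames (not the data from the question)
orig_df = pd.DataFrame({'a': ['a', 'b', 'b'], 'b': ['a', 'a', 'a'], 'c': [0.1, 0.2, 0.3]})
new_df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['a', 'b', 'b'], 'c': [0.4, 0.5, 0.6]})

transformer = ColumnTransformer([
    ('a', OneHotEncoder(handle_unknown='error'), ['a']),
    ('b', OneHotEncoder(handle_unknown='error'), ['b']),
    ('c', 'passthrough', ['c']),
])
transformer.fit(orig_df)

new_levels = find_new_levels(transformer, new_df)
print(new_levels)
```

This variant returns the unseen values themselves instead of error strings, which may be easier to log or assert on, at the cost of duplicating a check the encoder already performs internally.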

Regarding "python - scikit-learn: ColumnTransformer and OneHotEncoder – how to raise an error for all new categorical levels across all fields?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54201162/
