python - sklearn ColumnTransformer with MultiLabelBinarizer

Reposted. Author: 行者123. Updated: 2023-12-01 22:10:40

I would like to know whether MultiLabelBinarizer can be used inside a ColumnTransformer.

I have a toy pandas DataFrame, for example:

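import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({"id": [1, 2, 3],
                   "text": ["some text", "some other text", "yet another text"],
                   "label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]]})

preprocess = ColumnTransformer(
    [
        ('vectorizer', CountVectorizer(), 'text'),
        ('binarizer', MultiLabelBinarizer(), ['label']),
    ],
    remainder='drop')

preprocess.fit_transform(df)  # fitting is the step that fails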

However, this code raises an exception:

~/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    714     with _print_elapsed_time(message_clsname, message):
    715         if hasattr(transformer, 'fit_transform'):
--> 716             res = transformer.fit_transform(X, y, **fit_params)
    717         else:
    718             res = transformer.fit(X, y, **fit_params).transform(X)

TypeError: fit_transform() takes 2 positional arguments but 3 were given

With OneHotEncoder in its place, the ColumnTransformer does work.
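
For context on the traceback above: the difference between the two encoders shows up in their method signatures, which can be inspected directly. This is a minimal sketch that assumes a reasonably recent scikit-learn; the variable names `mlb_params` and `ohe_params` are illustrative only.

```python
import inspect

from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder

# MultiLabelBinarizer is meant for targets, so fit_transform takes a single `y`.
mlb_params = list(inspect.signature(MultiLabelBinarizer.fit_transform).parameters)

# OneHotEncoder is a feature transformer, so fit_transform accepts `X` and `y`.
ohe_params = list(inspect.signature(OneHotEncoder.fit_transform).parameters)

print("MultiLabelBinarizer.fit_transform:", mlb_params)
print("OneHotEncoder.fit_transform:", ohe_params)
```

Since the traceback shows the call `transformer.fit_transform(X, y, **fit_params)`, a transformer whose `fit_transform` accepts only one data argument fails with exactly this kind of `TypeError`.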

Best answer

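For an input X, MultiLabelBinarizer is designed to process one column at a time (each row should be a sequence of categories), whereas OneHotEncoder can handle multiple columns. Its fit_transform also accepts only a single argument, while ColumnTransformer passes both X and y, which is what triggers the TypeError above. To build a ColumnTransformer-compatible MultiHotEncoder, you need to iterate over all columns of X and fit/transform each column with its own MultiLabelBinarizer. The following should work for pandas.DataFrame input.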

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer

class MultiHotEncoder(BaseEstimator, TransformerMixin):
    """Wraps `MultiLabelBinarizer` in a form that can work with `ColumnTransformer`. Note
    that input X has to be a `pandas.DataFrame`.
    """
    def __init__(self):
        self.mlbs = list()
        self.n_columns = 0
        self.categories_ = self.classes_ = list()

    def fit(self, X: pd.DataFrame, y=None):
        # Reset state so that calling fit again does not accumulate old binarizers.
        self.mlbs = list()
        self.n_columns = 0
        self.categories_ = self.classes_ = list()
        for i in range(X.shape[1]):  # X can be of multiple columns
            mlb = MultiLabelBinarizer()
            mlb.fit(X.iloc[:, i])
            self.mlbs.append(mlb)
            self.classes_.append(mlb.classes_)
            self.n_columns += 1
        return self

    def transform(self, X: pd.DataFrame):
        if self.n_columns == 0:
            raise ValueError('Please fit the transformer first.')
        if self.n_columns != X.shape[1]:
            raise ValueError(f'The fit transformer deals with {self.n_columns} columns '
                             f'while the input has {X.shape[1]}.'
                             )
        # Encode each column with its own binarizer, then stack the blocks side by side.
        result = list()
        for i in range(self.n_columns):
            result.append(self.mlbs[i].transform(X.iloc[:, i]))

        result = np.concatenate(result, axis=1)
        return result

# test
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

temp = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["some text", "some other text", "yet another text"],
    "label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]],
    "label2": [["w", "c"], ["b", "c"], ["b", "d"]]
})

col_transformer = ColumnTransformer([
    ('one-hot', OneHotEncoder(), ['id', 'text']),
    ('multi-hot', MultiHotEncoder(), ['label', 'label2'])
])
col_transformer.fit_transform(temp)

You should get:

array([[1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1.],
[0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 1., 1., 0., 0.],
[0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0.]])
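
To see where each encoder's block of columns starts and ends in an array like this one, the block widths can be derived from the fitted categories. This is a minimal sketch that fits the two stock encoders separately on the same toy frame instead of going through the ColumnTransformer; the names `one_hot_widths`, `multi_hot_widths`, and `widths` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder

temp = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["some text", "some other text", "yet another text"],
    "label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]],
    "label2": [["w", "c"], ["b", "c"], ["b", "d"]],
})

# One-hot block: one output column per distinct value in each input column.
ohe = OneHotEncoder().fit(temp[["id", "text"]])
one_hot_widths = [len(cats) for cats in ohe.categories_]

# Multi-hot block: one output column per distinct label in each list-valued column.
multi_hot_widths = [
    len(MultiLabelBinarizer().fit(temp[col]).classes_)
    for col in ["label", "label2"]
]

widths = one_hot_widths + multi_hot_widths
print(widths, sum(widths))  # [3, 3, 5, 4] 15
```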

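Note how the first 3 columns and the next 3 columns are one-hot encoded, while the following 5 columns and the last 4 columns are multi-hot encoded. The category information can be found as usual: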

col_transformer.named_transformers_['one-hot'].categories_

>>> [array([1, 2, 3], dtype=object),
array(['some other text', 'some text', 'yet another text'], dtype=object)]

col_transformer.named_transformers_['multi-hot'].categories_

>>> [array(['black', 'brown', 'cat', 'dog', 'white'], dtype=object),
array(['b', 'c', 'd', 'w'], dtype=object)]
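
Applied back to the frame from the question, the same per-column idea lets CountVectorizer and a multi-hot encoder sit side by side. The block below is a self-contained sketch: `ColumnwiseMultiLabelBinarizer` is a hypothetical, condensed stand-in for the wrapper described in this answer (it is not a scikit-learn class), and it assumes `pandas.DataFrame` input.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer


class ColumnwiseMultiLabelBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: one MultiLabelBinarizer per DataFrame column."""

    def fit(self, X, y=None):
        # Fit a fresh binarizer on every column; each cell holds a list of labels.
        self.binarizers_ = [MultiLabelBinarizer().fit(X[col]) for col in X.columns]
        self.categories_ = [b.classes_ for b in self.binarizers_]
        return self

    def transform(self, X):
        # Encode each column with its own binarizer and stack the blocks side by side.
        blocks = [b.transform(X[col]) for b, col in zip(self.binarizers_, X.columns)]
        return np.concatenate(blocks, axis=1)


df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["some text", "some other text", "yet another text"],
    "label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]],
})

preprocess = ColumnTransformer(
    [
        ("vectorizer", CountVectorizer(), "text"),
        ("binarizer", ColumnwiseMultiLabelBinarizer(), ["label"]),
    ],
    remainder="drop",
)
features = preprocess.fit_transform(df)
print(features.shape)
```

For this toy frame the result has 10 columns: 5 token counts from the text column followed by 5 label indicators. Note that 'label' is passed as a list so the wrapper receives a DataFrame, while 'text' is passed as a plain string so CountVectorizer receives a one-dimensional column.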

Regarding python - sklearn ColumnTransformer with MultiLabelBinarizer, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59254662/
