
python - Why is the output of a preprocessing step in sklearn.pipeline inconsistent?

Reposted. Author: 行者123. Updated: 2023-11-30 08:53:42

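I am working through the book "Hands On Machine Learning" and writing some transformation pipeline code to clean my data. I found that the output of the same pipeline changes depending on the size of the DataFrame I pass in. Here is the code: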

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
# Imputer is the older API used in the question; it was removed in
# scikit-learn 0.22 in favour of sklearn.impute.SimpleImputer.
from sklearn.preprocessing import Imputer, LabelBinarizer, StandardScaler

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given DataFrame columns and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        enc = LabelBinarizer(sparse_output=self.sparse_output)
        return enc.fit_transform(X)

# housing, housing_num and CombinedAttributesAdder come from the book's
# earlier code and are not shown here.
num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scalar', StandardScaler())
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', CustomLabelBinarizer())
])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])
housing_prepared = full_pipeline.fit_transform(housing)
data_prepared = full_pipeline.transform(housing.iloc[:5])
data_prepared1 = full_pipeline.transform(housing.iloc[:1000])
data_prepared2 = full_pipeline.transform(housing.iloc[:10000])
print(data_prepared.shape)
print(data_prepared1.shape)
print(data_prepared2.shape)

The three print statements output (5, 14), (1000, 15) and (10000, 16). Can anyone help me understand why?

Best answer

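That is because CustomLabelBinarizer fits a new LabelBinarizer on every call to transform(). Each call therefore learns only the labels present in the rows it is given, so the number of output columns depends on which categories happen to appear in that slice of the data.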
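To see the effect in isolation, here is a minimal sketch. The `categories` array and the `refit_and_transform` helper are made up for illustration (they stand in for the `ocean_proximity` column and the buggy `transform()`); only `LabelBinarizer` comes from the question.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy column standing in for housing['ocean_proximity'].
categories = np.array(
    ["INLAND", "NEAR BAY", "INLAND", "ISLAND", "NEAR OCEAN", "<1H OCEAN"]
)

def refit_and_transform(values):
    """Mimic the buggy transform(): fit a fresh LabelBinarizer on the input."""
    return LabelBinarizer().fit_transform(values)

# The first four rows contain 3 distinct labels, the full column contains 5,
# so the two calls produce a different number of columns.
shape_first_4 = refit_and_transform(categories[:4]).shape
shape_all = refit_and_transform(categories).shape

print(shape_first_4)
print(shape_all)
```

The column count follows the number of distinct labels in the input, which is the same thing that happens when the pipeline is given 5, 1000, or 10000 rows.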

Change it to:

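from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        # Learn the set of labels once, from the data passed to fit().
        self.enc = LabelBinarizer(sparse_output=self.sparse_output)
        self.enc.fit(X)
        return self
    def transform(self, X, y=None):
        # Reuse the fitted encoder so the columns are the same for any input.
        return self.enc.transform(X)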

With that change I get the correct shapes for your code:

(5, 14)
(1000, 14)
(10000, 14)
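If you want to check the fixed transformer without the housing data, something like the following should do. The `toy` array is a made-up stand-in for the `ocean_proximity` column; the class is the one from this answer.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        # Fit the encoder once; transform() only reuses it.
        self.enc = LabelBinarizer(sparse_output=self.sparse_output)
        self.enc.fit(X)
        return self
    def transform(self, X, y=None):
        return self.enc.transform(X)

# Hypothetical toy column with 5 distinct labels.
toy = np.array(
    ["INLAND", "NEAR BAY", "INLAND", "ISLAND", "NEAR OCEAN", "<1H OCEAN"]
)

binarizer = CustomLabelBinarizer().fit(toy)
small = binarizer.transform(toy[:2])  # only 2 distinct labels in this slice
full = binarizer.transform(toy)

print(small.shape)
print(full.shape)
```

Both results have one column per label seen during `fit()`, regardless of how many rows or distinct labels are passed to `transform()`.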

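Note: the same question has been asked here. I am assuming you are using the link here for the code. If you are using any other site, its code may be an older version of the code I linked. Try the code at the link above for an updated, error-free version.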

Regarding "python - Why is the output of a preprocessing step in sklearn.pipeline inconsistent?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49987686/
