python - Custom pipeline for different data types in scikit-learn

Reposted. Author: 行者123. Last updated: 2023-11-30 09:09:01

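I am currently trying to predict whether a Kickstarter project will succeed, based on a set of integer features and some text features. I am thinking of building a pipeline modeled on the heterogeneous feature union example from the scikit-learn documentation.

Reference: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py

Here is my ItemSelector and the pipeline code.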

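from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, y=None):
        # Nothing to learn; the selector is stateless
        return self

    def transform(self, data_dict):
        # Select the requested column(s) from the input
        return data_dict[self.keys]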

I verified that ItemSelector works as expected:

t = ItemSelector(['cleaned_text'])
t.transform(df)

It extracts the necessary columns.

The pipeline:

pipeline = Pipeline([
    # Use FeatureUnion to combine the text features and the integer features
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for extracting TF-IDF features from the cleaned text
            ('text', Pipeline([
                ('selector', ItemSelector(['cleaned_text'])),
                ('counts', CountVectorizer()),
                ('tf_idf', TfidfTransformer())
            ])),

            # Integer features, selected as they are
            ('integer_features', ItemSelector(int_features)),
        ]
    )),

    # Use an SVC classifier on the combined features
    ('svc', SVC(kernel='linear')),
])

But when I run pipeline.fit(X_train, y_train), I get the following error. Any idea how to fix it?

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-27-317e1c402966> in <module>()
----> 1 pipeline.fit(X_train, y_train)

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
266 This estimator
267 """
--> 268 Xt, fit_params = self._fit(X, y, **fit_params)
269 if self._final_estimator is not None:
270 self._final_estimator.fit(Xt, y, **fit_params)

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
232 pass
233 elif hasattr(transform, "fit_transform"):
--> 234 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
235 else:
236 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
740 self._update_transformer_list(transformers)
741 if any(sparse.issparse(f) for f in Xs):
--> 742 Xs = sparse.hstack(Xs).tocsr()
743 else:
744 Xs = np.hstack(Xs)

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
456
457 """
--> 458 return bmat([blocks], format=format, dtype=dtype)
459
460

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
577 exp=brow_lengths[i],
578 got=A.shape[0]))
--> 579 raise ValueError(msg)
580
581 if bcol_lengths[j] == 0:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 81096, expected 1.

Best answer

ItemSelector returns a DataFrame, not an array. That is why scipy.sparse.hstack raises the error. Change ItemSelector as follows:

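class ItemSelector(BaseEstimator, TransformerMixin):
    ....
    ....
    ....

    def transform(self, data_dict):
        # Return a NumPy array instead of a DataFrame.
        # to_numpy() replaces as_matrix(), which was removed in pandas 1.0.
        return data_dict[self.keys].to_numpy()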

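The error occurs in the integer_features part of the pipeline. In the first part, text, the transformers that follow ItemSelector accept a DataFrame, so it is converted to an array correctly. The second part, however, consists of ItemSelector alone and returns a DataFrame.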
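The traceback reports that one block has 1 row while the other has 81096, so it can help to check how many rows each branch of the union produces. The following is a minimal sketch on a hypothetical toy frame; the goal and backers columns and all values are made up for illustration and are not taken from the question. In this sketch, CountVectorizer iterates over whatever it is given: a one-column DataFrame yields its column label, so it sees a single document, whereas a Series yields one document per row. Whether this is what happened with the question's data is an assumption worth verifying, not something the answer states.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the Kickstarter frame
toy = pd.DataFrame({
    "cleaned_text": ["solar powered lamp", "board game about cats", "open source drone"],
    "goal": [10, 50, 200],
    "backers": [120, 15, 4],
})

# List key -> one-column DataFrame; iterating it yields column labels
text_from_frame = CountVectorizer().fit_transform(toy[["cleaned_text"]])
# String key -> Series; iterating it yields the documents themselves
text_from_series = CountVectorizer().fit_transform(toy["cleaned_text"])
# Integer features as a sparse 2-D block with one row per sample
ints = sparse.csr_matrix(toy[["goal", "backers"]].to_numpy())

rows_frame = text_from_frame.shape[0]
rows_series = text_from_series.shape[0]
print("rows from DataFrame input:", rows_frame)
print("rows from Series input:", rows_series)

# Stacking blocks with different row counts fails, as in the traceback
try:
    sparse.hstack([text_from_frame, ints])
    stacked_ok = True
except ValueError as exc:
    stacked_ok = False
    print("hstack failed:", exc)

# Blocks with matching row counts stack without error
stacked = sparse.hstack([text_from_series, ints]).tocsr()
print("combined shape:", stacked.shape)
```

If the text branch reports a single row, selecting the text column with a string key, for example ItemSelector('cleaned_text'), returns a Series and gives one row per document.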

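Update:

In the comments you mentioned that you want to perform some operations on the DataFrame returned by ItemSelector. In that case, instead of modifying ItemSelector's transform method, you can create a new transformer and append it to the second part of the pipeline.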

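class DataFrameToArrayTransformer(BaseEstimator, TransformerMixin):
    # No parameters to store, so no __init__ is needed

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        # Convert the DataFrame to a NumPy array.
        # to_numpy() replaces as_matrix(), which was removed in pandas 1.0.
        return X.to_numpy()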

Your pipeline should then look like this:

pipeline = Pipeline([
    # Use FeatureUnion to combine the text features and the integer features
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for extracting TF-IDF features from the cleaned text
            ('text', Pipeline([
                ('selector', ItemSelector(['cleaned_text'])),
                ('counts', CountVectorizer()),
                ('tf_idf', TfidfTransformer())
            ])),

            # Pipeline for selecting the integer features and converting them to an array
            ('integer', Pipeline([
                ('integer_features', ItemSelector(int_features)),
                ('array', DataFrameToArrayTransformer()),
            ])),
        ]
    )),

    # Use an SVC classifier on the combined features
    ('svc', SVC(kernel='linear')),
])

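The main thing to understand here is that FeatureUnion only handles 2-D arrays when it combines the outputs of its transformers, so any other type (a DataFrame, for example) can cause problems.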
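As a rough end-to-end illustration of that point, here is a sketch of the second variant fitted on a hypothetical four-row frame. The toy data, the goal and backers column names, and the labels are invented for the example. One deliberate deviation from the code above, flagged as an assumption: the text column is selected with a string key so that CountVectorizer receives a Series of documents.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select one column (string key) or several columns (list of keys)."""

    def __init__(self, keys):
        self.keys = keys

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.keys]


class DataFrameToArrayTransformer(BaseEstimator, TransformerMixin):
    """Convert a DataFrame to a 2-D NumPy array."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.to_numpy()


# Hypothetical toy data; not the Kickstarter data from the question
toy = pd.DataFrame({
    "cleaned_text": [
        "solar powered lamp",
        "board game about cats",
        "open source drone",
        "cat calendar",
    ],
    "goal": [10, 50, 200, 3],
    "backers": [120, 15, 4, 80],
})
y = np.array([1, 0, 0, 1])
int_features = ["goal", "backers"]

union = FeatureUnion(transformer_list=[
    ("text", Pipeline([
        # Assumption: string key, so a Series of documents is passed on
        ("selector", ItemSelector("cleaned_text")),
        ("counts", CountVectorizer()),
        ("tf_idf", TfidfTransformer()),
    ])),
    ("integer", Pipeline([
        ("integer_features", ItemSelector(int_features)),
        ("array", DataFrameToArrayTransformer()),
    ])),
])

pipeline = Pipeline([
    ("union", union),
    ("svc", SVC(kernel="linear")),
])
pipeline.fit(toy, y)

# Both branches yield one row per sample, so the blocks can be stacked
combined = pipeline.named_steps["union"].transform(toy)
predictions = pipeline.predict(toy)
print("combined shape:", combined.shape)
print("predictions:", predictions)
```

Both branches hand FeatureUnion a 2-D block with the same number of rows (a sparse TF-IDF matrix and a dense integer array), which is what the horizontal stacking step needs.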

Regarding python - custom pipeline for different data types in scikit-learn, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45048615/
