
python - Not sure how to use sklearn with feature vectors that contain both text and numbers

Reposted. Author: 行者123. Updated: 2023-11-30 08:54:33

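I've just started using sklearn and I want to classify products. The products appear on order lines and have attributes such as a description, a price, a manufacturer, an order quantity and so on. Some of these attributes are text and others are numbers (integers or floats). I want to use these attributes to predict whether a product needs maintenance. The products we buy can be engines, pumps and the like, but also nuts, hoses, filters and so on. So far I have made one prediction based on price and quantity, and separate predictions based on the description or the manufacturer. Now I want to combine these predictions, but I'm not sure how to do that. I've looked at the Pipeline and FeatureUnion pages, but they confuse me. Does anyone have a simple example of how to predict on data that has both text and numeric columns?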

This is what I have now:

order_lines.head(5)

   Part No  Part Description                       Quantity  Price/Base  Supplier Name                       Purch UoM  Category
0  1112165  Duikwerkzaamheden                           1.0      750.00  Duik & Bergingsbedrijf Europa B.V.  pcs        0
1  1112165  Duikwerkzaamheden bij de helling            1.0      500.00  Duik & Bergingsbedrijf Europa B.V.  pcs        0
2  1070285  Inspectie boegschroef, dd. 26-03-2012       1.0        0.01  Duik & Bergingsbedrijf Europa B.V.  pcs        0
3  1037024  Spare parts Albanie Acc. List               1.0     3809.16  Lastechniek Europa B.V.             -          0
4  1037025  M_PO:441.35/BW_INV:0                        1.0        0.00  Exalto                              pcs        0

category_column = order_lines['Category']
order_lines = order_lines[['Part Description', 'Quantity', 'Price/Base', 'Supplier Name', 'Purch UoM']]

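# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split lives in sklearn.model_selection since 0.18
from sklearn.model_selection import train_test_split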
features_train, features_test, target_train, target_test = train_test_split(order_lines, category_column, test_size=0.20)

from sklearn.base import TransformerMixin, BaseEstimator

class FeatureTypeSelector(TransformerMixin, BaseEstimator):
    FEATURE_TYPES = {
        'price and quantity': [
            'Price/Base',
            'Quantity',
        ],
        'description, supplier, uom': [
            'Part Description',
            'Supplier Name',
            'Purch UoM',
        ],
    }

    def __init__(self, feature_type):
        self.columns = self.FEATURE_TYPES[feature_type]

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import RobustScaler

preprocessor = make_union(
    make_pipeline(
        FeatureTypeSelector('price and quantity'),
        RobustScaler(),
    ),
    make_pipeline(
        FeatureTypeSelector('description, supplier, uom'),
        CountVectorizer(),
    ),
)
preprocessor.fit_transform(features_train)

Then I get this error:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-51-f8b0db33462a> in <module>()
----> 1 preprocessor.fit_transform(features_train)

C:\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
500 self._update_transformer_list(transformers)
501 if any(sparse.issparse(f) for f in Xs):
--> 502 Xs = sparse.hstack(Xs).tocsr()
503 else:
504 Xs = np.hstack(Xs)

C:\Anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
462
463 """
--> 464 return bmat([blocks], format=format, dtype=dtype)
465
466

C:\Anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
579 else:
580 if brow_lengths[i] != A.shape[0]:
--> 581 raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
582
583 if bcol_lengths[j] == 0:

ValueError: blocks[0,:] has incompatible row dimensions
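
As a note on the traceback above, here is a minimal sketch of where the row mismatch most likely comes from. Iterating over a pandas DataFrame yields its column labels, so a `CountVectorizer` that is handed a multi-column DataFrame sees one "document" per column rather than one per row. The two-row frame below is made up for illustration and only borrows the column names from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-row frame with the three text columns from the question
text_columns = pd.DataFrame({
    'Part Description': ['Duikwerkzaamheden', 'Spare parts Albanie Acc. List'],
    'Supplier Name': ['Exalto', 'Lastechniek Europa B.V.'],
    'Purch UoM': ['pcs', 'pcs'],
})

# Iterating over a DataFrame yields its column labels, not its rows
documents_seen = list(text_columns)

# CountVectorizer therefore treats each column label as one "document"
counts = CountVectorizer().fit_transform(text_columns)

n_rows_in = len(text_columns)   # rows passed in
n_rows_out = counts.shape[0]    # rows produced by the vectorizer
print(documents_seen, n_rows_in, n_rows_out)
```

The numeric branch of the union keeps one output row per input row, while this branch produces one row per column, so the two blocks cannot be stacked side by side.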

Best answer

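I would advise against making predictions on the different feature types separately and then combining them. You are better off using FeatureUnion, as you suggest, which lets you create a separate preprocessing pipeline for each feature type. A construction I often use is the following...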

Let's define a toy example dataset to play around with:

import pandas as pd

# create a pandas dataframe that contains your features
X = pd.DataFrame({'quantity': [13, 7, 42, 11],
                  'item_name': ['nut', 'bolt', 'bolt', 'chair'],
                  'item_type': ['hardware', 'hardware', 'hardware', 'furniture'],
                  'item_price': [1.95, 4.95, 2.79, 19.95]})

# create corresponding target (this is often just one of the dataframe columns)
y = pd.Series([0, 1, 1, 0], index=X.index)

I use Pipeline and FeatureUnion (or rather their simpler shortcuts make_pipeline and make_union) to glue everything together:

from sklearn.pipeline import make_union, make_pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression

# create your preprocessor that handles different feature types separately
preprocessor = make_union(
    make_pipeline(
        FeatureTypeSelector('continuous'),
        RobustScaler(),
    ),
    make_pipeline(
        FeatureTypeSelector('categorical'),
        RowToDictTransformer(),
        DictVectorizer(sparse=False),  # set sparse=True if you get MemoryError
    ),
)

# example use of your combined preprocessor
preprocessor.fit_transform(X)

# choose some estimator
estimator = LogisticRegression()

# your prediction model can be created as follows
model = make_pipeline(preprocessor, estimator)

# and training is done as follows
model.fit(X, y)

# predict (preferably not on training data X)
model.predict(X)

Here I defined my own custom transformers, FeatureTypeSelector and RowToDictTransformer, as follows:

from sklearn.base import TransformerMixin, BaseEstimator


class FeatureTypeSelector(TransformerMixin, BaseEstimator):
    """ Selects a subset of features based on their type """

    FEATURE_TYPES = {
        'categorical': [
            'item_name',
            'item_type',
        ],
        'continuous': [
            'quantity',
            'item_price',
        ]
    }

    def __init__(self, feature_type):
        # store the constructor argument under its own name so that
        # get_params() and clone() keep working
        self.feature_type = feature_type

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # look up the column names for this feature type and select them
        return X[self.FEATURE_TYPES[self.feature_type]]


class RowToDictTransformer(TransformerMixin, BaseEstimator):
    """ Prepare dataframe for DictVectorizer """

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return (row[1] for row in X.iterrows())
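
The same structure could be carried over to the columns in the question roughly as sketched below. This is an illustration under assumptions and not part of the original answer: `ColumnSelector` is a hypothetical helper, the rows and labels are made up, each text column gets its own `CountVectorizer` (which expects a 1-D sequence of strings), and the text columns are assumed to contain no missing values.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import RobustScaler


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: return X[columns] (a Series for a single name)."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


# Made-up rows that only mimic the layout of the question's order lines
order_lines = pd.DataFrame({
    'Part Description': ['Fuel pump', 'Hex nut M8', 'Oil filter', 'Diesel engine'],
    'Quantity': [1.0, 200.0, 10.0, 1.0],
    'Price/Base': [750.00, 0.05, 12.50, 38000.00],
    'Supplier Name': ['Acme Marine', 'Bolt Depot', 'Acme Marine', 'Acme Marine'],
    'Purch UoM': ['pcs', 'box', 'pcs', 'pcs'],
})
category = pd.Series([1, 0, 0, 1], index=order_lines.index)  # made-up labels

preprocessor = make_union(
    # numeric columns: select a 2-D frame, then scale
    make_pipeline(ColumnSelector(['Quantity', 'Price/Base']), RobustScaler()),
    # text columns: select one 1-D Series each, then count tokens
    make_pipeline(ColumnSelector('Part Description'), CountVectorizer()),
    make_pipeline(ColumnSelector('Supplier Name'), CountVectorizer()),
    make_pipeline(ColumnSelector('Purch UoM'), CountVectorizer()),
)

# every branch returns one row per order line, so the blocks stack cleanly
features = preprocessor.fit_transform(order_lines)

model = make_pipeline(preprocessor, LogisticRegression())
model.fit(order_lines, category)
predictions = model.predict(order_lines)  # preferably predict on held-out data
```

Selecting a single column name rather than a list is the important detail here: it hands the vectorizer a Series of strings, one per row, instead of a DataFrame.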

Hopefully this example paints a somewhat clearer picture of how to do a feature union.

-克里斯
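
In scikit-learn 0.20 and later, `sklearn.compose.ColumnTransformer` can play the role of the custom selector classes. The sketch below is a swapped-in alternative to the FeatureUnion approach in the answer, not the answer author's method. It reuses the toy dataset from the answer and, as an assumption made for illustration, treats `item_name` as free text and `item_type` as a category.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# toy dataset from the answer
X = pd.DataFrame({'quantity': [13, 7, 42, 11],
                  'item_name': ['nut', 'bolt', 'bolt', 'chair'],
                  'item_type': ['hardware', 'hardware', 'hardware', 'furniture'],
                  'item_price': [1.95, 4.95, 2.79, 19.95]})
y = pd.Series([0, 1, 1, 0], index=X.index)

preprocessor = ColumnTransformer([
    # a list of names passes a 2-D frame to the transformer
    ('continuous', RobustScaler(), ['quantity', 'item_price']),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['item_type']),
    # a single name (no list) passes a 1-D column, as CountVectorizer expects
    ('text', CountVectorizer(), 'item_name'),
])

model = make_pipeline(preprocessor, LogisticRegression())
model.fit(X, y)
predictions = model.predict(X)  # preferably predict on held-out data
transformed = preprocessor.transform(X)
```

Columns that are not listed are dropped by default (`remainder='drop'`).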

Regarding "python - Not sure how to use sklearn with feature vectors that contain both text and numbers", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39007083/
