python - Don't know how to use sklearn with feature vectors that contain both text and numbers

Reposted by: 行者123. Updated: 2023-11-30 08:54:33

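I have just started using sklearn and want to classify products. The products appear on order lines and have attributes such as description, price, manufacturer, and order quantity. Some of these attributes are text; others are numeric (integer or float). I want to use these attributes to predict whether a product requires maintenance. The products we buy range from engines and pumps to nuts, hoses, and filters. So far I have made one prediction based on price and quantity, and separate predictions based on description or manufacturer. Now I want to combine these predictions, but I don't know how. I have read the Pipeline and FeatureUnion pages, but they confuse me. Does anyone have a simple example of how to predict on data that has both text and numeric columns?

This is what I have now: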

order_lines.head(5)

   Part No  Part Description                       Quantity  Price/Base  Supplier Name                       Purch UoM  Category
0  1112165  Duikwerkzaamheden                      1.0       750.00      Duik & Bergingsbedrijf Europa B.V.  pcs        0
1  1112165  Duikwerkzaamheden bij de helling       1.0       500.00      Duik & Bergingsbedrijf Europa B.V.  pcs        0
2  1070285  Inspectie boegschroef, dd. 26-03-2012  1.0       0.01        Duik & Bergingsbedrijf Europa B.V.  pcs        0
3  1037024  Spare parts Albanie Acc. List          1.0       3809.16     Lastechniek Europa B.V.             -          0
4  1037025  M_PO:441.35/BW_INV:0                   1.0       0.00        Exalto                              pcs        0

category_column = order_lines['Category']
order_lines = order_lines[['Part Description', 'Quantity', 'Price/Base', 'Supplier Name', 'Purch UoM']]

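# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split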
features_train, features_test, target_train, target_test = train_test_split(order_lines, category_column, test_size=0.20)

from sklearn.base import TransformerMixin, BaseEstimator

class FeatureTypeSelector(TransformerMixin, BaseEstimator):
    FEATURE_TYPES = {
        'price and quantity': [
            'Price/Base',
            'Quantity',
        ],
        'description, supplier, uom': [
            'Part Description',
            'Supplier Name',
            'Purch UoM',
        ],
    }
    def __init__(self, feature_type):
        self.columns = self.FEATURE_TYPES[feature_type]

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import RobustScaler

preprocessor = make_union(
    make_pipeline(
        FeatureTypeSelector('price and quantity'),
        RobustScaler(),
    ),
    make_pipeline(
        FeatureTypeSelector('description, supplier, uom'),
        CountVectorizer(),
    ),
)
preprocessor.fit_transform(features_train)

Then I got this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-f8b0db33462a> in <module>()
----> 1 preprocessor.fit_transform(features_train)

C:\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
    500         self._update_transformer_list(transformers)
    501         if any(sparse.issparse(f) for f in Xs):
--> 502             Xs = sparse.hstack(Xs).tocsr()
    503         else:
    504             Xs = np.hstack(Xs)

C:\Anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
    462 
    463     """
--> 464     return bmat([blocks], format=format, dtype=dtype)
    465 
    466 

C:\Anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
    579                 else:
    580                     if brow_lengths[i] != A.shape[0]:
--> 581                         raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
    582 
    583                 if bcol_lengths[j] == 0:

ValueError: blocks[0,:] has incompatible row dimensions
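
The thread does not spell out why the row counts differ, so the following is an interpretation rather than something the answer states. CountVectorizer iterates over whatever it is given, and iterating over a pandas DataFrame yields its column labels instead of its rows. With three text columns selected, the text branch would therefore produce three rows, while the numeric branch produces one row per order line. A minimal sketch with made-up data (the frame below is a hypothetical stand-in for the asker's columns):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the three text columns selected in the question
text_columns = pd.DataFrame({
    'Part Description': ['Duikwerkzaamheden', 'Spare parts', 'Inspectie boegschroef', 'Hose clamp'],
    'Supplier Name': ['Exalto', 'Lastechniek', 'Exalto', 'Exalto'],
    'Purch UoM': ['pcs', 'pcs', 'pcs', 'pcs'],
})

# Iterating over a DataFrame yields its column labels, not its rows,
# so CountVectorizer treats each label as one "document".
counts = CountVectorizer().fit_transform(text_columns)

n_rows_in = len(text_columns)   # number of order lines
n_rows_out = counts.shape[0]    # number of documents CountVectorizer saw
print(n_rows_in, n_rows_out)
```

If the two numbers differ, stacking this block next to the scaled numeric block fails in the way the traceback shows. Passing a single column (a 1-D Series of strings) to each CountVectorizer avoids the mismatch.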

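Best answer

I would advise against making predictions on the different feature types separately and then combining them. You are better off using FeatureUnion, as you suggest, which lets you create a separate preprocessing pipeline for each feature type. A structure I often use looks like this.

Let's define a toy dataset to play with: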

import pandas as pd

# create a pandas dataframe that contains your features
X = pd.DataFrame({'quantity': [13, 7, 42, 11],
                  'item_name': ['nut', 'bolt', 'bolt', 'chair'],
                  'item_type': ['hardware', 'hardware', 'hardware', 'furniture'],
                  'item_price': [1.95, 4.95, 2.79, 19.95]})

# create corresponding target (this is often just one of the dataframe columns)
y = pd.Series([0, 1, 1, 0], index=X.index)

I use Pipeline and FeatureUnion (or rather their simpler shortcuts make_pipeline and make_union) to glue everything together:

from sklearn.pipeline import make_union, make_pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression

# create your preprocessor that handles different feature types separately
preprocessor = make_union(
    make_pipeline(
        FeatureTypeSelector('continuous'),
        RobustScaler(),
    ),
    make_pipeline(
        FeatureTypeSelector('categorical'),
        RowToDictTransformer(),
        DictVectorizer(sparse=False),  # set sparse=True if you get MemoryError
    ),
)

# example use of your combined preprocessor
preprocessor.fit_transform(X)

# choose some estimator
estimator = LogisticRegression()

# your prediction model can be created as follows
model = make_pipeline(preprocessor, estimator)

# and training is done as follows
model.fit(X, y)

# predict (preferably not on training data X)
model.predict(X)

Here, FeatureTypeSelector and RowToDictTransformer are my own custom transformers, defined as follows:

from sklearn.base import TransformerMixin, BaseEstimator


class FeatureTypeSelector(TransformerMixin, BaseEstimator):
    """ Selects a subset of features based on their type """

    FEATURE_TYPES = {
        'categorical': [
            'item_name',
            'item_type',
        ],
        'continuous': [
            'quantity',
            'item_price',
        ]
    }

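    def __init__(self, feature_type):
        # store the constructor argument unchanged so get_params() and clone() work
        self.feature_type = feature_type

    def fit(self, X, y=None):
        # nothing to learn; the column lists are fixed
        return self

    def transform(self, X):
        # look up the columns for this feature type and return that sub-frame
        return X[self.FEATURE_TYPES[self.feature_type]]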


class RowToDictTransformer(TransformerMixin, BaseEstimator):
    """ Prepare dataframe for DictVectorizer """

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # a list of per-row dicts can be iterated more than once, unlike a one-shot generator
        return X.to_dict(orient='records')
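
The toy example treats every non-numeric column as categorical, whereas the question also has free-text descriptions. One way the same make_union structure might be adapted is sketched below. It is not part of the original answer: `ColumnSelector` is a hypothetical helper, the data are invented, and giving each text column its own CountVectorizer is an assumption about what suits the asker's data.

```python
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import RobustScaler


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: a single label gives a 1-D Series, a list gives a 2-D frame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


# Invented order lines that reuse the question's column names
order_lines = pd.DataFrame({
    'Part Description': ['Duikwerkzaamheden', 'Hydraulic pump', 'Hose clamp', 'Diesel engine'],
    'Quantity': [1.0, 2.0, 50.0, 1.0],
    'Price/Base': [750.00, 1200.00, 0.35, 9800.00],
    'Supplier Name': ['Exalto', 'Lastechniek Europa B.V.', 'Exalto', 'Lastechniek Europa B.V.'],
})
category = pd.Series([0, 1, 0, 1])

preprocessor = make_union(
    # numeric columns: select as a 2-D frame, then scale
    make_pipeline(ColumnSelector(['Quantity', 'Price/Base']), RobustScaler()),
    # each text column: select as a 1-D Series so there is one document per row
    make_pipeline(ColumnSelector('Part Description'), CountVectorizer()),
    make_pipeline(ColumnSelector('Supplier Name'), CountVectorizer()),
)

features = preprocessor.fit_transform(order_lines)

model = make_pipeline(preprocessor, LogisticRegression())
model.fit(order_lines, category)
predictions = model.predict(order_lines)  # predicting on training data only to show the call
```

Every branch now returns one row per order line, so the union can stack them side by side.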

Hopefully this example paints a clearer picture of how to do the feature union.

- Chris
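
For comparison, scikit-learn 0.20 and later ship `ColumnTransformer`, which routes columns to transformers without custom selector classes. The answer above does not use it; the sketch below is an alternative on the same toy data, with `OneHotEncoder` swapped in for the RowToDictTransformer and DictVectorizer pair.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Same toy data as in the answer
X = pd.DataFrame({'quantity': [13, 7, 42, 11],
                  'item_name': ['nut', 'bolt', 'bolt', 'chair'],
                  'item_type': ['hardware', 'hardware', 'hardware', 'furniture'],
                  'item_price': [1.95, 4.95, 2.79, 19.95]})
y = pd.Series([0, 1, 1, 0], index=X.index)

# Each entry is (name, transformer, columns); the outputs are stacked column-wise
preprocessor = ColumnTransformer([
    ('continuous', RobustScaler(), ['quantity', 'item_price']),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['item_name', 'item_type']),
])

# 2 scaled columns + 3 item_name indicators + 2 item_type indicators
transformed = preprocessor.fit_transform(X)

model = make_pipeline(preprocessor, LogisticRegression())
model.fit(X, y)
predictions = model.predict(X)  # predicting on training data only to show the call
```

Setting `handle_unknown='ignore'` keeps prediction from failing on categories that were absent during training, which is likely to matter for supplier or item names.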

The original question, "python - Don't know how to use sklearn with feature vectors that contain both text and numbers", is on Stack Overflow: https://stackoverflow.com/questions/39007083/
