python - Packaging MultiLabelBinarizer into a scikit-learn Pipeline for inference on new data

Reposted by: 太空宇宙 | Updated: 2023-11-03 20:24:46

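I'm building a multi-label classifier to predict labels from a text field, for example predicting genres from a movie title. I want to use MultiLabelBinarizer() to binarize the column that contains all applicable genre labels. For example, ['action','comedy','drama'] is split into three columns with 0/1 values.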

I use MultiLabelBinarizer() because its built-in inverse_transform() function converts an output array (e.g. array([0, 0, 1, 0, 1])) directly into user-friendly text output (['action','drama']).
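As a minimal sketch of the round trip described above, the snippet below fits a MultiLabelBinarizer on a few genre lists and decodes one indicator row back into labels. The sample genre lists are invented for illustration and are not from the original data.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Toy genre lists, invented for illustration only
genres = [["action", "comedy", "drama"], ["drama"], ["action", "thriller"]]

mlb = MultiLabelBinarizer()
# One 0/1 column per distinct label, one row per sample
y = mlb.fit_transform(genres)

# Column order follows the sorted label names
classes = list(mlb.classes_)

# Decode a 0/1 indicator row back into a tuple of label names
decoded = mlb.inverse_transform(np.array([[1, 0, 1, 0]]))

print(classes)
print(y)
print(decoded)
```

Note that inverse_transform() returns one tuple of labels per input row, so a single prediction comes back as a one-element list.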

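The classifier works, but I'm having trouble predicting on new data. I can't find a way to integrate MultiLabelBinarizer() into my pipeline so that it can be saved and reloaded for inference on new data. One solution is to save it separately as a pickle object and load it back every time, but I'd like to avoid that dependency in production.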

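I know this is similar to the tf-idf vectorizer that I built into the pipeline, but the difference is that it is applied to the target column (genre labels) rather than to my independent variable (text comments). Here is my code for training the multi-label SVM: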

def svm_train(df):  
  mlb = MultiLabelBinarizer()
  y = mlb.fit_transform(df['Genres'])

  with mlflow.start_run():
    x_train, x_test, y_train, y_test = train_test_split(df['Movie Title'], y, test_size=0.3)

    # Instantiate TF-IDF Vectorizer and SVM Model
    tfidf_vect = TfidfVectorizer()
    mdl = OneVsRestClassifier(LinearSVC(loss='hinge'))
    svm_pipeline = Pipeline([('tfidf', tfidf_vect), ('clf', mdl)])

    svm_pipeline.fit(x_train, y_train)
    prediction = svm_pipeline.predict(x_test)

    report = classification_report(y_test, prediction, target_names=mlb.classes_)

    mlflow.sklearn.log_model(svm_pipeline, "Multilabel Classifier")
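    # mlflow.log_artifact expects a local file path rather than a Python object,
    # so the binarizer is pickled to disk first ("mlb.pkl" is an illustrative name)
    import pickle
    with open("mlb.pkl", "wb") as handle:
      pickle.dump(mlb, handle)
    mlflow.log_artifact("mlb.pkl", "MLB")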

  return report

svm_train(df)

Inference consists of reloading the saved model from MLflow in a separate Databricks notebook (the same as loading back a pickle file) and predicting with the pipeline:

def predict_labels(new_data):
  model_uri = '...MLflow path...'
  model = mlflow.sklearn.load_model(model_uri)
  predictions = model.predict(new_data)
  # If I can't package the MultiLabelBinarizer() into the Pipeline, this 
  # is where I'd have to load the pickle object mlb
  # so that I can inverse_transform()
  return mlb.inverse_transform(predictions)

new_data = ['Some movie title']
predict_labels(new_data)

['action','comedy']

These are all the libraries I'm using:

import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import glob, os
from pyspark.sql import DataFrame
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import svm
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score

Best answer

For your use case, you may want to consider using MLflow's functionality for persisting custom models. According to the docs:

While MLflow’s built-in model persistence utilities are convenient for packaging models from various popular ML libraries in MLflow Model format, they do not cover every use case. For example, you may want to use a model from an ML library that is not explicitly supported by MLflow’s built-in flavors. Alternatively, you may want to package custom inference code and data to create an MLflow Model. Fortunately, MLflow provides two solutions that can be used to accomplish these tasks: Custom Python Models and Custom Flavors.

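In particular, you should be able to log the MultiLabelBinarizer as an artifact alongside the sklearn model, in a manner similar to the XGBoost model in the linked example, and then load it back at prediction time, as follows: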

# Save sklearn model & multilabel indexer to paths on the local filesystem
sklearn_model_path = "some/local/path"
labelindexer_path = "another/local/path"
# ... save your models objects here to sklearn_model_path and labelindexer_path

# Define the custom model class
import mlflow.pyfunc
class SklearnWrapper(mlflow.pyfunc.PythonModel):
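    def load_context(self, context):
        import pickle
        import mlflow.sklearn
        # Artifact locations are exposed through context.artifacts, keyed by the
        # names used in the `artifacts` dict passed to save_model below
        with open(context.artifacts["indexer_path"], 'rb') as handle:
            self.indexer = pickle.load(handle)
        self.pipeline = mlflow.sklearn.load_model(context.artifacts["pipeline_path"])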

    def predict(self, context, model_input):
        pipeline_preds = self.pipeline.predict(model_input)
        return self.indexer.inverse_transform(pipeline_preds)

# Create a Conda environment for the new MLflow Model that contains the scikit-learn library
# as a dependency, as well as the required CloudPickle library
import cloudpickle
import sklearn
conda_env = {
    'channels': ['defaults'],
    'dependencies': [
      'scikit-learn={}'.format(sklearn.__version__),
      'cloudpickle={}'.format(cloudpickle.__version__),
    ],
    'name': 'sklearn_env'
}

# Save the MLflow Model
artifacts = {
    "pipeline_path": sklearn_model_path,
    "indexer_path": labelindexer_path,
}
mlflow_pyfunc_model_path = "sklearn_mlflow_pyfunc"
mlflow.pyfunc.save_model(
        path=mlflow_pyfunc_model_path, python_model=SklearnWrapper(), artifacts=artifacts,
        conda_env=conda_env)

# Load the model in `python_function` format
loaded_model = mlflow.pyfunc.load_model(mlflow_pyfunc_model_path)
# Predict on a pandas DataFrame
import pandas as pd
loaded_model.predict(pd.DataFrame(...))

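Note that our custom model still loads the MultiLabelBinarizer back in, but MLflow persists the binarizer together with your pipeline and custom model logic, so that you can treat the model as a single coherent unit for production deployment.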
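The same "single unit" idea can be illustrated without MLflow. The sketch below is an alternative under stated assumptions, not the approach from the answer: it pickles the fitted pipeline and the fitted binarizer together in one dictionary, so that a single load yields everything needed to turn new titles into label names. The toy titles and genres, the dictionary keys, and the `predict_labels` helper are all hypothetical.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration only
titles = ["space battle war", "funny wedding party",
          "space comedy party", "war drama story"]
genres = [["action"], ["comedy"], ["action", "comedy"], ["drama"]]

# Binarize the targets outside the pipeline, as in the question
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(genres)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LinearSVC(loss="hinge"))),
])
pipeline.fit(titles, y)

# Serialize both fitted objects together as one payload
# (the keys "pipeline" and "mlb" are arbitrary, hypothetical names)
bundle_bytes = pickle.dumps({"pipeline": pipeline, "mlb": mlb})


def predict_labels(serialized_bundle, new_titles):
    """Hypothetical helper: load the bundle and return label tuples."""
    bundle = pickle.loads(serialized_bundle)
    indicator = bundle["pipeline"].predict(new_titles)
    return bundle["mlb"].inverse_transform(indicator)


labels = predict_labels(bundle_bytes, ["space party"])
print(labels)
```

In practice the bytes would be written to a file or a model store rather than kept in memory. The trade-off compared with the MLflow custom model is that this bundle carries no environment specification, so the loading side has to provide a compatible scikit-learn version itself.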

Regarding "python - Packaging MultiLabelBinarizer into a scikit-learn Pipeline for inference on new data", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57924929/
