apache-spark - Spark.ml regressions do not calculate the same models as scikit-learn

Reposted by: 行者123. Updated: 2023-12-04 10:23:18

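I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the two libraries learn different models, and I can't figure out why (the data is the same, the model type is the same, the regularization is the same...).

No doubt I am missing some setting on one side or the other. Which setting? How should I set up either scikit-learn or spark.ml so that it finds the same model as its counterpart?

I give the sklearn code and the spark.ml code below. Both should be ready to cut, paste, and run.

scikit-learn code: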

import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

X = np.array([
    [-0.7306653538519616, 0.0],
    [0.6750417712898752, -0.4232874171873786],
    [0.1863463229359709, -0.8163423997075965],
    [-0.6719842051493347, 0.0],
    [0.9699938346531928, 0.0],
    [0.22759406190283604, 0.0],
    [0.9688721028330911, 0.0],
    [0.5993795346650845, 0.0],
    [0.9219423508390701, -0.8972778242305388],
    [0.7006904841584055, -0.5607635619919824]
])

y = np.array([
    0.0,
    1.0,
    1.0,
    0.0,
    1.0,
    1.0,
    1.0,
    0.0,
    0.0,
    0.0
])

m, n = X.shape

# Add intercept term to simulate inputs to GameEstimator
X_with_intercept = np.hstack((X, np.ones(m)[:,np.newaxis]))

l = 0.3
e = LogisticRegression(
    fit_intercept=False,
    penalty='l2',
    C=1/l,
    max_iter=100,
    tol=1e-11)

e.fit(X_with_intercept, y)

print(e.coef_)
# => [[ 0.98662189  0.45571052 -0.23467255]]

# Linear regression is called Ridge in sklearn
e = Ridge(
    fit_intercept=False,
    alpha=l,
    max_iter=100,
    tol=1e-11)

e.fit(X_with_intercept, y)

print(e.coef_)
# =>[ 0.32155545  0.17904355  0.41222418]

spark.ml code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SQLContext

object TestSparkRegression {
  def main(args: Array[String]): Unit = {
    import org.apache.log4j.{Level, Logger}

    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("test").setMaster("local")
    val sc = new SparkContext(conf)

    val sparkTrainingData = new SQLContext(sc)
      .createDataFrame(Seq(
        LabeledPoint(0.0, Vectors.dense(-0.7306653538519616, 0.0)),
        LabeledPoint(1.0, Vectors.dense(0.6750417712898752, -0.4232874171873786)),
        LabeledPoint(1.0, Vectors.dense(0.1863463229359709, -0.8163423997075965)),
        LabeledPoint(0.0, Vectors.dense(-0.6719842051493347, 0.0)),
        LabeledPoint(1.0, Vectors.dense(0.9699938346531928, 0.0)),
        LabeledPoint(1.0, Vectors.dense(0.22759406190283604, 0.0)),
        LabeledPoint(1.0, Vectors.dense(0.9688721028330911, 0.0)),
        LabeledPoint(0.0, Vectors.dense(0.5993795346650845, 0.0)),
        LabeledPoint(0.0, Vectors.dense(0.9219423508390701, -0.8972778242305388)),
        LabeledPoint(0.0, Vectors.dense(0.7006904841584055, -0.5607635619919824))))
      .toDF("label", "features")

    val logisticModel = new LogisticRegression()
      .setRegParam(0.3)
      .setLabelCol("label")
      .setFeaturesCol("features")
      .fit(sparkTrainingData)

    println(s"Spark logistic model coefficients: ${logisticModel.coefficients} Intercept: ${logisticModel.intercept}")
    // Spark logistic model coefficients: [0.5451588538376263,0.26740606573584713] Intercept: -0.13897955358689987

    val linearModel = new LinearRegression()
      .setRegParam(0.3)
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setSolver("l-bfgs")
      .fit(sparkTrainingData)

    println(s"Spark linear model coefficients: ${linearModel.coefficients} Intercept: ${linearModel.intercept}")
    // Spark linear model coefficients: [0.19852664861346023,0.11501200541407802] Intercept: 0.45464906876832323

    sc.stop()
  }
}

Best answer

You need to do the following:

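First, standardize the data on both the Python side and the Spark side. Spark standardizes features internally by default. Take care to account for the difference in the standard deviation formula used by the standard scaler implementations of the two packages.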
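The difference in the standard deviation formula can be illustrated with a small sketch. It assumes what the two libraries document: scikit-learn's `StandardScaler` divides by the population standard deviation (denominator n), while Spark's `StandardScaler` divides by the corrected sample standard deviation (denominator n-1). The sketch uses only the Python standard library on the first feature column of the question's data. The variable names are illustrative and do not come from either library.

```python
import math
import statistics

# First feature column of the data set from the question.
x = [
    -0.7306653538519616, 0.6750417712898752, 0.1863463229359709,
    -0.6719842051493347, 0.9699938346531928, 0.22759406190283604,
    0.9688721028330911, 0.5993795346650845, 0.9219423508390701,
    0.7006904841584055,
]

n = len(x)
mean = statistics.mean(x)

# Population standard deviation: divides by n (assumed sklearn convention).
pop_std = statistics.pstdev(x)
# Sample standard deviation: divides by n - 1 (assumed Spark convention).
sample_std = statistics.stdev(x)

sklearn_style = [(v - mean) / pop_std for v in x]
spark_style = [(v - mean) / sample_std for v in x]

# Multiplying the sklearn-style values by sqrt((n - 1) / n) gives the
# Spark-style values. For n = 10 this is 3 / sqrt(10).
factor = math.sqrt((n - 1) / n)
rescaled = [v * factor for v in sklearn_style]

print(factor)
print(max(abs(a - b) for a, b in zip(rescaled, spark_style)))
```

This is the same correction as the `3.0/np.sqrt(10.0)` factor applied to the `StandardScaler` output in the answer's scikit-learn code.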

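For logistic regression, Spark uses the mean of the log loss (the denominator is the sum of the weights, which equals the number of training instances when all weights are 1), whereas sklearn uses the sum of the log loss. For linear regression, unlike sklearn, Spark uses a 1/2n factor in the sum-of-squared-errors term. Spark's regularization parameter therefore has to be scaled down accordingly: in this example (n = 10 training instances), by a factor of 1/10 for logistic regression and 1/20 for linear regression.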
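The logistic regression part of this argument can be checked without either library. The sketch below is a hand-written gradient descent, not the solver that sklearn or Spark actually uses. It minimizes a sum-of-log-loss objective with penalty (l/2)·||w||² and a mean-of-log-loss objective with penalty (l/n)/2·||w||², both with an unpenalized intercept, on the question's data after sample-standard-deviation scaling. The function names, step sizes, and iteration count are assumptions chosen for illustration.

```python
import math

# Data from the question.
X = [
    [-0.7306653538519616, 0.0],
    [0.6750417712898752, -0.4232874171873786],
    [0.1863463229359709, -0.8163423997075965],
    [-0.6719842051493347, 0.0],
    [0.9699938346531928, 0.0],
    [0.22759406190283604, 0.0],
    [0.9688721028330911, 0.0],
    [0.5993795346650845, 0.0],
    [0.9219423508390701, -0.8972778242305388],
    [0.7006904841584055, -0.5607635619919824],
]
y = [0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
n = len(X)
l = 0.3


def standardize(rows):
    """Center each column and divide by its sample (n - 1) standard deviation."""
    scaled_cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        scaled_cols.append([(v - mean) / std for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit(rows, labels, loss_scale, reg, step, iters=5000):
    """Gradient descent on loss_scale * sum(log loss) + (reg / 2) * ||w||^2.

    The intercept b is left unpenalized.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(iters):
        grad_w = [reg * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(rows, labels):
            err = loss_scale * (sigmoid(w[0] * xi[0] + w[1] * xi[1] + b) - yi)
            grad_w[0] += err * xi[0]
            grad_w[1] += err * xi[1]
            grad_b += err
        w = [wj - step * gj for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b


Xs = standardize(X)

# Sum of log losses with penalty strength l (the sklearn-style objective).
w_sum, b_sum = fit(Xs, y, loss_scale=1.0, reg=l, step=0.1)
# Mean of log losses with penalty strength l / n (the Spark-style objective).
w_mean, b_mean = fit(Xs, y, loss_scale=1.0 / n, reg=l / n, step=1.0)

print(w_sum, b_sum)
print(w_mean, b_mean)
```

The second objective is the first divided by n, so both have the same minimizer. That is why dividing the regularization strength by the number of instances (0.3/10 = 0.03) is needed on the Spark side. The printed values can be compared with the coefficients reported in the answer's code.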

Scikit-learn code

import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

X = np.array([
    [-0.7306653538519616, 0.0],
    [0.6750417712898752, -0.4232874171873786],
    [0.1863463229359709, -0.8163423997075965],
    [-0.6719842051493347, 0.0],
    [0.9699938346531928, 0.0],
    [0.22759406190283604, 0.0],
    [0.9688721028330911, 0.0],
    [0.5993795346650845, 0.0],
    [0.9219423508390701, -0.8972778242305388],
    [0.7006904841584055, -0.5607635619919824]
])

y = np.array([
    0.0,
    1.0,
    1.0,
    0.0,
    1.0,
    1.0,
    1.0,
    0.0,
    0.0,
    0.0
])

m, n = X.shape


from sklearn.preprocessing import StandardScaler

## sqrt(n-1)/sqrt(n) factor for getting the same standardization as spark
Xsc=StandardScaler().fit_transform(X)*3.0/np.sqrt(10.0)

l = 0.3
e = LogisticRegression(
    fit_intercept=True,
    penalty='l2',
    C=1/l,
    max_iter=100,
    tol=1e-11,
    solver='lbfgs',verbose=1)

e.fit(Xsc, y)

print(e.coef_, e.intercept_)
# => [[ 0.82122437 0.32615256]] [-0.01181534]

#e.get_params(deep=True)

# Linear regression is called Ridge in sklearn
e = Ridge(
    fit_intercept=True,
    alpha=l,
    max_iter=100,
    tol=1e-11)

e.fit(Xsc, y)

print(e.coef_, e.intercept_)
# =>[ 0.21310109 0.09203616] 0.5

Spark code (refactored to use the ML API)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.feature.StandardScaler

val sparkTrainingData_orig = new SQLContext(sc).
  createDataFrame(Seq(
    (0.0, Vectors.dense(Array(-0.7306653538519616, 0.0))),
    (1.0, Vectors.dense(Array(0.6750417712898752, -0.4232874171873786))),
    (1.0, Vectors.dense(Array(0.1863463229359709, -0.8163423997075965))),
    (0.0, Vectors.dense(Array(-0.6719842051493347, 0.0))),
    (1.0, Vectors.dense(Array(0.9699938346531928, 0.0))),
    (1.0, Vectors.dense(Array(0.22759406190283604, 0.0))),
    (1.0, Vectors.dense(Array(0.9688721028330911, 0.0))),
    (0.0, Vectors.dense(Array(0.5993795346650845, 0.0))),
    (0.0, Vectors.dense(Array(0.9219423508390701, -0.8972778242305388))),
    (0.0, Vectors.dense(Array(0.7006904841584055, -0.5607635619919824))))).
  toDF("label", "features_orig")

val sparkTrainingData=new StandardScaler().
  setWithMean(true).
  setInputCol("features_orig").
  setOutputCol("features").
  fit(sparkTrainingData_orig).
  transform(sparkTrainingData_orig)

//Make regularization 0.3/10=0.03
val logisticModel = new LogisticRegression().
  setRegParam(0.03).
  setLabelCol("label").
  setFeaturesCol("features").
  setTol(1e-12).
  setMaxIter(100).
  fit(sparkTrainingData)

println(s"Spark logistic model coefficients: ${logisticModel.coefficients} Intercept: ${logisticModel.intercept}")
// Spark logistic model coefficients: [0.8212244419577079,0.32615245441495727] Intercept: -0.011815325216668142

//Make regularization 0.3/20=0.015    
val linearModel = new LinearRegression().
  setRegParam(0.015).
  setLabelCol("label").
  setFeaturesCol("features").
  setTol(1e-12).
  setMaxIter(100).
  fit(sparkTrainingData)

println(s"Spark linear model coefficients: ${linearModel.coefficients} Intercept: ${linearModel.intercept}")
// Spark linear model coefficients: [0.21394341729353747,0.09257340293212045] Intercept: 0.5
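To see how closely the two libraries agree after these changes, the coefficients and intercepts reported in the output comments above can be compared directly. The sketch below only does arithmetic on those posted numbers. The helper name `max_abs_diff` is hypothetical.

```python
# Coefficients followed by the intercept, copied from the output comments above.
sklearn_logistic = [0.82122437, 0.32615256, -0.01181534]
spark_logistic = [0.8212244419577079, 0.32615245441495727, -0.011815325216668142]

sklearn_linear = [0.21310109, 0.09203616, 0.5]
spark_linear = [0.21394341729353747, 0.09257340293212045, 0.5]


def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two equal-length lists."""
    return max(abs(u - v) for u, v in zip(a, b))


logistic_gap = max_abs_diff(sklearn_logistic, spark_logistic)
linear_gap = max_abs_diff(sklearn_linear, spark_linear)

print("logistic regression gap:", logistic_gap)
print("linear regression gap:", linear_gap)
```

On the posted numbers, the logistic regression models agree to roughly seven decimal places, while the linear regression coefficients agree only to about three. This suggests that the 1/20 rescaling brings the linear models close together but may not be an exact correspondence.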

Regarding "apache-spark - Spark.ml regressions do not calculate the same models as scikit-learn", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42729431/
