python - Integrating Hyperopt with Spark MLlib

Reposted. Author: 行者123. Updated: 2023-12-04 04:17:46

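Does anyone have a good example of integrating Hyperopt with Spark's MLlib? I have been trying to do this on Databricks and keep running into the same error. I am not sure whether the problem is in my objective function, or whether it has to do with Spark ML on PySpark and how it connects to Databricks.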

import itertools
from pyspark.sql import functions as f
from pyspark.sql import DataFrame
from pyspark.sql.types import *

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import OneHotEncoder, Imputer, VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression, GBTClassifier
from pyspark.ml.classification import GBTClassificationModel
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, CrossValidatorModel
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import numpy as np
from itertools import product
from hyperopt import fmin, hp, tpe, STATUS_OK, SparkTrials
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

search_space = {
    'maxDepth': hp.choice("maxDepth", np.arange(3, 8, dtype=int)),
    'maxIter': hp.uniform("maxIter", 200, 800),
    'featureSubsetStrategy': str(hp.quniform("featureSubsetStrategy", .5, 1, .1)),
    'minInstancesPerNode': hp.uniform("min_child_weight", 1, 10),
    'stepSize': hp.loguniform('stepSize', np.log(0.01), np.log(0.1)),
    'subsamplingRate': hp.quniform("featureSubsetStrategy", .5, 1, .1)
}
evaluator = BinaryClassificationEvaluator(labelCol="positive")

def train(params):
    gbtModel = GBTClassifier(labelCol="positive", featuresCol="features").fit(train)
    predictions_val = gbtModel.predict(val.map(lambda x: x.features))
    labelsAndPredictions = val.map(lambda lp: lp.label).zip(predictions_val)
    ROC = evaluator.evaluate(predictions_val, {evaluator.metricName: "areaUnderROC"})

    return {'ROC': ROC, 'status': STATUS_OK}



N_HYPEROPT_PROBES = 1000  # can increase, keep small for testing
EARLY_STOPPING = 50
HYPEROPT_ALGO = tpe.suggest
NB_CV_FOLDS = 5  # for testing, can increase

obj_call_count = 0
cur_best_score = 1000000
spark_trials = SparkTrials(parallelism=4)
best = fmin(fn=train,
            space=search_space,
            algo=HYPEROPT_ALGO,
            max_evals=N_HYPEROPT_PROBES,
            trials=spark_trials,
            verbose=1)

Running this produces the following error:

Total Trials: 0: 0 succeeded, 0 failed, 0 cancelled. py4j.Py4JException: Method __getstate__([]) does not exist

Best answer

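Not sure if this is too late, but SparkTrials only works with single-machine ML models, such as those in the scikit-learn library. For Spark MLlib, you should use Trials (you do not need to pass the trials argument to the fmin function).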

You can find more details here: http://hyperopt.github.io/hyperopt/scaleout/spark/

Since SparkTrials fits and evaluates each model on one Spark worker, it is limited to tuning single-machine ML models and workflows, such as scikit-learn or single-machine TensorFlow. For distributed ML algorithms such as Apache Spark MLlib or Horovod, you can use Hyperopt’s default Trials class.

Regarding python - Integrating Hyperopt with Spark MLlib, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/60213506/
