apache-spark - Does CrossValidator in PySpark distribute execution?

Reposted. Author: 行者123. Updated: 2023-11-30 08:34:48

I'm doing machine learning in PySpark with RandomForestClassifier; until now I had been using scikit-learn. I'm using CrossValidator to tune parameters and obtain the best model. Below is the sample code taken from the Spark website.

From what I've read, I can't tell whether Spark also distributes the parameter tuning itself, or whether it behaves the same way as scikit-learn's GridSearchCV.

Any help would be much appreciated.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Prepare training documents, which are labeled.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

Best Answer

Spark 2.3+

SPARK-21911 added parallel model fitting. The level of parallelism is controlled by the parallelism Param.

Spark <2.3

No, it does not. Cross-validation is implemented as a plain nested for loop:

for i in range(nFolds):
    ...
    for j in range(numModels):
        ...

Only the training of the individual models is distributed.
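The serial structure can be sketched in plain Python with stand-in objects; the `MeanEstimator` and evaluation function below are illustrative mocks, not Spark's actual classes:

```python
# Simplified sketch of the pre-2.3 CrossValidator loop (driver-side, serial).
class MeanEstimator:
    def fit(self, data, param):
        # The "model" is just the training mean shifted by the param.
        return sum(data) / len(data) + param

def evaluate(model, data):
    # Negative squared error: higher is better.
    return -sum((x - model) ** 2 for x in data)

def cross_validate(est, params, folds):
    metrics = [0.0] * len(params)
    for train, valid in folds:                # serial loop over folds
        for j, p in enumerate(params):        # serial loop over the param grid
            model = est.fit(train, p)         # in Spark, only this step is distributed
            metrics[j] += evaluate(model, valid)
    return [m / len(folds) for m in metrics]

folds = [([1, 2, 3], [2]), ([2, 3, 4], [3])]
scores = cross_validate(MeanEstimator(), [0.0, 10.0], folds)
print(scores)
```

The two loops themselves run sequentially on the driver; only each `fit` call fans out to the cluster, which is why the grid search as a whole is not parallelized before Spark 2.3.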

This answer comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/45806369/
