
apache-spark - Spark RandomForest classifier numClasses

Reposted. Author: 行者123. Updated: 2023-12-04 04:17:22

I train a RandomForest like this (Spark 1.6.0):

import org.apache.spark.mllib.tree.RandomForest

val numClasses = 4 // labels in the data are 0-2
val categoricalFeaturesInfo = Map[Int, Int]() // empty map: all features are treated as continuous
val numTrees = 9
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 6
val maxBins = 32

val model = RandomForest.trainClassifier(trainRDD, numClasses,
  categoricalFeaturesInfo, numTrees,
  featureSubsetStrategy, impurity,
  maxDepth, maxBins)

Input labels:

labels = labeledRDD.map(lambda lp: lp.label).distinct().collect()
for label in sorted(labels):
    print(label)

0.0
1.0
2.0
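MLlib's `trainClassifier` expects labels to take values in {0, 1, ..., numClasses-1}, so the three labels listed above would fit with `numClasses = 3` as well as with the 4 used in the training call. A minimal sketch of deriving that value from the distinct labels follows; the helper name `infer_num_classes` is hypothetical and not part of Spark.

```python
def infer_num_classes(labels):
    """Return the smallest numClasses that covers the given labels.

    Assumes labels are non-negative whole numbers stored as floats,
    as in the listing above (0.0, 1.0, 2.0).
    """
    distinct = sorted(set(labels))
    for value in distinct:
        if value < 0 or value != int(value):
            raise ValueError("label %r is not a non-negative integer" % value)
    # Labels run from 0 to the largest label, inclusive.
    return int(distinct[-1]) + 1


labels = [0.0, 1.0, 2.0]  # the distinct labels collected from labeledRDD
num_classes = infer_num_classes(labels)
print(num_classes)
```

In the pyspark session, the `labels` list collected above could be passed in directly.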

But the output contains only two classes:

metrics = MulticlassMetrics(labelsAndPredictions)
df_confusion = metrics.confusionMatrix()
display_cm(df_confusion)

Output:

83017.0    81.0   0.0
 8703.0  2609.0   0.0
10232.0   255.0   0.0
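To make the symptom concrete, the matrix above can be summarised per class. This is a small sketch in plain Python, assuming the usual `MulticlassMetrics` layout (rows are actual classes, columns are predicted classes, both ordered by label); the helper names are illustrative only.

```python
# Confusion matrix copied from the output above.
# Assumption: rows = actual class, columns = predicted class.
confusion = [
    [83017.0, 81.0, 0.0],
    [8703.0, 2609.0, 0.0],
    [10232.0, 255.0, 0.0],
]


def predicted_counts(matrix):
    """Column sums: how many rows the model assigned to each class."""
    return [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]


def per_class_recall(matrix):
    """Diagonal entry divided by the row sum, for each actual class."""
    recalls = []
    for i, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls


print(predicted_counts(confusion))
print(per_class_recall(confusion))
```

Under that assumption, the third column sum is zero: the model never predicts label 2.0, although 10487 rows actually carry it.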

Output when I load the same model in pyspark and run it against other data (a subset of the above):

DenseMatrix([[  1.75280000e+04,   3.26000000e+02],
[ 3.00000000e+00, 1.27400000e+03]])

Best answer

It got better... I used Pearson correlation to find out which columns had no correlation. After dropping the ten least-correlated columns, I now get good results:


Test Error = 0.0401823
precision = 0.959818
Recall = 0.959818

ConfusionMatrix([[ 17323., 0., 359.],
[ 0., 1430., 92.],
[ 208., 170., 1049.]])
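The reported figures can be cross-checked against this confusion matrix. The sketch below computes overall accuracy as the diagonal sum over the total count. It assumes the reported precision and recall are overall (micro-averaged) values, which would explain why they are identical, since both then equal accuracy.

```python
# Confusion matrix copied from the answer above.
confusion = [
    [17323.0, 0.0, 359.0],
    [0.0, 1430.0, 92.0],
    [208.0, 170.0, 1049.0],
]


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


acc = accuracy(confusion)
print("accuracy   = %s" % acc)
print("test error = %s" % (1.0 - acc))
```

The result agrees with the reported precision, recall, and test error to within rounding.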

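The answer does not show how the columns were chosen. One possible reading is sketched below in plain Python: score each feature column by the absolute Pearson correlation with the label and drop the `k` weakest. Correlating against the label (rather than between features) is an assumption, and the function names and toy data are hypothetical. On an RDD, `pyspark.mllib.stat.Statistics.corr` could compute the coefficients instead.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def weakest_columns(rows, labels, k):
    """Indices of the k columns with the smallest |correlation| to the label."""
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    scores.sort()
    return sorted(j for _, j in scores[:k])


def drop_columns(rows, drop):
    """Return the rows with the given column indices removed."""
    drop = set(drop)
    return [[v for j, v in enumerate(row) if j not in drop] for row in rows]


# Toy data, invented for illustration: column 1 is constant.
rows = [
    [1.0, 5.0, 0.3],
    [2.0, 5.0, 0.1],
    [3.0, 5.0, 0.4],
    [4.0, 5.0, 0.2],
]
labels = [0.0, 0.0, 1.0, 1.0]

to_drop = weakest_columns(rows, labels, 1)
print(to_drop)
print(drop_columns(rows, to_drop))
```

With the real data, `k` would be 10 to match the answer, and the pruned rows would be used to retrain the forest.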

Regarding apache-spark - Spark RandomForest classifier numClasses, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/36483140/
