
pyspark randomForest feature importances: how to get column names from the column numbers

Reposted · Author: 行者123 · Updated: 2023-12-04 12:39:40

I am using a standard (StringIndexer + OneHotEncoder + RandomForest) pipeline in Spark, as shown below:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, IndexToString
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator

labelIndexer = StringIndexer(inputCol=class_label_name, outputCol="indexedLabel").fit(data)

string_feature_indexers = [
    StringIndexer(inputCol=x, outputCol="int_{0}".format(x)).fit(data)
    for x in char_col_toUse_names
]

onehot_encoder = [
    OneHotEncoder(inputCol="int_" + x, outputCol="onehot_{0}".format(x))
    for x in char_col_toUse_names
]

all_columns = num_col_toUse_names + bool_col_toUse_names + ["onehot_" + x for x in char_col_toUse_names]
assembler = VectorAssembler(inputCols=all_columns, outputCol="features")
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features", numTrees=100)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)
pipeline = Pipeline(stages=[labelIndexer] + string_feature_indexers + onehot_encoder + [assembler, rf, labelConverter])

# paramGrid and evaluator are defined elsewhere
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
cvModel = crossval.fit(trainingData)

Now, after fitting, I can get the random forest and its feature importances using cvModel.bestModel.stages[-2].featureImportances, but this does not give me the feature/column names, only feature indices.

What I get is the following:

print(cvModel.bestModel.stages[-2].featureImportances)

(1446,[3,4,9,18,20,103,766,981,983,1098,1121,1134,1148,1227,1288,1345,1436,1444],[0.109898803421,0.0967396441648,4.24568235244e-05,0.0369705839109,0.0163489685127,3.2286694534e-06,0.0208192703688,0.0815822887175,0.0466903663708,0.0227619959989,0.0850922269211,0.000113388896956,0.0924779490403,0.163835022713,0.118987129392,0.107373548367,3.35577640585e-05,0.000229569946193])
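For reference, featureImportances is a SparseVector, and its printout reads as (size, [indices], [values]). The sketch below, with made-up numbers, shows how such a triple expands into a dense list of per-feature importances:

```python
# A SparseVector prints as (size, [indices], [values]); expand that triple
# into a dense list where position i holds the importance of feature i.
size = 5
indices = [1, 3]
values = [0.25, 0.75]

dense = [0.0] * size
for i, v in zip(indices, values):
    dense[i] = v
print(dense)  # [0.0, 0.25, 0.0, 0.75, 0.0]
```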

How can I map this back to column names, or to a column name + value format?
Essentially, I want the random forest's feature importances together with the column names.

Best Answer

The transformed dataset's metadata has the attributes you need. Here is a simple approach:

  • Create a Pandas DataFrame (the feature list is usually not large, so there is no memory problem holding it in a Pandas DF):
    pandasDF = pd.DataFrame(dataset.schema["features"].metadata["ml_attr"]["attrs"]["binary"]
                            + dataset.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]).sort_values("idx")
  • Then create a broadcast dictionary for the mapping. Broadcasting is necessary in a distributed environment:
    feature_dict = dict(zip(pandasDF["idx"], pandasDF["name"]))

    feature_dict_broad = sc.broadcast(feature_dict)

  • You can also take a look here and here.
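Putting the answer's steps together end to end, here is a minimal Spark-free sketch of the mapping logic. The ml_attr dict below is hypothetical and only mimics the shape of dataset.schema["features"].metadata["ml_attr"], and the indices/values lists stand in for the sparse featureImportances vector:

```python
import pandas as pd

# Hypothetical metadata, shaped like dataset.schema["features"].metadata["ml_attr"]
ml_attr = {
    "attrs": {
        "numeric": [{"idx": 0, "name": "age"}, {"idx": 1, "name": "income"}],
        "binary": [{"idx": 2, "name": "onehot_city_NY"}],
    }
}

def feature_names_from_metadata(ml_attr):
    """Flatten the binary and numeric attribute lists into an idx -> name dict."""
    attrs = ml_attr["attrs"]
    entries = attrs.get("binary", []) + attrs.get("numeric", [])
    pandasDF = pd.DataFrame(entries).sort_values("idx")
    return dict(zip(pandasDF["idx"], pandasDF["name"]))

# Stand-ins for rfModel.featureImportances.indices / .values
indices = [0, 2]
values = [0.7, 0.3]

feature_dict = feature_names_from_metadata(ml_attr)
named_importances = {feature_dict[i]: v for i, v in zip(indices, values)}
print(named_importances)  # {'age': 0.7, 'onehot_city_NY': 0.3}
```

In a real cluster you would broadcast feature_dict with sc.broadcast as shown above before using it inside distributed transformations.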

    Regarding "pyspark randomForest feature importances: how to get column names from the column numbers", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45024192/
