
pyspark - How to map the coefficients obtained from a logistic regression model to feature names in PySpark

Reposted. Author: 行者123. Updated: 2023-12-05 06:26:36

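I built a logistic regression model using a pipeline flow similar to the one described in the Databricks documentation: https://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html

The features (both numeric and string features) are encoded with OneHotEncoderEstimator and then transformed with a standard scaler.

I would like to know how to map the weights (coefficients) obtained from the logistic regression back to the feature names in the original DataFrame.

In other words, given the weights or coefficients of the model, how do I find the feature each one belongs to?

Thanks.

I tried extracting the features from lrModel.schema, which gives a list of StructField objects describing the features.

I tried to extract the features from the schema and map them to the weights, but without success.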

from pyspark.ml.classification import LogisticRegression

# Create initial LogisticRegression model
lr = LogisticRegression(labelCol="label", featuresCol="scaledFeatures", maxIter=10)

# Train model with Training Data

lrModel = lr.fit(trainingData)

predictions = lrModel.transform(trainingData)

LRschema = predictions.schema

Expected result: a list of tuples (feature weight, feature name).

Best answer

This is not a direct output of LogisticRegression, but you can get it with the following function, which I use:

from pyspark.sql import SparkSession

def ExtractFeatureCoeficient(model, dataset, excludedCols=None):
    """Pair each model coefficient with a column name of the transformed dataset."""
    spark = SparkSession.builder.getOrCreate()
    test = model.transform(dataset)
    weights = model.coefficients
    print('This is model weights: \n', weights)
    # Convert NumPy floats to Python floats, each wrapped in a 1-tuple
    weights = [(float(w),) for w in weights]
    if excludedCols is None:
        # Default: drop label, weight, vector, and prediction columns
        excludedCols = ['y', 'classWeights', 'features', 'label', 'rawPrediction', 'probability', 'prediction']
    feature_col = [f for f in test.schema.names if f not in excludedCols]
    if len(weights) != len(feature_col):
        # Fail early instead of returning an undefined DataFrame
        raise ValueError('Coefficients do not match the remaining features in the model, please check the field list with model.transform(dataset).schema.names')
    return spark.createDataFrame(list(zip(weights, feature_col)), schema=["Coeficients", "FeatureName"])

results = ExtractFeatureCoeficient(lrModel, trainingData)

results.show()

This generates a Spark DataFrame with the following fields:

+--------------------+--------------------+
| Coeficients| FeatureName|
+--------------------+--------------------+
|[0.15834847825223...| name |
| [0.0]| lat |
+--------------------+--------------------+
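
Note that the function above matches coefficients to DataFrame column names, so the lengths only agree when every column contributes exactly one coefficient. With one-hot encoded features, as in the question, a single column expands into several vector slots. In that case the slot names can usually be read from the ML attribute metadata that VectorAssembler attaches to its output column. The following is a minimal sketch under that assumption: the helper names and the example metadata are hypothetical, and the dict layout (`ml_attr` -> `attrs` -> lists of `idx`/`name` entries) is assumed to be what `df.schema["features"].metadata` returns.

```python
def feature_names_from_metadata(metadata):
    """Return vector-slot names ordered by slot index.

    `metadata` is assumed to be the dict found at
    df.schema["features"].metadata for an assembled vector column.
    """
    attrs = metadata["ml_attr"]["attrs"]
    # Attributes are grouped by type (e.g. "numeric", "binary", "nominal");
    # flatten the groups and sort by the slot index each entry carries.
    flat = [a for group in attrs.values() for a in group]
    return [a["name"] for a in sorted(flat, key=lambda a: a["idx"])]


def pair_coefficients(coefficients, metadata):
    """Zip model coefficients with slot names as (weight, name) tuples."""
    names = feature_names_from_metadata(metadata)
    weights = [float(w) for w in coefficients]
    if len(weights) != len(names):
        raise ValueError("number of coefficients does not match number of vector slots")
    return list(zip(weights, names))


# Hypothetical metadata for one numeric column and one one-hot encoded column
example_metadata = {
    "ml_attr": {
        "attrs": {
            "binary": [
                {"idx": 1, "name": "workclass_Private"},
                {"idx": 2, "name": "workclass_Self-emp"},
            ],
            "numeric": [{"idx": 0, "name": "age"}],
        },
        "num_attrs": 3,
    }
}

example_pairs = pair_coefficients([0.158, 0.0, -0.42], example_metadata)
print(example_pairs)
```

With a fitted model this would be called along the lines of `pair_coefficients(lrModel.coefficients.toArray(), predictions.schema["features"].metadata)`. The scaled column may not carry the attribute names, so the sketch reads them from the assembler's output column; scaling does not change the slot order.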

Or you can fit a GLM (generalized linear model) as follows:

from pyspark.ml.regression import GeneralizedLinearRegression

glr = GeneralizedLinearRegression(family="binomial", link="logit", featuresCol="features", labelCol="label", maxIter=1000, regParam=0.8, weightCol="classWeights")

# Train the model
glmModel = glr.fit(trainingData)

# Then get the summary of the fitted model
summary = glmModel.summary
print(summary)

This produces the following output:

Coefficients:
    Feature       Estimate  Std Error   T Value  P Value
    (Intercept)    -1.3079     0.0705  -18.5549   0.0000
    name            0.1248     0.0158    7.9129   0.0000
    lat             0.0239     0.0209    1.1455   0.2520
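
If the same numbers are needed as data rather than printed text, the summary also exposes them as arrays (`coefficientStandardErrors`, `pValues`), with the intercept reported last when an intercept is fitted. A rough sketch of assembling them into rows follows. The helper `summary_rows` is hypothetical, it takes plain lists so it runs without Spark, and the example values are simply the ones from the output above.

```python
def summary_rows(names, coefficients, intercept, std_errors, p_values):
    """Build (feature, estimate, std_error, p_value) rows.

    Assumes std_errors and p_values list the intercept LAST, which is how
    the GLM training summary is expected to report them when an intercept
    is fitted.
    """
    labels = list(names) + ["(Intercept)"]
    estimates = [float(c) for c in coefficients] + [float(intercept)]
    if not (len(labels) == len(std_errors) == len(p_values)):
        raise ValueError("summary arrays do not match number of features + intercept")
    rows = zip(labels, estimates, std_errors, p_values)
    return [(n, e, float(s), float(p)) for n, e, s, p in rows]


# Values copied from the printed summary above
rows = summary_rows(
    names=["name", "lat"],
    coefficients=[0.1248, 0.0239],
    intercept=-1.3079,
    std_errors=[0.0158, 0.0209, 0.0705],
    p_values=[0.0, 0.2520, 0.0],
)
for row in rows:
    print(row)
```

With a fitted model, the arguments would come from `glmModel.coefficients`, `glmModel.intercept`, `glmModel.summary.coefficientStandardErrors`, and `glmModel.summary.pValues`; the feature names still have to be supplied separately.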

Regarding "pyspark - How to map the coefficients obtained from a logistic regression model to feature names in PySpark", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55971296/
