apache-spark - Spark (OneHotEncoder + StringIndexer) = FeatureImportance how? - 6ren


Reposted by: 行者123 Updated: 2023-12-03 19:46:26


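When I use StringIndexer and OneHotEncoder to prepare the data for my matrix, how can I then tell the name/origin of the important features?

The RandomForestClassifier only gives me indices, and I cannot see the link back to the original data :-(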

The following code comes from here:
https://github.com/spark-in-action/first-edition/blob/master/ch08/python/ch08-listings.py

It is applied to this dataset:
https://github.com/spark-in-action/first-edition/blob/master/ch08/adult.names

I extracted this subset of the code:

$data.take(1)
[Row(age=39.0, occupation=u' State-gov', capital_gain=77516.0, education=u' Bachelors', marital_status=u' Never-married', workclass=u' Adm-clerical', relationship=u' Not-in-family', race=u' White', sex=u' Male', capital_loss=2174.0, fnlwgt=0.0, hours_per_week=40.0, native_country=u' United-States', income=u' <=50K')]

$data2 = indexStringColumns(data, typeString)
$data2.take(1)
>[Row(age=39.0, capital_gain=77516.0, capital_loss=2174.0, fnlwgt=0.0, hours_per_week=40.0, occupation=4.0, education=2.0, marital_status=1.0, workclass=3.0, relationship=1.0, race=0.0, sex=0.0, native_country=0.0, income=0.0)]

$data3 = oneHotEncodeColumns(data2, colString_without_Y)
$data3.take(1)
>[Row(age=39.0, capital_gain=77516.0, capital_loss=2174.0, fnlwgt=0.0, hours_per_week=40.0, income=0.0, occupation=SparseVector(9, {4: 1.0}), education=SparseVector(16, {2: 1.0}), marital_status=SparseVector(7, {1: 1.0}), workclass=SparseVector(15, {3: 1.0}), relationship=SparseVector(6, {1: 1.0}), race=SparseVector(5, {0: 1.0}), sex=SparseVector(2, {0: 1.0}), native_country=SparseVector(42, {0: 1.0}))]

$# modeling:

$rf          = RandomForestClassifier(labelCol=colY, numTrees=ntree, maxDepth=depth,)
$model       = rf.fit(trainingData)
$predictions = model.transform(testData)

$model.featureImportances
>SparseVector(107, {0: 0.1016, 1: 0.0302, 2: 0.0995, 3: 0.0207, 4: 0.0517, 5: 0.007, 6: 0.0061, 7: 0.0033, 8: 0.0021, 9: 0.0041, 10: 0.0058, 11: 0.0036, 12: 0.0001, 14: 0.0162, 15: 0.0067, 16: 0.0199, 17: 0.0134, 18: 0.0026, 19: 0.0059, 20: 0.0025, 21: 0.0038, 22: 0.0053, 23: 0.0064, 24: 0.003, 25: 0.0014, 26: 0.007, 27: 0.0023, 28: 0.001, 29: 0.0002, 30: 0.1473, 31: 0.0609, 32: 0.0057, 33: 0.0024, 34: 0.0019, 35: 0.001, 36: 0.0002, 37: 0.0258, 38: 0.0054, 39: 0.0244, 40: 0.0045, 41: 0.0055, 42: 0.0186, 43: 0.0061, 44: 0.0021, 45: 0.0043, 46: 0.0029, 47: 0.0046, 48: 0.0024, 49: 0.0019, 50: 0.0001, 51: 0.0, 52: 0.0786, 53: 0.0354, 54: 0.0169, 55: 0.0117, 56: 0.015, 57: 0.0026, 58: 0.0046, 59: 0.0064, 60: 0.0025, 61: 0.0014, 62: 0.0011, 63: 0.007, 64: 0.0312, 65: 0.0048, 66: 0.005, 67: 0.0022, 68: 0.0008, 69: 0.0008, 70: 0.0006, 71: 0.0006, 72: 0.0003, 73: 0.0013, 74: 0.0006, 75: 0.0012, 76: 0.0004, 77: 0.0003, 78: 0.0002, 79: 0.0005, 80: 0.0002, 81: 0.0003, 82: 0.0002, 83: 0.0003, 84: 0.0004, 85: 0.0002, 86: 0.0001, 87: 0.0003, 88: 0.0004, 89: 0.0001, 90: 0.0, 91: 0.0005, 93: 0.0004, 94: 0.0002, 95: 0.0003, 96: 0.0, 97: 0.0001, 98: 0.0001, 99: 0.0001, 100: 0.0, 101: 0.0, 102: 0.0, 103: 0.0, 104: 0.0, 105: 0.0002})

How can I tell which categorical value in the original data matrix each index maps back to?

Best answer

StringIndexerModel.labels is what you need.

For example:

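from pyspark.ml.feature import StringIndexer
from pyspark.sql import Row, SparkSession

# Create (or reuse) a session; the original snippet assumed a shell with `sc` defined.
spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame([
    Row(v="A"),
    Row(v="B"),
])

# labels[i] is the original string that was encoded as index i
labels = StringIndexer(inputCol="v", outputCol="indexed").fit(data).labels

for idx, v in enumerate(labels):
    print(idx, v)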
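To make the meaning of the labels list concrete without a Spark session, here is a plain-Python emulation. It is only a sketch, not Spark's implementation: it assumes StringIndexer's default ordering (most frequent value gets index 0), and the helper names `fit_labels` and `index_of` are made up for illustration. The sample column deliberately has no frequency ties, since the sketch does not try to reproduce how Spark breaks them.

```python
from collections import Counter


def fit_labels(values):
    """Emulate the label list: distinct values, most frequent first."""
    counts = Counter(values)
    # Secondary alphabetical key only keeps this sketch deterministic.
    return sorted(counts, key=lambda s: (-counts[s], s))


def index_of(labels):
    """Invert the label list into a value -> index lookup."""
    return {label: idx for idx, label in enumerate(labels)}


# Hypothetical column values, not taken from the question's dataset.
column = ["B", "A", "B", "C", "B", "A"]
labels = fit_labels(column)
lookup = index_of(labels)

for idx, v in enumerate(labels):
    print(idx, v)
```

Reading the list position by position is all that is needed to go from an index back to the original string.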

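OneHotEncoder is not much of a problem here, because it only turns the numeric index into a position in the output vector. Note, however, that: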

The last category is not included by default (configurable via dropLast)

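So you need to make sure that the values and the indices line up: with dropLast enabled, the last label has no slot of its own in the encoded vector.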
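The alignment can be sketched in plain Python as follows. This is an illustration under assumptions, not the asker's actual pipeline: it assumes the feature vector is assembled as the numeric columns first, followed by one one-hot block per categorical column, in a known order. The functions `expand_feature_names` and `rank_importances` are hypothetical helpers, and the toy labels are not taken from the dataset. In the question, 5 numeric columns plus one-hot blocks of sizes 9, 16, 7, 15, 6, 5, 2 and 42 add up to 107, the length of the featureImportances vector, which suggests that no category was dropped there; the slot order still depends on how the columns were assembled, which the excerpt does not show.

```python
def expand_feature_names(numeric_cols, categorical_labels, drop_last=True):
    """Return one name per slot of the assembled feature vector.

    numeric_cols: column names that each occupy a single slot.
    categorical_labels: (column, labels) pairs in assembly order, where
        labels is the list read from StringIndexerModel.labels.
    drop_last: mirrors the OneHotEncoder dropLast setting.
    """
    names = list(numeric_cols)
    for col, labels in categorical_labels:
        # With dropLast, the last label gets no slot of its own.
        kept = labels[:-1] if drop_last else labels
        names.extend("{}={}".format(col, label) for label in kept)
    return names


def rank_importances(importances, names):
    """Pair each (index, weight) with its name, highest weight first."""
    pairs = [(names[i], w) for i, w in importances.items()]
    return sorted(pairs, key=lambda p: p[1], reverse=True)


# Hypothetical toy data: two numeric columns and two categorical ones.
names = expand_feature_names(
    ["age", "hours_per_week"],
    [("sex", ["Male", "Female"]), ("race", ["White", "Black", "Other"])],
    drop_last=False,
)
importances = {0: 0.5, 2: 0.1, 4: 0.4}  # index -> importance
ranked = rank_importances(importances, names)
print(ranked)
```

With a real model, the index-to-weight mapping could be built from the SparseVector returned by featureImportances, for example `dict(zip(fi.indices, fi.values))`, and the label lists would come from each fitted StringIndexerModel.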

Regarding "apache-spark - Spark (OneHotEncoder + StringIndexer) = FeatureImportance how?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39289478/


