java - Spark StringIndexer returns an empty dataset

Reposted. Author: 行者123. Updated: 2023-12-01 19:07:45

An Apache Spark StringIndexerModel returns an empty dataset after transforming one particular column. I am using the Adult dataset: http://mlr.cs.umass.edu/ml/datasets/Adult

Step 1: Create the StringIndexerModel and save it locally

StringIndexerModel model = new StringIndexer()
    .setInputCol(column)
    .setOutputCol("label")
    .setHandleInvalid("skip")
    .setStringOrderType("alphabetAsc")
    .fit(originalDataset);
model.write().save(filelocation);

Step 2: Load the indexer model and transform the new dataset

StringIndexerModel model = StringIndexerModel.read().load(filelocation);
newDataset = model.transform(newDataset).drop(column).withColumnRenamed("label", column);

New dataset:

+---+------------+------------+----------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+
|age|capital gain|capital loss|education |education num|fnlgwt|hours per week|marital status |native country|occupation |race |relationship |sex |workclass |
+---+------------+------------+----------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+
|39 |2174 |0 | Bachelors|13 |77516 |40 | Never-married | United-States| Adm-clerical |White| Not-in-family|Male| State-gov |
|50 |0 |0 | Bachelors|13 |83311 |13 | Married-civ-spouse| United-States| Exec-managerial|White| Husband |Male| Self-emp-not-inc|
+---+------------+------------+----------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+

Correct output:

Column: education | File Location: localFolder/stringIndex/education
Labels: [ 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, Bachelors, Doctorate, HS-grad, Masters, Preschool, Prof-school, Some-college]
+---+------------+------------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+---------+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|marital status |native country|occupation |race |relationship |sex |workclass |education|
+---+------------+------------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+---------+
|39 |2174 |0 |13 |77516 |40 | Never-married | United-States| Adm-clerical |White| Not-in-family|Male| State-gov |9.0 |
|50 |0 |0 |13 |83311 |13 | Married-civ-spouse| United-States| Exec-managerial|White| Husband |Male| Self-emp-not-inc|9.0 |
+---+------------+------------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+---------+

Column: marital status | File Location: localFolder/stringIndex/marital status
Labels: [ Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed]
+---+------------+------------+-------------+------+--------------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|native country|occupation |race |relationship |sex |workclass |education|marital status|
+---+------------+------------+-------------+------+--------------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+
|39 |2174 |0 |13 |77516 |40 | United-States| Adm-clerical |White| Not-in-family|Male| State-gov |9.0 |4.0 |
|50 |0 |0 |13 |83311 |13 | United-States| Exec-managerial|White| Husband |Male| Self-emp-not-inc|9.0 |2.0 |
+---+------------+------------+-------------+------+--------------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+

Column: native country | File Location: localFolder/stringIndex/native country
Labels: [ ?, Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam, Yugoslavia]
+---+------------+------------+-------------+------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+--------------+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|occupation |race |relationship |sex |workclass |education|marital status|native country|
+---+------------+------------+-------------+------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+--------------+
|39 |2174 |0 |13 |77516 |40 | Adm-clerical |White| Not-in-family|Male| State-gov |9.0 |4.0 |39.0 |
|50 |0 |0 |13 |83311 |13 | Exec-managerial|White| Husband |Male| Self-emp-not-inc|9.0 |2.0 |39.0 |
+---+------------+------------+-------------+------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+--------------+

Column: occupation | File Location: localFolder/stringIndex/occupation
Labels: [ ?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving]
+---+------------+------------+-------------+------+--------------+-----+--------------+----+-----------------+---------+--------------+--------------+----------+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|race |relationship |sex |workclass |education|marital status|native country|occupation|
+---+------------+------------+-------------+------+--------------+-----+--------------+----+-----------------+---------+--------------+--------------+----------+
|39 |2174 |0 |13 |77516 |40 |White| Not-in-family|Male| State-gov |9.0 |4.0 |39.0 |1.0 |
|50 |0 |0 |13 |83311 |13 |White| Husband |Male| Self-emp-not-inc|9.0 |2.0 |39.0 |4.0 |
+---+------------+------------+-------------+------+--------------+-----+--------------+----+-----------------+---------+--------------+--------------+----------+

Incorrect output (every model except this one works fine):

Column: race | File Location: localFolder/stringIndex/race
Labels: [ Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, White]
+---+------------+------------+-------------+------+--------------+------------+---+---------+---------+--------------+--------------+----------+----+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|relationship|sex|workclass|education|marital status|native country|occupation|race|
+---+------------+------------+-------------+------+--------------+------------+---+---------+---------+--------------+--------------+----------+----+
+---+------------+------------+-------------+------+--------------+------------+---+---------+---------+--------------+--------------+----------+----+

I would appreciate any help with this issue. Thanks!

Accepted answer

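It turned out that the data in the new dataset was incorrect: the values should have a leading space. The model was fitted on labels such as ' White', and with setHandleInvalid("skip") any row whose value is not among the fitted labels is dropped, so both rows were filtered out.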
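The effect can be reproduced without Spark. The sketch below is a plain-Java stand-in for the lookup, not Spark's actual implementation: the class name SkipLookupSketch and the method indexWithSkip are hypothetical, and it assumes that every fitted label for the race column carries a leading space, as the answer implies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the label lookup; this is NOT Spark's implementation.
final class SkipLookupSketch {

    // Returns the index of each value that exactly matches a label.
    // Values with no match are dropped, mimicking handleInvalid = "skip".
    static List<Double> indexWithSkip(List<String> labels, List<String> values) {
        List<Double> out = new ArrayList<>();
        for (String v : values) {
            int idx = labels.indexOf(v); // exact, whitespace-sensitive match
            if (idx >= 0) {
                out.add((double) idx);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed labels of the race model: each one has a leading space.
        List<String> labels = Arrays.asList(
            " Amer-Indian-Eskimo", " Asian-Pac-Islander", " Black", " Other", " White");

        // Values as they appear in the new dataset (no leading space): all skipped.
        System.out.println(indexWithSkip(labels, Arrays.asList("White", "White")));   // []
        // Values with the leading space restored: both rows survive.
        System.out.println(indexWithSkip(labels, Arrays.asList(" White", " White"))); // [4.0, 4.0]
    }
}
```

Because "skip" drops rows silently, a mismatch like this shows up only as missing rows. Comparing the distinct values of the column against the model's labels before transforming is one way to catch it earlier.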

Adding the space, i.e. using ' White' instead of 'White', gave me the correct output.
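An alternative to padding the new data is to normalize whitespace on both sides, so that the fitted labels and the incoming values are compared in trimmed form. The following is a minimal plain-Java sketch of that idea under stated assumptions: TrimmedLabelLookup, buildIndex and lookup are hypothetical names, and the label list is assumed from the output above. In Spark itself this would probably mean applying org.apache.spark.sql.functions.trim to the column before both fit and transform, which requires refitting the saved model; that variant is not exercised here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating whitespace-insensitive label lookup.
final class TrimmedLabelLookup {

    // Builds a label -> index map keyed on the trimmed label text.
    // Labels that differ only in surrounding whitespace would collide;
    // the later one wins in this sketch.
    static Map<String, Double> buildIndex(List<String> labels) {
        Map<String, Double> index = new LinkedHashMap<>();
        for (int i = 0; i < labels.size(); i++) {
            index.put(labels.get(i).trim(), (double) i);
        }
        return index;
    }

    // Trims the value before the lookup; returns null when the value is unseen.
    static Double lookup(Map<String, Double> index, String value) {
        return index.get(value.trim());
    }

    public static void main(String[] args) {
        Map<String, Double> index = buildIndex(Arrays.asList(
            " Amer-Indian-Eskimo", " Asian-Pac-Islander", " Black", " Other", " White"));

        System.out.println(lookup(index, "White"));   // 4.0
        System.out.println(lookup(index, " White"));  // 4.0
        System.out.println(lookup(index, "Unknown")); // null
    }
}
```

With this approach both 'White' and ' White' resolve to the same index, so the result no longer depends on how the source file pads its fields.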

Source: a similar question about "java - Spark StringIndexer returns an empty dataset" was found on Stack Overflow: https://stackoverflow.com/questions/59518208/
