python - Issue with encoding non-numeric features to numeric in Spark and IPython


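I am working on a problem where I have to predict numeric data (monthly employee spending) from non-numeric features. I am using the Random Forests algorithm from Spark MLlib. My features data is in a dataframe that looks like this: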

       _1      _2             _3           _4
0  Level1    Male       New York     New York
1  Level1    Male  San Fransisco   California
2  Level2    Male       New York     New York
3  Level1    Male       Columbus         Ohio
4  Level3    Male       New York     New York
5  Level4    Male       Columbus         Ohio
6  Level5  Female       Stamford  Connecticut
7  Level1  Female  San Fransisco   California
8  Level3    Male       Stamford  Connecticut
9  Level6  Female       Columbus         Ohio

The columns here are employee level, gender, city, and state. These are my features, which I want to use to predict each employee's monthly spending (the label, in dollars).

The training label set looks like this:

3528
4958
4958
1652
4958
6528
4958
4958
5528
7000

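The features are in non-numeric form, so I need to encode them as numeric values. To do that, I followed this link on encoding categorical data as numbers, and wrote the following code (following the process described in the linked article):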

import numpy as np
from sklearn.feature_extraction import DictVectorizer as DV
import pandas as pd
def extract(line):
    return (line[1], line[2], line[3], line[7], line[9], line[10], line[22])

inputfile = sc.textFile('file1.csv').zipWithIndex().filter(lambda (line,rownum): rownum>0).map(lambda (line, rownum): line)


input_data = (inputfile
    .map(lambda line: line.split(","))
    .filter(lambda line: len(line) > 1)
    .map(extract))  # Map to tuples

(train_data, test_data) = input_data.randomSplit([0.8, 0.2])

# converting RDD to dataframe
train_dataframe = train_data.toDF()
# converting to pandas dataframe
train_pandas = train_dataframe.toPandas()
# filtering features
train_pandas_features = train_pandas.iloc[:,:6]
# filtering label
train_pandas_label = train_pandas.iloc[:,6]

train_pandas_features_dict = train_pandas_features.T.to_dict().values()

# encoding features to numeric
vectorizer = DV( sparse = False )
vec_train = vectorizer.fit_transform( train_pandas_features_dict )

When I run print vec_train, I see only 0. in all the feature columns. Something like this:

[[ 0.  0.  0. ...,  0.  0.  0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]

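I think I am making a mistake somewhere, so the encoding is not producing the correct result. What mistake am I making? Also, is there a better way to encode non-numeric features as numeric for the case I described at the top (predicting numeric monthly spending from non-numeric employee data)?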
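
For reference when reading the output above, here is a minimal pure-Python sketch of what one-hot encoding a list of dicts produces. It is not scikit-learn's implementation; the function one_hot_encode and the sample rows are hypothetical. It shows that every row holds exactly one 1.0 per original feature and 0.0 everywhere else, so a wide matrix printed in truncated form can show mostly 0. without the encoding necessarily being wrong. Whether that is what happened here is an assumption, not a confirmed diagnosis.

```python
def one_hot_encode(rows):
    """One-hot encode a list of dicts with string values.

    Returns (feature_names, matrix); each column stands for one "key=value" pair.
    Hypothetical helper for illustration only.
    """
    # Collect every distinct "key=value" pair; sort for a stable column order.
    feature_names = sorted(
        {"{0}={1}".format(k, v) for row in rows for k, v in row.items()}
    )
    column_of = {name: i for i, name in enumerate(feature_names)}

    matrix = []
    for row in rows:
        encoded = [0.0] * len(feature_names)
        for k, v in row.items():
            encoded[column_of["{0}={1}".format(k, v)]] = 1.0
        matrix.append(encoded)
    return feature_names, matrix


# Hypothetical sample rows modelled on the table in the question.
rows = [
    {"level": "Level1", "gender": "Male", "city": "New York"},
    {"level": "Level1", "gender": "Male", "city": "San Fransisco"},
    {"level": "Level5", "gender": "Female", "city": "Stamford"},
]
feature_names, matrix = one_hot_encode(rows)

# Each row sums to the number of original features (3 here).
ones_per_row = [sum(r) for r in matrix]
```

Checking the row sums of the real matrix in the same way is a quick way to tell a sparse but correct encoding from one that is genuinely all zeros.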

Best answer

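In general, if your data can be handled with Pandas data frames and scikit-learn, using Spark seems like serious overkill. Still, if you do use it, it probably makes more sense to use Spark tools all the way. Let's start by indexing your features: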

from pyspark.ml.feature import StringIndexer
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.feature import VectorAssembler

label_col = "x3" # For example

# I assume this comes from your previous question
df = (rdd.map(lambda row: [row[i] for i in columns_num])
    .toDF(("x0", "x1", "x2", "x3")))

# Indexers encode strings with doubles
string_indexers = [
    StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))

    # For classification problems
    # - if you want to use ML you should index label as well
    # - if you want to use MLlib it is not necessary
    # For regression problems you should omit label in the indexing
    # as shown below
    for x in df.columns if x not in {label_col}  # Exclude other columns if needed
]

# Assembles multiple columns into a single vector
assembler = VectorAssembler(
    inputCols=["idx_{0}".format(x) for x in df.columns if x != label_col],
    outputCol="features"
)


pipeline = Pipeline(stages=string_indexers + [assembler])
model = pipeline.fit(df)
indexed = model.transform(df)

The pipeline defined above will create the following data frame:

indexed.printSchema()
## root
## |-- x0: string (nullable = true)
## |-- x1: string (nullable = true)
## |-- x2: string (nullable = true)
## |-- x3: string (nullable = true)
## |-- idx_x0: double (nullable = true)
## |-- idx_x1: double (nullable = true)
## |-- idx_x2: double (nullable = true)
## |-- features: vector (nullable = true)

The features column should be a valid input for mllib.tree.DecisionTree (see SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?).
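
To make the indexing step concrete, here is a minimal pure-Python sketch of frequency-based string indexing, the idea behind StringIndexer (the most frequent value gets index 0.0). It is an illustration, not Spark's implementation. The helper fit_string_index, the sample rows, and the alphabetical tie-breaking are assumptions made for the example. The last line builds a dict in the shape MLlib trees expect for categoricalFeaturesInfo: feature index mapped to the number of categories.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first.

    Hypothetical helper; ties are broken alphabetically here, which is an
    assumption made only to keep the example deterministic.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Hypothetical sample rows: (level, gender, city).
rows = [
    ("Level1", "Male", "New York"),
    ("Level1", "Male", "San Fransisco"),
    ("Level2", "Male", "New York"),
    ("Level5", "Female", "Stamford"),
]

# Fit one index per column, as the list of StringIndexer stages does.
columns = list(zip(*rows))
indexers = [fit_string_index(col) for col in columns]

# Assemble the per-column indices into one feature vector per row.
features = [
    [indexers[j][value] for j, value in enumerate(row)]
    for row in rows
]

# Feature index -> number of distinct categories in that feature.
categorical_features_info = {j: len(ix) for j, ix in enumerate(indexers)}
```

The indices are arbitrary codes rather than magnitudes, which is why tree-based models are usually told which features are categorical instead of treating the codes as ordered numbers.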

You can create labeled points as follows:

from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

label_points = (indexed
    .select(col(label_col).alias("label"), col("features"))
    # Spark 2.0+ removed DataFrame.map, so go through .rdd (also valid on 1.x).
    .rdd
    .map(lambda row: LabeledPoint(row.label, row.features)))
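
In the schema shown earlier the label column is still a string, while a regression needs a numeric label. The following pure-Python sketch shows the pairing step with an explicit cast to float. LabeledRow is a hypothetical stand-in for LabeledPoint, and the feature vectors are made-up sample values; only the labels are taken from the question.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.mllib.regression.LabeledPoint.
LabeledRow = namedtuple("LabeledRow", ["label", "features"])


def to_labeled_rows(labels, feature_vectors):
    """Pair each label with its feature vector, casting the label to float."""
    if len(labels) != len(feature_vectors):
        raise ValueError("labels and feature_vectors must have the same length")
    return [
        LabeledRow(float(label), list(vec))
        for label, vec in zip(labels, feature_vectors)
    ]


# Labels as strings, the way they arrive after splitting a CSV line.
labels = ["3528", "4958", "4958", "1652"]
# Made-up indexed feature vectors, one per label.
feature_vectors = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
]

labeled = to_labeled_rows(labels, feature_vectors)
```

Casting up front makes a malformed label fail early with a ValueError instead of surfacing later during training.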

Regarding python - Issue with encoding non-numeric features to numeric in Spark and IPython, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33981740/
