gpt4 book ai didi

python - 将数据框放入 randomForest pyspark

转载 作者:太空宇宙 更新时间:2023-11-03 13:32:41 25 4
gpt4 key购买 nike

我有一个看起来像这样的 DataFrame:

+--------------------+------------------+
| features| labels |
+--------------------+------------------+
|[-0.38475, 0.568...]| label1 |
|[0.645734, 0.699...]| label2 |
| ..... | ... |
+--------------------+------------------+

两列都是字符串类型(StringType()),我想将其放入spark ml randomForest中。为此,我需要将特征列转换为包含 float 的向量。有没有人知道如何做到这一点?

最佳答案

如果您使用的是 Spark 2.x,我相信这就是您所需要的:

from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors
from pyspark.ml.linalg import VectorUDT
from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame([("[-0.38475, 0.568]", "label1"), ("[0.645734, 0.699]", "label2")], ("features", "label"))

def parse(s):
try:
return Vectors.parse(s).asML()
except:
return None

parse_ = udf(parse, VectorUDT())

parsed = df.withColumn("features", parse_("features"))

indexer = StringIndexer(inputCol="label", outputCol="label_indexed")

indexer.fit(parsed).transform(parsed).show()
## +----------------+------+-------------+
## | features| label|label_indexed|
## +----------------+------+-------------+
## |[-0.38475,0.568]|label1| 0.0|
## |[0.645734,0.699]|label2| 1.0|
## +----------------+------+-------------+

对于 Spark 1.6,这并没有太大的不同:

from pyspark.sql.functions import udf
from pyspark.ml.feature import StringIndexer
from pyspark.mllib.linalg import Vectors, VectorUDT

df = sqlContext.createDataFrame([("[-0.38475, 0.568]", "label1"), ("[0.645734, 0.699]", "label2")], ("features", "label"))

parse_ = udf(Vectors.parse, VectorUDT())

parsed = df.withColumn("features", parse_("features"))

indexer = StringIndexer(inputCol="label", outputCol="label_indexed")

indexer.fit(parsed).transform(parsed).show()
## +----------------+------+-------------+
## | features| label|label_indexed|
## +----------------+------+-------------+
## |[-0.38475,0.568]|label1| 0.0|
## |[0.645734,0.699]|label2| 1.0|
## +----------------+------+-------------+

Vectors 有一个parse 函数,可以帮助您实现您想要做的事情。

关于python - 将数据框放入 randomForest pyspark,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/44325555/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com