
python - Converting a PySpark DenseVector to an array


I am trying to convert a PySpark DataFrame column of DenseVector into an array, but I keep getting an error.

from pyspark.ml.linalg import Vectors

data = [(Vectors.dense([8.0, 1.0, 3.0, 2.0, 5.0]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]

df = spark.createDataFrame(data, ["features"])

I tried defining a UDF and using toArray():

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

to_array = udf(lambda x: x.toArray(), ArrayType(FloatType()))
df = df.withColumn('features', to_array('features'))

However, if I then run df.collect(), I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 17.0 failed 4 times, 
most recent failure: Lost task 1.3 in stage 17.0 (TID 100, 10.139.64.6, executor 0):
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict
(for numpy.core.multiarray._reconstruct)

Any idea how to achieve this?

Best Answer

toArray() returns a numpy.ndarray, which cannot be implicitly converted to ArrayType(FloatType()). Additionally call .tolist() to convert it:

import pyspark.sql.functions as F
import pyspark.sql.types as T

# Convert the DenseVector to a numpy array, then to a plain Python list
to_array = F.udf(lambda v: v.toArray().tolist(), T.ArrayType(T.FloatType()))
# Alternatively: to_array = F.udf(lambda v: [float(x) for x in v], T.ArrayType(T.FloatType()))
df = df.withColumn('features', to_array('features'))
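
As a quick sanity check (a minimal sketch, assuming the example DataFrame from the question), the column is now a plain array of floats and collect() no longer raises the pickle error:

# Inspect the schema and pull the converted rows back to the driver
df.printSchema()              # features should now be an array of float elements
rows = df.collect()
print(rows[0]['features'])    # e.g. [8.0, 1.0, 3.0, 2.0, 5.0]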

If you are using PySpark >= 3.0.0, you can use the new vector_to_array function:

from pyspark.ml.functions import vector_to_array
df = df.withColumn('features', vector_to_array('features'))
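
Note that vector_to_array produces double-precision elements by default; if you want 32-bit floats to match the FloatType of the UDF approach, it also accepts a dtype argument (a minimal sketch, assuming the same DataFrame as above):

from pyspark.ml.functions import vector_to_array

# Request 32-bit floats instead of the default 64-bit doubles
df = df.withColumn('features', vector_to_array('features', dtype='float32'))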

Regarding python - Converting a PySpark DenseVector to an array, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58490770/
