
scala - How to replace null values in a Vector column?

Reposted. Author: 行者123. Updated: 2023-12-01 09:23:45

I have a column of type [vector] that contains null values I cannot get rid of. Here is an example:

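// Import Vector as well; otherwise the name resolves to scala.collection.immutable.Vector
import org.apache.spark.mllib.linalg.{Vector, Vectors}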

val sv1: Vector = Vectors.sparse(58, Array(8, 45), Array(1.0, 1.0))
val df_1 = sc.parallelize(List(("id_1", sv1))).toDF("id", "feature_vector")
val df_2 = sc.parallelize(List(("id_1", 10.0), ("id_2", 10.0))).toDF("id", "numeric_feature")

val df_joined = df_1.join(df_2, Seq("id"), "right")

df_joined.show()

+----+--------------------+---------------+
|  id|      feature_vector|numeric_feature|
+----+--------------------+---------------+
|id_1|(58,[8,45],[1.0,1...|           10.0|
|id_2|                null|           10.0|
+----+--------------------+---------------+

What I would like to do:

val map = Map("feature_vector" -> sv1)
val result = df_joined.na.fill(map)

But this throws an error:

Message: Unsupported value type org.apache.spark.mllib.linalg.SparseVector ((58,[8,45],[1.0,1.0])).

Other things I have tried:

df_joined.withColumn("feature_vector", when(col("feature_vector").isNull, sv1).otherwise(sv1)).show

(taken from the question "how to filter out a null value from spark dataframe")

I am struggling to find a solution that works for Spark 1.6.

Best answer

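coalesce and join should do the trick. Since na.fill rejects Vector values (hence the error above), the default vector is attached to every row as a temporary column through a join with a one-row DataFrame, and coalesce then keeps the first non-null value of the two columns: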

import org.apache.spark.sql.functions.{coalesce, broadcast}

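// One-row DataFrame holding the replacement vector
val fill = Seq(
  Tuple1(Vectors.sparse(58, Array(8, 45), Array(1.0, 1.0)))
).toDF("fill")

df_joined
  .join(broadcast(fill))  // join without a condition: every row gets the "fill" column
  .withColumn("feature_vector", coalesce($"feature_vector", $"fill"))  // fill only where null
  .drop("fill")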
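
The same idea can be wrapped in a small reusable function. This is only a sketch under stated assumptions: fillVectorColumn and the temporary column name "__fill" are hypothetical names introduced here, not part of Spark, and the code assumes the df_joined and sv1 values from the question are in scope.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{broadcast, coalesce, col}

// Hypothetical helper (not a Spark API): replaces nulls in a vector column with `default`.
def fillVectorColumn(df: DataFrame, column: String, default: Vector): DataFrame = {
  // Temporary column name; assumed not to clash with an existing column in `df`.
  val fillCol = "__fill"
  // One-row DataFrame that holds the default vector.
  val fill = df.sqlContext.createDataFrame(Seq(Tuple1(default))).toDF(fillCol)

  df.join(broadcast(fill))                                   // no join condition: each row gets the default
    .withColumn(column, coalesce(col(column), col(fillCol))) // keep existing vectors, fill only nulls
    .drop(fillCol)
}

val result = fillVectorColumn(df_joined, "feature_vector", sv1)
```

The broadcast hint is reasonable here because the fill DataFrame has a single row. On Spark 2.1 and later, a join without a condition is normally written explicitly as crossJoin, so the join line would likely need adjusting there.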

Regarding "scala - How to replace null values in a Vector column?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50742385/
