
scala - Summing vector columns in Spark

Reposted · Author: 行者123 · Updated: 2023-12-04 01:47:28

I have a dataframe with multiple columns containing vectors (the number of vector columns is dynamic). I need to create a new column that is the sum of all the vector columns, and I'm having a hard time getting this done. Here is the code that generates the sample dataset I'm testing with.

import org.apache.spark.ml.feature.VectorAssembler

val temp1 = spark.createDataFrame(Seq(
  (1, 1.0, 0.0, 4.7, 6, 0.0),
  (2, 1.0, 0.0, 6.8, 6, 0.0),
  (3, 1.0, 1.0, 7.8, 5, 0.0),
  (4, 0.0, 1.0, 4.1, 7, 0.0),
  (5, 1.0, 0.0, 2.8, 6, 1.0),
  (6, 1.0, 1.0, 6.1, 5, 0.0),
  (7, 0.0, 1.0, 4.9, 7, 1.0),
  (8, 1.0, 0.0, 7.3, 6, 0.0)))
  .toDF("id", "f1", "f2", "f3", "f4", "label")

val assembler1 = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("vec1")

val temp2 = assembler1.setHandleInvalid("skip").transform(temp1)

val assembler2 = new VectorAssembler()
  .setInputCols(Array("f2", "f3", "f4"))
  .setOutputCol("vec2")

val df = assembler2.setHandleInvalid("skip").transform(temp2)

This gives me the following dataset:
+---+---+---+---+---+-----+-------------+-------------+
| id| f1| f2| f3| f4|label| vec1| vec2|
+---+---+---+---+---+-----+-------------+-------------+
| 1|1.0|0.0|4.7| 6| 0.0|[1.0,0.0,4.7]|[0.0,4.7,6.0]|
| 2|1.0|0.0|6.8| 6| 0.0|[1.0,0.0,6.8]|[0.0,6.8,6.0]|
| 3|1.0|1.0|7.8| 5| 0.0|[1.0,1.0,7.8]|[1.0,7.8,5.0]|
| 4|0.0|1.0|4.1| 7| 0.0|[0.0,1.0,4.1]|[1.0,4.1,7.0]|
| 5|1.0|0.0|2.8| 6| 1.0|[1.0,0.0,2.8]|[0.0,2.8,6.0]|
| 6|1.0|1.0|6.1| 5| 0.0|[1.0,1.0,6.1]|[1.0,6.1,5.0]|
| 7|0.0|1.0|4.9| 7| 1.0|[0.0,1.0,4.9]|[1.0,4.9,7.0]|
| 8|1.0|0.0|7.3| 6| 0.0|[1.0,0.0,7.3]|[0.0,7.3,6.0]|
+---+---+---+---+---+-----+-------------+-------------+

If I needed the sum of regular (numeric) columns, I could do it with something like:
import org.apache.spark.sql.functions.col

// namesOfColumnsToSum is a Seq[String] with the names of the columns to add up
df.withColumn("sum", namesOfColumnsToSum.map(col).reduce((c1, c2) => c1 + c2))

I know I can sum DenseVectors with Breeze using just the "+" operator:
import breeze.linalg._
val v1 = DenseVector(1,2,3)
val v2 = DenseVector(5,6,7)
v1+v2

The code above gives me the expected vector. But I'm not sure how to take the sum of the vector columns, i.e. how to add the vec1 and vec2 columns together.

I did try the suggestions mentioned here, but had no luck.

Best Answer

Here is my take, but coded in PySpark. Maybe someone can help translate it into Scala:

from pyspark.ml.linalg import Vectors, VectorUDT
import numpy as np
from pyspark.sql.functions import udf, array

# Sum a list of vectors element-wise and return a dense vector
def vector_sum(arr):
    return Vectors.dense(np.sum(arr, axis=0))

vector_sum_udf = udf(vector_sum, VectorUDT())

df = df.withColumn('sum', vector_sum_udf(array(['vec1', 'vec2'])))
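
A rough Scala equivalent of the above (an untested sketch; the vectorSum UDF name and the vecCols list are ours, and it assumes all vector columns have the same dimension, as vec1 and vec2 do here) could look like this:

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{array, col, udf}

// Element-wise sum of a sequence of ml vectors (all assumed to have the same length)
val vectorSum = udf { (vs: Seq[Vector]) =>
  Vectors.dense(
    vs.map(_.toArray)
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  )
}

// The list of vector column names can be built dynamically
val vecCols = Array("vec1", "vec2")
val result = df.withColumn("sum", vectorSum(array(vecCols.map(col): _*)))
result.select("id", "vec1", "vec2", "sum").show(false)

Collecting the vector columns into a single array column first lets the UDF handle an arbitrary number of vector columns, mirroring the PySpark version.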

Regarding scala - summing vector columns in Spark, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54697620/
