
scala - Spark 2.0.0 : How to aggregate DataSet with custom encoded types?

Reposted. Author: 行者123. Updated: 2023-12-03 07:12:32

I store some data as a Dataset[(Long, LineString)], using a tuple encoder together with a kryo encoder for LineString:

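import scala.reflect.ClassTag
import org.apache.spark.sql.{Encoder, Encoders}
import com.vividsolutions.jts.geom.{Coordinate, LineString}

// Fallback: encode any type with Kryo.
implicit def single[A](implicit c: ClassTag[A]): Encoder[A] = Encoders.kryo[A](c)
// Build a tuple encoder from the encoders of its two components.
implicit def tuple2[A1, A2](implicit
    e1: Encoder[A1],
    e2: Encoder[A2]
): Encoder[(A1, A2)] = Encoders.tuple[A1, A2](e1, e2)
implicit val lineStringEncoder = Encoders.kryo[LineString]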

// gf is assumed to be a JTS GeometryFactory defined elsewhere.
val ds = segmentPoints.map(
  sp => {
    val p1 = new Coordinate(sp.lon_ini, sp.lat_ini)
    val p2 = new Coordinate(sp.lon_fin, sp.lat_fin)
    val coords = Array(p1, p2)

    (sp.id, gf.createLineString(coords))
  })
  .toDF("id", "segment")
  .as[(Long, LineString)]
  .cache

ds.show

+----+--------------------+
|  id|             segment|
+----+--------------------+
| 347|[01 00 63 6F 6D 2...|
| 347|[01 00 63 6F 6D 2...|
| 347|[01 00 63 6F 6D 2...|
| 808|[01 00 63 6F 6D 2...|
| 808|[01 00 63 6F 6D 2...|
| 808|[01 00 63 6F 6D 2...|
+----+--------------------+

I can apply any map operation to the segment column and use the underlying LineString methods.

ds.map(_._2.getClass.getName).show(false)

+--------------------------------------+
|value                                 |
+--------------------------------------+
|com.vividsolutions.jts.geom.LineString|
|com.vividsolutions.jts.geom.LineString|
|com.vividsolutions.jts.geom.LineString|

I want to create some UDAFs to process the segments that share the same id. I tried the following two approaches, without success:

1) Using an Aggregator:

val length = new Aggregator[LineString, Double, Double] with Serializable {
  def zero: Double = 0 // The initial value.
  def reduce(b: Double, a: LineString) = b + a.getLength // Add an element to the running total.
  def merge(b1: Double, b2: Double) = b1 + b2 // Merge intermediate values.
  def finish(b: Double) = b
  // The following lines are missing from the API doc example but are
  // necessary to get the code to compile.
  override def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}.toColumn

ds.groupBy("id")
  .agg(length(col("segment")).as("kms"))
  .show(false)

Here I get the following error:

 Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate [id#603L], [id#603L, anon$1(com.test.App$$anon$1@5bf1e07, None, input[0, double, true] AS value#715, cast(value#715 as double), input[0, double, true] AS value#714, DoubleType, DoubleType)['segment] AS kms#721];

2) Using a UserDefinedAggregateFunction:

class Length extends UserDefinedAggregateFunction {
  val e = Encoders.kryo[LineString]

  // These are the input fields for the aggregate function.
  override def inputSchema: StructType = StructType(
    StructField("segment", DataTypes.BinaryType) :: Nil
  )

  // These are the internal fields kept for computing the aggregate.
  override def bufferSchema: StructType = StructType(
    StructField("length", DoubleType) :: Nil
  )

  // This is the output type of the aggregation function.
  override def dataType: DataType = DoubleType

  override def deterministic: Boolean = true

  // This is the initial value for the buffer schema.
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0.0
  }

  // This is how to update the buffer given an input.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    // val l0 = input.getAs[LineString](0) // Can't cast to LineString (I guess because it is serialized using the given encoder)
    val b = input.getAs[Array[Byte]](0) // This works fine
    val lse = e.asInstanceOf[ExpressionEncoder[LineString]]
    val ls = lse.fromRow(???) // it expects an InternalRow but input is a Row instance
    // I also tried casting b.asInstanceOf[InternalRow], without success.
    buffer(0) = buffer.getAs[Double](0) + ls.getLength
  }

  // This is how to merge two objects with the bufferSchema type.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Double](0) + buffer2.getAs[Double](0)
  }

  // This is where the final value is output, given the final value of the buffer.
  override def evaluate(buffer: Row): Any = {
    buffer.getDouble(0)
  }
}

val length = new Length
rseg
  .groupBy("id")
  .agg(length(col("segment")).as("kms"))
  .show(false)

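What am I doing wrong? I would like to use the aggregation API with custom types instead of the RDD groupBy API. I searched the Spark documentation but could not find an answer to this problem; support for this seems to be at an early stage for now.

Thanks.

Best answer

According to this answer, there is no easy way to pass a custom encoder for nested types, which in your case is (Long, LineString).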

One option is to define a case class LineStringWithID that extends LineString with an id: Long attribute, and to use the encoder from SQLImplicits.
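A different route that may be worth trying for the Aggregator attempt in the question: Aggregator is a typed API, so it is normally paired with groupByKey and takes the whole Dataset element as its input, rather than being applied to a column under the untyped groupBy. The following is only a minimal sketch under stated assumptions, not a confirmed fix for the error in the question. Segment is a hypothetical stand-in for LineString (so the sketch runs without JTS), and TotalLength, TypedLengthSketch and totals are illustrative names that do not come from the question.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical stand-in for JTS LineString (an assumption, to avoid the JTS dependency).
class Segment(val x1: Double, val y1: Double, val x2: Double, val y2: Double)
    extends Serializable {
  def getLength: Double = math.hypot(x2 - x1, y2 - y1)
}

// Typed aggregator: its input is the whole (id, segment) element, not a Column.
object TotalLength extends Aggregator[(Long, Segment), Double, Double] {
  def zero: Double = 0.0
  def reduce(acc: Double, row: (Long, Segment)): Double = acc + row._2.getLength
  def merge(a: Double, b: Double): Double = a + b
  def finish(acc: Double): Double = acc
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object TypedLengthSketch {
  // Sums segment lengths per id with the typed API and returns them as a Map.
  def totals(spark: SparkSession): Map[Long, Double] = {
    // Explicit encoders: a built-in one for the key, Kryo for the custom type.
    implicit val keyEncoder: Encoder[Long] = Encoders.scalaLong
    implicit val pairEncoder: Encoder[(Long, Segment)] =
      Encoders.tuple(Encoders.scalaLong, Encoders.kryo[Segment])

    val ds = spark.createDataset(Seq(
      (347L, new Segment(0, 0, 3, 4)),
      (347L, new Segment(0, 0, 6, 8)),
      (808L, new Segment(1, 1, 1, 2))
    ))

    // groupByKey keeps the element type, so the Aggregator sees Segment objects.
    ds.groupByKey(_._1)
      .agg(TotalLength.toColumn.name("kms"))
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("typed-length").getOrCreate()
    try totals(spark).foreach { case (id, kms) => println(s"$id -> $kms") }
    finally spark.stop()
  }
}
```

With the real data, the same shape would use LineString in place of Segment and the question's ds in place of the inline Seq.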

P.S. Could you break your question into smaller parts, one topic each?
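For the UserDefinedAggregateFunction attempt in the question, the open point is how to turn the Array[Byte] read from the Row back into an object. The sketch below assumes that the binary column holds bytes written by Spark's own KryoSerializer with a default configuration, which is what Encoders.kryo appears to use internally; if custom Kryo settings or registrators are in play, this may not decode correctly. Segment is again a hypothetical stand-in for LineString, and KryoBytes, decode and roundTripLength are illustrative names.

```scala
import java.nio.ByteBuffer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.{KryoSerializer, SerializerInstance}
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical stand-in for JTS LineString (an assumption, to avoid the JTS dependency).
class Segment(val x1: Double, val y1: Double, val x2: Double, val y2: Double)
    extends Serializable {
  def getLength: Double = math.hypot(x2 - x1, y2 - y1)
}

object KryoBytes {
  // Assumption: a default SparkConf matches the settings used when the bytes were written.
  private val serializer = new KryoSerializer(new SparkConf())

  // SerializerInstance is not thread-safe, so keep one instance per thread.
  private val instance = new ThreadLocal[SerializerInstance] {
    override protected def initialValue(): SerializerInstance = serializer.newInstance()
  }

  // Decodes the raw bytes of a Kryo-encoded column value back into a Segment.
  def decode(bytes: Array[Byte]): Segment =
    instance.get().deserialize[Segment](ByteBuffer.wrap(bytes))
}

object KryoBytesSketch {
  // Writes one Segment through Encoders.kryo, reads the binary column back
  // from the untyped DataFrame, and decodes it by hand.
  def roundTripLength(spark: SparkSession): Double = {
    val ds = spark.createDataset(Seq(new Segment(0, 0, 3, 4)))(Encoders.kryo[Segment])
    val bytes = ds.toDF().collect().head.getAs[Array[Byte]](0)
    KryoBytes.decode(bytes).getLength
  }
}
```

Inside the question's update method, the value b could then be passed to a decode helper of this kind (typed to LineString) instead of going through ExpressionEncoder.fromRow.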

Regarding scala - Spark 2.0.0 : How to aggregate DataSet with custom encoded types?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40909867/
