
scala - Spark : Programmatically creating dataframe schema in scala

Reposted. Author: 行者123. Updated: 2023-12-04 01:36:16

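I have a smallish dataset that will be the result of a Spark job. For convenience at the end of the job I am thinking about converting this dataset to a DataFrame, but I have been struggling to define the schema correctly. The problem is the last field below (topValues); it is an ArrayBuffer of tuples, each holding a key and a count.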

val innerSchema =
  StructType(
    Array(
      StructField("value", StringType),
      StructField("count", LongType)
    )
  )
val outputSchema =
  StructType(
    Array(
      StructField("name", StringType, nullable=false),
      StructField("index", IntegerType, nullable=false),
      StructField("count", LongType, nullable=false),
      StructField("empties", LongType, nullable=false),
      StructField("nulls", LongType, nullable=false),
      StructField("uniqueValues", LongType, nullable=false),
      StructField("mean", DoubleType),
      StructField("min", DoubleType),
      StructField("max", DoubleType),
      StructField("topValues", innerSchema)
    )
  )

val result = stats.columnStats.map { c =>
  Row(c._2.name, c._1, c._2.count, c._2.empties, c._2.nulls, c._2.uniqueValues, c._2.mean, c._2.min, c._2.max, c._2.topValues.topN)
}

val rdd = sc.parallelize(result.toSeq)

val outputDf = sqlContext.createDataFrame(rdd, outputSchema)

outputDf.show()

The error I get is a MatchError: scala.MatchError: ArrayBuffer((10,2), (20,3), (8,1)) (of class scala.collection.mutable.ArrayBuffer)
When I debug and inspect my objects, I see this:
rdd: ParallelCollectionRDD[2]
rdd.data: "ArrayBuffer" size = 2
rdd.data(0): [age,2,6,0,0,3,14.666666666666666,8.0,20.0,ArrayBuffer((10,2), (20,3), (8,1))]
rdd.data(1): [gender,3,6,0,0,2,0.0,0.0,0.0,ArrayBuffer((M,4), (F,2))]

It seems to me that my innerSchema accurately describes the ArrayBuffer of tuples, but Spark disagrees.

Any idea how I should define the schema?

Best answer

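import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import scala.collection.mutable.ArrayBuffer

// A sequence-valued column is declared as ArrayType in the schema.
val rdd = sc.parallelize(Array(Row(ArrayBuffer(1,2,3,4))))
val df = sqlContext.createDataFrame(
  rdd,
  StructType(Seq(StructField("arr", ArrayType(IntegerType, false), false)))
)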

df.printSchema
root
 |-- arr: array (nullable = false)
 |    |-- element: integer (containsNull = false)

df.show
+------------+
|         arr|
+------------+
|[1, 2, 3, 4]|
+------------+
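
The answer above demonstrates ArrayType with a plain integer array. Applied to the topValues field from the question, a minimal sketch might look like the following. Several things here are assumptions rather than part of the original post: it uses SparkSession (Spark 2.x or later) instead of the sqlContext shown above, it assumes the tuple keys are strings and the counts are longs, it trims the schema to two fields for brevity, and the object name TopValuesSchemaSketch and the sample data layout are hypothetical. Each tuple is converted to a Row so that the data lines up with the struct element type declared in the schema.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
import scala.collection.mutable.ArrayBuffer

// Hypothetical object name; a sketch, not the original poster's code.
object TopValuesSchemaSketch {
  // Element type of topValues: one (value, count) pair.
  val innerSchema: StructType = StructType(Seq(
    StructField("value", StringType),
    StructField("count", LongType)
  ))

  // Trimmed-down subset of the question's schema (assumption: two fields only).
  // topValues is declared as an array of structs, not as a bare struct.
  val outputSchema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = false),
    StructField("topValues", ArrayType(innerSchema, containsNull = false))
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("topValues-sketch")
      .getOrCreate()

    // Sample values copied from the debugger output in the question;
    // the String/Long typing is an assumption.
    val stats = Seq(
      ("age", ArrayBuffer(("10", 2L), ("20", 3L), ("8", 1L))),
      ("gender", ArrayBuffer(("M", 4L), ("F", 2L)))
    )

    // Convert every tuple to a Row so it matches the struct element type.
    val rows = stats.map { case (name, topN) =>
      Row(name, topN.map { case (value, count) => Row(value, count) }.toSeq)
    }

    val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), outputSchema)
    df.printSchema()
    df.show(truncate = false)

    spark.stop()
  }
}
```

In the question's own code, the equivalent change would be to declare the last field as StructField("topValues", ArrayType(innerSchema)) and to map c._2.topValues.topN to a sequence of Row values before building the outer Row.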

For more on scala - Spark : Programmatically creating dataframe schema in scala, see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/36317002/
