gpt4 book ai didi

scala - 如何解决该异常:java.math.BigDecimal不是在doubled模式上重新在datadframe上应用模式时有效的外部类型?

转载 作者:行者123 更新时间:2023-12-02 19:16:32 25 4
gpt4 key购买 nike

我正在尝试通过以下方式将数据从表:system_releases从Greenplum移至Hive:

val yearDF = spark.read.format("jdbc").option("url", "urltemplate;MaxNumericScale=30;MaxNumericPrecision=40;")
.option("dbtable", s"(${execQuery}) as year2016")
.option("user", "user")
.option("password", "pwd")
.option("partitionColumn","release_number")
.option("lowerBound", 306)
.option("upperBound", 500)
.option("numPartitions",2)
.load()

通过spark推断dataFrame yearDF的模式:
description:string
status_date:timestamp
time_zone:string
table_refresh_delay_min:decimal(38,30)
online_patching_enabled_flag:string
release_number:decimal(38,30)
change_number:decimal(38,30)
interface_queue_enabled_flag:string
rework_enabled_flag:string
smart_transfer_enabled_flag:string
patch_number:decimal(38,30)
threading_enabled_flag:string
drm_gl_source_name:string
reverted_flag:string
table_refresh_delay_min_text:string
release_number_text:string
change_number_text:string

我在 hive 上有相同的表格,具有以下数据类型:
val hiveCols=string,status_date:timestamp,time_zone:string,table_refresh_delay_min:double,online_patching_enabled_flag:string,release_number:double,change_number:double,interface_queue_enabled_flag:string,rework_enabled_flag:string,smart_transfer_enabled_flag:string,patch_number:double,threading_enabled_flag:string,drm_gl_source_name:string,reverted_flag:string,table_refresh_delay_min_text:string,release_number_text:string,change_number_text:string

列: table_refresh_delay_min, release_number, change_number and patch_number给出了太多的小数点,即使GP中没有很多。
因此,我尝试将其保存为CSV文件,以查看spark如何读取数据。
例如,GP上的release_number的最大数量为:306.00,但是在csv文件中,我保存了数据框:yearDF,值变为306.000000000000000000。

我尝试采用配置单元表架构并将其转换为StructType,以将其应用于yearDF,如下所示。
def convertDatatype(datatype: String): DataType = {
val convert = datatype match {
case "string" => StringType
case "bigint" => LongType
case "int" => IntegerType
case "double" => DoubleType
case "date" => TimestampType
case "boolean" => BooleanType
case "timestamp" => TimestampType
}
convert
}

val schemaList = hiveCols.split(",")
val schemaStructType = new StructType(schemaList.map(col => col.split(":")).map(e => StructField(e(0), convertDatatype(e(1)), true)))
val newDF = spark.createDataFrame(yearDF.rdd, schemaStructType)
newDF.write.format("csv").save("hdfs/location")

但是我得到了错误:
Caused by: java.lang.RuntimeException: java.math.BigDecimal is not a valid external type for schema of double
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr8$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
... 17 more

我试图以下面的方式将十进制列转换为DoubleType,但是我仍然面临相同的异常。
  val pattern = """DecimalType\(\d+,(\d+)\)""".r
val df2 = dataDF.dtypes.
collect{ case (dn, dt) if pattern.findFirstMatchIn(dt).map(_.group(1)).getOrElse("0") != "0" => dn }.
foldLeft(dataDF)((accDF, c) => accDF.withColumn(c, col(c).cast("Double")))

Caused by: java.lang.RuntimeException: java.math.BigDecimal is not a valid external type for schema of double
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr8$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
... 17 more

在尝试实现上述两种方法后,我失去了主意。
谁能让我知道如何将数据框的列正确地转换为所需的数据类型?

最佳答案

在这种情况下,将RDD转换为DF时,需要指定与使用的Spark模式完全相同的类型。

例如,当您在printSchema DataFrame上执行yearDF时,您得到了

description:string
status_date:timestamp
time_zone:string
table_refresh_delay_min:decimal(38,30)
online_patching_enabled_flag:string
release_number:decimal(38,30)
change_number:decimal(38,30)
interface_queue_enabled_flag:string
rework_enabled_flag:string
smart_transfer_enabled_flag:string
patch_number:decimal(38,30)
threading_enabled_flag:string
drm_gl_source_name:string
reverted_flag:string
table_refresh_delay_min_text:string
release_number_text:string
change_number_text:string

当您将RDD转换为DF时,对于那些具有 decimal(38,30)的字段, 必须将指定为 DecimalType(38,30)而不是您使用的 DoubleType

希望能帮助到你!

关于scala - 如何解决该异常:java.math.BigDecimal不是在doubled模式上重新在datadframe上应用模式时有效的外部类型?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/54548208/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com