
scala - Why doesn't from_utc_timestamp throw an error when passed a malformed timezone string in Spark?

Reposted · Author: 行者123 · Updated: 2023-12-05 05:01:56

When calling the from_utc_timestamp function in Spark 2.4.3, passing a malformed timezone string does not raise any error. Instead, it silently defaults to UTC, which is the opposite of what I expected and seems likely to let mistakes go unnoticed. Is this intentional, or is it a bug in Spark?

See the example below:

scala> val df =  Seq(("2020-01-01 00:00:00")).toDF("date")
df: org.apache.spark.sql.DataFrame = [date: string]

scala> df.show()
+-------------------+
|               date|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+

// Not a real timezone obviously. Just gets treated like UTC.
scala> df.withColumn("est", from_utc_timestamp(col("date"), "not_a_real_timezone")).show()
+-------------------+-------------------+
|               date|                est|
+-------------------+-------------------+
|2020-01-01 00:00:00|2020-01-01 00:00:00|
+-------------------+-------------------+

// Typo for EST5EDT, so still not a real timezone. Also defaults to UTC, which makes it
// very easy to miss this mistake.
scala> df.withColumn("est", from_utc_timestamp(col("date"), "EST5PDT")).show()
+-------------------+-------------------+
|               date|                est|
+-------------------+-------------------+
|2020-01-01 00:00:00|2020-01-01 00:00:00|
+-------------------+-------------------+

// EST5EDT is a real timezone, so this works as expected.
scala> df.withColumn("est", from_utc_timestamp(col("date"), "EST5EDT")).show()
+-------------------+-------------------+
|               date|                est|
+-------------------+-------------------+
|2020-01-01 00:00:00|2019-12-31 19:00:00|
+-------------------+-------------------+

Best Answer

from_utc_timestamp uses DateTimeUtils from org.apache.spark.sql.catalyst.util. To resolve the time zone it calls the getTimeZone method, which is backed by java.util.TimeZone.getTimeZone; that JDK method does not throw for an unrecognized ID but silently returns the GMT zone instead.
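You can see the JDK fallback directly, without Spark, using the same zone strings as the examples above; this is a minimal sketch of why the malformed IDs behave like UTC:

```scala
import java.util.TimeZone

// java.util.TimeZone.getTimeZone falls back to the GMT zone for any
// ID it does not recognize, rather than throwing an exception.
val unknown = TimeZone.getTimeZone("not_a_real_timezone")
val typo    = TimeZone.getTimeZone("EST5PDT") // typo for EST5EDT
val valid   = TimeZone.getTimeZone("EST5EDT")

println(unknown.getID) // GMT
println(typo.getID)    // GMT
println(valid.getID)   // EST5EDT
```

Because the fallback happens inside the JDK, Spark 2.4 never sees an error to propagate, which matches the silent-UTC behavior in the question.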

  • It may be a JVM design choice, since the JVM tries to avoid depending on the system's default locale, charset, and time zone
  • It may be a Spark issue worth filing in Jira

But looking at other codebases, some do validate the ID up front:

import java.util.TimeZone

...

if (!TimeZone.getAvailableIDs().contains(tz)) {
throw new IllegalStateException(s"The setting '$tz' is not recognized as known time zone")
}
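The check above can be wrapped into a small fail-fast helper and applied before the string ever reaches from_utc_timestamp; requireKnownTimeZone is a hypothetical name, and this is just a sketch of that pattern:

```scala
import java.util.TimeZone

// Hypothetical helper: reject unknown zone IDs up front, so they are
// never silently treated as GMT by from_utc_timestamp downstream.
def requireKnownTimeZone(tz: String): String = {
  if (!TimeZone.getAvailableIDs().contains(tz))
    throw new IllegalStateException(
      s"The setting '$tz' is not recognized as known time zone")
  tz
}

requireKnownTimeZone("EST5EDT")   // returns "EST5EDT"
// requireKnownTimeZone("EST5PDT") // would throw IllegalStateException
```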

Edit 1: I just found out this is a "feature". It is documented in the 3.0.0 migration guide:

In Spark version 2.4 and earlier, invalid time zone ids are silently ignored and replaced by GMT time zone, for example, in the from_utc_timestamp function. Since Spark 3.0, such time zone ids are rejected, and Spark throws java.time.DateTimeException.

https://spark.apache.org/docs/3.0.0-preview/sql-migration-guide.html

Regarding scala - Why doesn't from_utc_timestamp throw an error when passed a malformed timezone string in Spark?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62437770/
