scala - How to change the data type of a column in a StructField of a StructType?

Reposted by: 行者123  Updated: 2023-12-02 00:23:56

I am trying to change the data types of columns in a DataFrame read from an RDBMS database. To do that, I obtained the DataFrame's schema as follows:

val dataSchema = dataDF.schema

To view the DataFrame's schema, I used the following statement:

println(dataSchema)

Output: StructType(StructField(je_header_id,LongType,true), StructField(je_line_num,LongType,true), StructField(last_update_date,TimestampType,true), StructField(last_updated_by,DecimalType(15,0),true), StructField(creation_date,TimestampType,true), StructField(created_by,DecimalType(15,0),true), StructField(created_by_name,StringType,true), StructField(entered_dr,DecimalType(38,30),true), StructField(entered_cr,DecimalType(38,30),true))

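My requirement is to find the DecimalType fields in the schema above and change them to DoubleType. I can get the column names and data types using dataDF.dtypes, but it returns them in the format ((columnName1, column datatype), (columnName2, column datatype), ..., (columnNameN, column datatype)).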

I tried to find a way to parse the StructType and change the types in dataSchema, but to no avail.

Could anyone let me know whether there is a way to parse a StructType so that I can change the data types as required and get the following format:

StructType(StructField(je_header_id,LongType,true), StructField(je_line_num,LongType,true), StructField(last_update_date,TimestampType,true), StructField(last_updated_by,DoubleType,true), StructField(creation_date,TimestampType,true), StructField(created_by,DoubleType,true), StructField(created_by_name,StringType,true), StructField(entered_dr,DoubleType,true), StructField(entered_cr,DoubleType,true))

Best answer

To modify a DataFrame schema for a specific data type, you can pattern-match on the dataType of each StructField, as shown below:

import org.apache.spark.sql.types._

val df = Seq(
  (1L, BigDecimal(12.34), "a", BigDecimal(10.001)),
  (2L, BigDecimal(56.78), "b", BigDecimal(20.002))
).toDF("c1", "c2", "c3", "c4")

val newSchema = df.schema.fields.map{
  case StructField(name, _: DecimalType, nullable, _)
    => StructField(name, DoubleType, nullable)
  case field => field
}
// newSchema: Array[org.apache.spark.sql.types.StructField] = Array(
//   StructField(c1,LongType,false), StructField(c2,DoubleType,true),
//   StructField(c3,StringType,true), StructField(c4,DoubleType,true)
// )
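Note that the mapping above yields an Array[StructField] rather than a StructType, and that rebuilding each field with the three-argument StructField constructor drops any column metadata. A minimal sketch of a variant that wraps the result back into a StructType and keeps metadata via copy is given here. The helper name decimalsToDouble and the sample schema are hypothetical and used only for illustration; they are not part of the original answer.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper (not from the original answer): returns a copy of the
// schema in which every top-level DecimalType field is declared as DoubleType.
// Using copy keeps the field's name, nullability and metadata unchanged.
def decimalsToDouble(schema: StructType): StructType =
  StructType(schema.fields.map {
    case f @ StructField(_, _: DecimalType, _, _) => f.copy(dataType = DoubleType)
    case f => f
  })

// Illustrative schema modelled on a few columns from the question.
val sample = StructType(Seq(
  StructField("je_header_id", LongType),
  StructField("created_by", DecimalType(15, 0)),
  StructField("entered_dr", DecimalType(38, 30))
))

val converted = decimalsToDouble(sample)
println(converted.fields.map(f => s"${f.name}: ${f.dataType}").mkString(", "))
```

This only changes the schema description. It does not convert the values stored in an existing DataFrame, so the data itself still needs a cast.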

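However, assuming your ultimate goal is to transform the dataset by changing column types, it is easier to iterate over only the columns of the target data type and cast them one by one, as shown below: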

import org.apache.spark.sql.functions._

val df2 = df.dtypes.
  collect{ case (dn, dt) if dt.startsWith("DecimalType") => dn }.
  foldLeft(df)((accDF, c) => accDF.withColumn(c, col(c).cast("Double")))

df2.printSchema
// root
//  |-- c1: long (nullable = false)
//  |-- c2: double (nullable = true)
//  |-- c3: string (nullable = true)
//  |-- c4: double (nullable = true)
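As an alternative to matching on the type names returned by dtypes, the same cast can be driven by the schema's DataType objects and applied in a single select. The following is only a sketch under some assumptions: the helper name castDecimalsToDouble is hypothetical, the local SparkSession is created just to make the example self-contained, and column names are assumed not to contain dots (which would need backtick quoting).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical helper: casts every top-level DecimalType column to DoubleType
// in one select, leaving all other columns and the column order unchanged.
def castDecimalsToDouble(df: DataFrame): DataFrame =
  df.select(df.schema.fields.map { f =>
    f.dataType match {
      case _: DecimalType => df(f.name).cast(DoubleType).as(f.name)
      case _              => df(f.name)
    }
  }: _*)

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("decimal-to-double")
  .getOrCreate()
import spark.implicits._

val input = Seq(
  (1L, BigDecimal(12.34), "a", BigDecimal(10.001)),
  (2L, BigDecimal(56.78), "b", BigDecimal(20.002))
).toDF("c1", "c2", "c3", "c4")

val resultSchema = castDecimalsToDouble(input).schema
println(resultSchema.treeString)
```

A single select avoids chaining one withColumn call per column, which may matter when many columns are converted.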

[Update]

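Per the additional requirement from the comments, if you only want to change DecimalType columns that have a positive scale, apply a Regex pattern match as the guard condition in the collect method: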

val pattern = """DecimalType\(\d+,(\d+)\)""".r

val df2 = df.dtypes.
  collect{ case (dn, dt) if pattern.findFirstMatchIn(dt).map(_.group(1)).getOrElse("0") != "0" => dn }.
  foldLeft(df)((accDF, c) => accDF.withColumn(c, col(c).cast("Double")))
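The scale can also be read directly from the DecimalType instance instead of being parsed out of the type name. A minimal sketch, assuming a hypothetical helper named positiveScaleDecimals and an illustrative schema taken from the question's columns:

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: names of top-level DecimalType columns whose scale is
// positive, i.e. columns that can actually hold fractional digits.
def positiveScaleDecimals(schema: StructType): Seq[String] =
  schema.fields.collect {
    case StructField(name, d: DecimalType, _, _) if d.scale > 0 => name
  }.toSeq

// Illustrative schema using column definitions from the question.
val questionSchema = StructType(Seq(
  StructField("je_header_id", LongType),
  StructField("last_updated_by", DecimalType(15, 0)),
  StructField("entered_dr", DecimalType(38, 30)),
  StructField("entered_cr", DecimalType(38, 30))
))

val toCast = positiveScaleDecimals(questionSchema)
println(toCast.mkString(", "))
```

The resulting names can then be passed to the same foldLeft over withColumn used above.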

Regarding "scala - How to change the data type of a column in a StructField of a StructType?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/54423030/
