
scala - Schema comparison of two dataframes in Scala


I am trying to write some test cases to validate the data between the source (.csv) file and the target (Hive table). One of the validations is structure validation of the table.

I have loaded the .csv data (using a defined schema) into one dataframe and extracted the Hive table data into another dataframe.
Now, when I try to compare the schemas of the two dataframes, it returns false. Not sure why. Any ideas?

Source dataframe schema:

scala> res39.printSchema
root
 |-- datetime: timestamp (nullable = true)
 |-- load_datetime: timestamp (nullable = true)
 |-- source_bank: string (nullable = true)
 |-- emp_name: string (nullable = true)
 |-- header_row_count: integer (nullable = true)
 |-- emp_hours: double (nullable = true)


Target dataframe schema:

scala> targetRawData.printSchema
root
 |-- datetime: timestamp (nullable = true)
 |-- load_datetime: timestamp (nullable = true)
 |-- source_bank: string (nullable = true)
 |-- emp_name: string (nullable = true)
 |-- header_row_count: integer (nullable = true)
 |-- emp_hours: double (nullable = true)


When I compare them, it returns false:

scala> res39.schema == targetRawData.schema
res47: Boolean = false
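
Note that StructType equality compares the full StructField objects, which include per-column metadata that printSchema does not display, so two schemas can print identically and still compare unequal. A minimal diagnostic sketch, assuming the same res39 and targetRawData as above, that prints any field pair that differs, metadata included:

// Sketch: pair up the fields of the two schemas and show any pair that is
// not equal; StructField equality includes metadata, which printSchema hides.
res39.schema.fields.zip(targetRawData.schema.fields).foreach { case (s, t) =>
  if (s != t) println(s"source: $s (metadata ${s.metadata})  target: $t (metadata ${t.metadata})")
}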


The data in the two dataframes looks like this:

scala> res39.show
+-------------------+-------------------+-----------+--------+----------------+---------+
| datetime| load_datetime|source_bank|emp_name|header_row_count|emp_hours|
+-------------------+-------------------+-----------+--------+----------------+---------+
|2017-01-01 01:02:03|2017-01-01 01:02:03| RBS| Naveen | 100| 15.23|
|2017-03-15 01:02:03|2017-03-15 01:02:03| RBS| Naveen | 100| 115.78|
|2015-04-02 23:24:25|2015-04-02 23:24:25| RBS| Arun | 200| 2.09|
|2010-05-28 12:13:14|2010-05-28 12:13:14| RBS| Arun | 100| 30.98|
|2018-06-04 10:11:12|2018-06-04 10:11:12| XZX| Arun | 400| 12.0|
+-------------------+-------------------+-----------+--------+----------------+---------+


scala> targetRawData.show
+-------------------+-------------------+-----------+--------+----------------+---------+
| datetime| load_datetime|source_bank|emp_name|header_row_count|emp_hours|
+-------------------+-------------------+-----------+--------+----------------+---------+
|2017-01-01 01:02:03|2017-01-01 01:02:03| RBS| Naveen| 100| 15.23|
|2017-03-15 01:02:03|2017-03-15 01:02:03| RBS| Naveen| 100| 115.78|
|2015-04-02 23:25:25|2015-04-02 23:25:25| RBS| Arun| 200| 2.09|
|2010-05-28 12:13:14|2010-05-28 12:13:14| RBS| Arun| 100| 30.98|
+-------------------+-------------------+-----------+--------+----------------+---------+


The complete code is shown below:

//import org.apache.spark
import org.apache.spark.sql.hive._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions.{to_date, to_timestamp}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.text._
import java.util.Date
import scala.util._
import org.apache.spark.sql.hive.HiveContext

//val conf = new SparkConf().setAppName("Simple Application")
//val sc = new SparkContext(conf)
val hc = new HiveContext(sc)
val spark: SparkSession = SparkSession.builder().appName("Simple Application").config("spark.master", "local").getOrCreate()

// set source and target location
val sourceDataLocation = "hdfs://localhost:9000/source.txt"
val targetTableName = "TableA"

// Extract source data
println("Extracting SAS source data from csv file location " + sourceDataLocation);
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val sourceRawCsvData = sc.textFile(sourceDataLocation)

println("Extracting target data from hive table " + targetTableName)
val targetRawData = hc.sql("Select datetime,load_datetime,trim(source_bank) as source_bank,trim(emp_name) as emp_name,header_row_count, emp_hours from " + targetTableName)


// Add the test cases here
// Test 2 - Validate the Structure
val headerColumns = sourceRawCsvData.first().split(",").to[List]
val schema = TableASchema(headerColumns)

val data = sourceRawCsvData.mapPartitionsWithIndex((index, element) => if (index == 0) element.drop(1) else element)
  .map(_.split(",").toList)
  .map(row)

val dataFrame = spark.createDataFrame(data,schema)
val sourceDataFrame = dataFrame.toDF(dataFrame.columns map(_.toLowerCase): _*)
data.collect
data.getClass
// Test 3 - Validate the data
// Test 4 - Calculate the average and variance of Int or Dec columns
// Test 5 - Test 5

def UpdateResult(tableName: String, returnCode: Int, description: String) {
  val insertString = "INSERT INTO TestResult VALUES('" + tableName + "', " + returnCode + ",'" + description + "')"
  val a = hc.sql(insertString)
}


def TableASchema(columnName: List[String]): StructType = {
  StructType(
    Seq(
      StructField(name = "datetime", dataType = TimestampType, nullable = true),
      StructField(name = "load_datetime", dataType = TimestampType, nullable = true),
      StructField(name = "source_bank", dataType = StringType, nullable = true),
      StructField(name = "emp_name", dataType = StringType, nullable = true),
      StructField(name = "header_row_count", dataType = IntegerType, nullable = true),
      StructField(name = "emp_hours", dataType = DoubleType, nullable = true)
    )
  )
}

def row(line: List[String]): Row = {
  Row(convertToTimestamp(line(0).trim), convertToTimestamp(line(1).trim), line(2).trim, line(3).trim, line(4).toInt, line(5).toDouble)
}


def convertToTimestamp(s: String): Timestamp = s match {
  case "" => null
  case _ =>
    val format = new SimpleDateFormat("ddMMMyyyy:HH:mm:ss")
    Try(new Timestamp(format.parse(s).getTime)) match {
      case Success(t) => t
      case Failure(_) => null
    }
}

Best Answer

Building on @Derek Kaknes's answer, here is the solution I came up with for comparing schemas, looking only at column names, data types and nullability, and ignoring metadata:

// Extract relevant information: name (key), type & nullability (values) of columns
def getCleanedSchema(df: DataFrame): Map[String, (DataType, Boolean)] = {
  df.schema.map { (structField: StructField) =>
    structField.name.toLowerCase -> (structField.dataType, structField.nullable)
  }.toMap
}

// Compare relevant information
def getSchemaDifference(schema1: Map[String, (DataType, Boolean)],
                        schema2: Map[String, (DataType, Boolean)]
                       ): Map[String, (Option[(DataType, Boolean)], Option[(DataType, Boolean)])] = {
  (schema1.keys ++ schema2.keys)
    .map(_.toLowerCase)
    .toList.distinct
    .flatMap { (columnName: String) =>
      val schema1FieldOpt: Option[(DataType, Boolean)] = schema1.get(columnName)
      val schema2FieldOpt: Option[(DataType, Boolean)] = schema2.get(columnName)

      if (schema1FieldOpt == schema2FieldOpt) None
      else Some(columnName -> (schema1FieldOpt, schema2FieldOpt))
    }.toMap
}



The getCleanedSchema method extracts the information of interest - column data type and nullability - and returns a map from column name to a tuple of the two.
The getSchemaDifference method returns a map containing only the columns that differ between the two schemas. If a column is missing from one of the schemas, its corresponding attribute is None.
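
A short usage sketch (hypothetical, reusing the sourceDataFrame and targetRawData names from the question's code) showing how the two methods combine into a pass/fail structure check:

// Hypothetical usage with the dataframes from the question:
val diff = getSchemaDifference(getCleanedSchema(sourceDataFrame), getCleanedSchema(targetRawData))
if (diff.isEmpty) println("Schemas match")
else diff.foreach { case (col, (src, tgt)) => println(s"$col: source=$src, target=$tgt") }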

For this question on schema comparison of two dataframes in Scala, see the original thread on Stack Overflow: https://stackoverflow.com/questions/47862974/
