
dataframe - Difference between DataFrame, Dataset, and RDD in Spark

Reposted. Author: 行者123. Updated: 2023-12-03 06:08:18

I simply wonder what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark?

Can you convert one to the other?

Best Answer

First of all, DataFrame evolved from SchemaRDD.

(deprecated method: toSchemaRDD)

Yes, conversion between a DataFrame and an RDD is absolutely possible.

Below are some sample code snippets.

  • df.rdd returns RDD[Row]

Below are some options to create a DataFrame.

  • 1) yourRddOfRow.toDF converts it to a DataFrame

  • 2) Using createDataFrame of the SQL context

    val df = spark.createDataFrame(rddOfRow, schema)

where the schema can come from some of the options below, as described in a nice SO post:
From a Scala case class and the Scala reflection API

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

val schema = ScalaReflection.schemaFor[YourScalaCaseClass].dataType.asInstanceOf[StructType]

OR using Encoders

import org.apache.spark.sql.Encoders
val mySchema = Encoders.product[MyCaseClass].schema

A schema can also be created using StructType and StructField:

import org.apache.spark.sql.types._

val schema = new StructType()
  .add(StructField("id", StringType, true))
  .add(StructField("col1", DoubleType, true))
  .add(StructField("col2", DoubleType, true)) // etc.


In fact, there are now 3 Apache Spark APIs:


  1. RDD API:

The RDD (Resilient Distributed Dataset) API has been in Spark since the 1.0 release.

The RDD API provides many transformation methods, such as map(), filter(), and reduce() for performing computations on the data. Each of these methods results in a new RDD representing the transformed data. However, these methods are just defining the operations to be performed and the transformations are not performed until an action method is called. Examples of action methods are collect() and saveAsObjectFile().

RDD example:

rdd.filter(_.age > 21)               // transformation
   .map(_.last)                      // transformation
   .saveAsObjectFile("under21.bin")  // action

Example: filtering by attribute with an RDD

rdd.filter(_.age > 21)
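The lazy evaluation described above can be mimicked outside Spark with plain Scala collection views; this is just an analogy, not Spark code, and the Person class and sample data here are made up for illustration:

```scala
case class Person(last: String, age: Int)

val people = List(Person("Smith", 35), Person("Jones", 17))

// Like RDD transformations, operations on a view are only *described* here;
// no filtering or mapping happens yet:
val adults = people.view.filter(_.age > 21).map(_.last)

// Nothing runs until a terminal operation forces evaluation,
// playing the role of an RDD action such as collect():
val result = adults.toList // List("Smith")
```

The same deferred-then-forced pattern is what lets Spark build an execution plan for a whole chain of transformations before running any of it.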
  2. DataFrame API:

    Spark 1.3 introduced a new DataFrame API as part of the Project Tungsten initiative which seeks to improve the performance and scalability of Spark. The DataFrame API introduces the concept of a schema to describe the data, allowing Spark to manage the schema and only pass data between nodes, in a much more efficient way than using Java serialization.

    The DataFrame API is radically different from the RDD API because it is an API for building a relational query plan that Spark’s Catalyst optimizer can then execute. The API is natural for developers who are familiar with building query plans.

    SQL style example:

    df.filter("age > 21")

    Limitations: Because the code refers to data attributes by name, it is not possible for the compiler to catch any errors. If an attribute name is incorrect, the error is only detected at runtime, when the query plan is created.

    Another downside of the DataFrame API is that it is very Scala-centric, and while it does support Java, the support is limited.

    For example, when creating a DataFrame from an existing RDD of Java objects, Spark's Catalyst optimizer cannot infer the schema and assumes that any objects in the DataFrame implement the scala.Product interface. Scala case classes work out of the box because they implement this interface.
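The scala.Product point can be checked directly in plain Scala: every case class implements Product, which exposes its fields generically for schema inference. The Person class below is made up for illustration:

```scala
case class Person(name: String, age: Int)

val p = Person("Alice", 30)

// Case classes implement scala.Product out of the box, which is why
// Catalyst can walk their fields when inferring a schema:
assert(p.isInstanceOf[Product])
assert(p.productArity == 2)            // two fields: name, age
assert(p.productElement(0) == "Alice") // fields are accessible by position
```

A plain Java object offers no such generic field interface, which is why the Java path needs fully bean-compliant classes instead.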

  3. Dataset API:

    The Dataset API, released as an API preview in Spark 1.6, aims to provide the best of both worlds; the familiar object-oriented programming style and compile-time type-safety of the RDD API but with the performance benefits of the Catalyst query optimizer. Datasets also use the same efficient off-heap storage mechanism as the DataFrame API.

    When it comes to serializing data, the Dataset API has the concept of encoders which translate between JVM representations (objects) and Spark’s internal binary format. Spark has built-in encoders which are very advanced in that they generate byte code to interact with off-heap data and provide on-demand access to individual attributes without having to de-serialize an entire object. Spark does not yet provide an API for implementing custom encoders, but that is planned for a future release.
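The core encoder idea, reading a single attribute from a binary representation without deserializing the whole object, can be sketched with a toy fixed layout. This is an illustration only; it is not Spark's actual Tungsten binary format, and the Person class and layout are made up:

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

case class Person(name: String, age: Int)

// Toy "encoder": layout is [age: 4 bytes][name length: 4 bytes][name bytes].
def encode(p: Person): ByteBuffer = {
  val nameBytes = p.name.getBytes(StandardCharsets.UTF_8)
  val buf = ByteBuffer.allocate(8 + nameBytes.length)
  buf.putInt(p.age).putInt(nameBytes.length).put(nameBytes)
  buf
}

// On-demand access: read just the age field from the binary form,
// without materializing a Person object.
def readAge(buf: ByteBuffer): Int = buf.getInt(0)

val binary = encode(Person("Alice", 30))
assert(readAge(binary) == 30)
```

Spark's real encoders generate bytecode to do this kind of field-level access against its off-heap format, which is what makes them so much cheaper than whole-object Java serialization.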

    Additionally, the Dataset API is designed to work equally well with both Java and Scala. When working with Java objects, it is important that they are fully bean-compliant.

    Dataset API SQL style example:

    dataset.filter(_.age < 21);
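The compile-time safety difference between the two filter styles can be mimicked in plain Scala; this is an analogy with a made-up Person class, not Spark itself:

```scala
case class Person(name: String, age: Int)

// DataFrame style: attributes are referenced by string name, so a typo
// such as "aeg" only surfaces at runtime (here: an empty lookup).
val row: Map[String, Any] = Map("name" -> "Alice", "age" -> 30)
assert(row.get("aeg").isEmpty)

// Dataset style: attributes are typed members, so the same typo
// (p.aeg instead of p.age) would be rejected at compile time.
val p = Person("Alice", 30)
assert(p.age == 30)
```

This is exactly the trade-off the answer describes: df.filter("age > 21") can hide a misspelled column until the query plan is built, while dataset.filter(_.age < 21) will not compile with a wrong field name.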

    Evaluation differences between DataFrame and Dataset:

    Catalyst-level flow (from the "Demystifying DataFrame and Dataset" Spark Summit presentation):

    Further reading: Databricks article - A Tale of Three Apache Spark APIs: RDDs vs DataFrames and Datasets

    Regarding "dataframe - Difference between DataFrame, Dataset, and RDD in Spark", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37301226/
