DataFrames, Datasets, and RDDs

Like RDDs, DataFrames and Datasets represent distributed collections, but they carry additional schema information that RDDs lack.

This schema information is used to provide a more efficient storage layer (Tungsten) and lets the optimizer (Catalyst) perform additional optimizations. Beyond the schema, the operations performed on Datasets and DataFrames are structured so that the optimizer can inspect their logical meaning rather than treating them as arbitrary functions. DataFrames are Datasets of a special Row object, which does not provide any compile-time type checking. The strongly typed Dataset API shines especially for RDD-like functional operations. Compared to working with RDDs, DataFrames allow Spark's optimizer to better understand our code and our data, which enables a new class of optimizations we explore in "Query Optimizer".
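As a quick illustration, here is a minimal sketch (assuming an existing SparkSession named spark and a hypothetical two-column DataFrame) of inspecting the schema that Tungsten and Catalyst rely on, and the optimized plan Catalyst produces:

import spark.implicits._   // assumes `spark` is an existing SparkSession

// Hypothetical DataFrame; its schema is what Tungsten and Catalyst work from.
val people = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
people.printSchema()

// Because filter/select describe what to compute rather than how, Catalyst can
// inspect and rewrite the plan; explain() prints the optimized physical plan.
people.filter($"age" > 18).select("name").explain()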

DataFrames were introduced in Spark 1.3, and Datasets in Spark 1.6.

Much as the SparkContext is the entry point for all Spark applications and the StreamingContext is the entry point for all streaming applications, the SparkSession serves as the entry point for Spark SQL (before Spark 2.0, this role was filled by the SQLContext).
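A minimal sketch of creating that entry point (the application name and local master below are placeholders for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-sql-example")      // hypothetical application name
  .master("local[*]")                // assumes a local run for illustration
  .getOrCreate()

// The older entry points remain reachable from the session:
val sc = spark.sparkContext          // SparkContext
val sqlContext = spark.sqlContext    // pre-2.0 SQLContext, kept for compatibility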

DataFrames

Record type information is lost when working with DataFrames as RDDs.

Explanation

DataFrames can be implicitly converted to RDDs of Rows. However, since the Spark SQL Row object is not strongly typed (it can be created from sequences of any value type), the Scala compiler cannot "remember" the type of value used to create the row. Indexing a row returns a value of type Any, which must be cast to a more specific type, such as a String or an Int, to perform most calculations. The type information for the rows is stored in the schema. However, converting to an RDD throws away a DataFrame's schema information, so it is important to bind the DataFrame schema to a variable.
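The following sketch (using a hypothetical two-column DataFrame and assuming an existing SparkSession named spark) shows the schema being bound to a variable before the conversion, and the casting that Row values then require:

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import spark.implicits._                     // assumes `spark` is an existing SparkSession

val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

// Bind the schema first: the RDD[Row] below no longer carries it.
// It can later be reused, e.g. spark.createDataFrame(rows, schema).
val schema = df.schema

val rows: RDD[Row] = df.rdd

// Row values come back as Any, so they must be cast or fetched with a typed getter.
val names = rows.map(row => row.getAs[String]("name"))
val ages  = rows.map(row => row.getInt(1))   // positional access with an explicit type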

Datasets

Record type information is preserved when working with the Dataset API. In other words, Datasets are strongly typed, so the values in each row retain their type information even after conversion to an RDD.

Since Spark 2.0, a DataFrame is a special case of a Dataset: it is a Dataset containing Row objects.
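A minimal sketch of the contrast, using a hypothetical Person case class and assuming an existing SparkSession named spark (in the Scala API, DataFrame is a type alias for Dataset[Row]):

import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.rdd.RDD
import spark.implicits._                  // assumes `spark` is an existing SparkSession

case class Person(name: String, age: Int) // hypothetical domain type

val ds: Dataset[Person] = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()

// Typed, RDD-like functional operations keep compile-time checking...
val adults: Dataset[Person] = ds.filter(p => p.age >= 18)

// ...and the element type survives conversion back to an RDD.
val rdd: RDD[Person] = ds.rdd

// A DataFrame is just a Dataset of Row objects.
val df: DataFrame = ds.toDF()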

From Fast Data Processing with Spark 2, Third Edition:

Now let's tie together the modalities with the Spark abstractions and see how we can read and write data. Before 2.0.0, things were conceptually simpler: we only needed to read data into RDDs and use map() to transform the data as required. However, data wrangling was harder. With Dataset/DataFrame, we have the ability to read directly into a table with headings, associate data types with domain semantics, and start working with data more effectively.

As a general rule of thumb, perform the following steps:

Use SparkContext and RDDs to handle unstructured data.

Use SparkSession and Datasets/DataFrames for semi-structured and structured data. 
As you will see in the later chapters, SparkSession has unified reading from various formats, such as .csv, .json, .parquet, .orc, and .text files, as well as JDBC sources. Moreover, there is a pluggable architecture, the DataSource API, for accessing any type of structured data; a short sketch of the unified reader follows this list.
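As a rough sketch of that unified reader (assuming an existing SparkSession named spark; the file paths and JDBC settings below are hypothetical placeholders), each format goes through the same spark.read entry point:

val csvDF     = spark.read.option("header", "true").csv("data/people.csv")
val jsonDF    = spark.read.json("data/people.json")
val parquetDF = spark.read.parquet("data/people.parquet")
val orcDF     = spark.read.orc("data/people.orc")
val textDF    = spark.read.text("data/notes.txt")

// JDBC uses the same reader with format-specific options.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/testdb")  // hypothetical connection
  .option("dbtable", "people")
  .load()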

The Data Interfaces

There are several key interfaces that you should understand when you go to use Spark.

  • The RDD (Resilient Distributed Dataset)

    • Apache Spark's first abstraction was the RDD, or Resilient Distributed Dataset. Essentially, it is an interface to a sequence of data objects, consisting of one or more types, that are distributed across the machines of a cluster. RDDs can be created in a variety of ways and are the "lowest level" API available to the user. While this is the original data structure made available, new users should focus on Datasets, as those are a superset of the current RDD functionality.

  • The DataFrame

    • The DataFrame is a collection of distributed Row types. These provide a flexible interface and are similar in concept to the DataFrames you may be familiar with in Python (pandas) as well as in the R language.

  • The Dataset

    • The Dataset is Apache Spark's newest distributed collection and can be considered a combination of DataFrames and RDDs. It provides the typed interface available in RDDs while offering many of the conveniences of DataFrames. It will be the core abstraction going forward.
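To tie the three interfaces together, here is a minimal sketch (assuming an existing SparkSession named spark and a hypothetical Person record type) that moves the same data through an RDD, a DataFrame, and a Dataset:

import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.rdd.RDD
import spark.implicits._                  // assumes `spark` is an existing SparkSession

case class Person(name: String, age: Int) // hypothetical record type

// RDD: the lowest-level API; no schema, arbitrary functions.
val rdd: RDD[Person] =
  spark.sparkContext.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))

// DataFrame: a distributed collection of Row with a schema; columns by name, no compile-time types.
val df: DataFrame = rdd.toDF()
df.select("name").show()

// Dataset: the typed interface of RDDs plus the conveniences of DataFrames.
val ds: Dataset[Person] = df.as[Person]
ds.map(p => p.name.toUpperCase).show()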
