DataFrames, Datasets, RDDs
Like RDDs, DataFrames and Datasets represent distributed collections, with additional schema information not found in RDDs. This schema information is used to provide a more efficient storage layer (Tungsten) and to let the optimizer (Catalyst) perform additional optimizations. Beyond schema information, the operations performed on Datasets and DataFrames are such that the optimizer can inspect their logical meaning rather than arbitrary functions. DataFrames are Datasets of a special Row object, which doesn't provide any compile-time type checking. The strongly typed Dataset API shines especially for more RDD-like functional operations. Compared to working with RDDs, DataFrames allow Spark's optimizer to better understand our code and our data, which allows for a new class of optimizations.
Datasets were introduced in Spark 1.6, and DataFrames in Spark 1.3.
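As a minimal sketch of what "inspecting the logical meaning" buys us (the SparkSession used here is introduced in the next paragraph, and the data and column names are made up), the plan Catalyst produces for a column-based filter can be inspected with explain():

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("catalyst-demo")     // hypothetical application name
  .master("local[*]")           // assumes a local run
  .getOrCreate()
import spark.implicits._

// A tiny, made-up DataFrame: a Dataset[Row] with a schema (name: string, age: int).
val people = Seq(("Alice", 30), ("Bob", 17)).toDF("name", "age")

// The filter is expressed on a column, not an opaque function, so Catalyst can
// inspect its logical meaning and optimize the plan before execution.
people.filter($"age" > 21).explain()
```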
Much as the SparkContext is the entry point for all Spark applications and the StreamingContext is for all streaming applications, the SparkSession serves as the entry point for Spark SQL (before Spark 2.0, this role was played by the SQLContext).
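A minimal sketch of creating the entry point (the application name and local master are placeholders for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Build (or retrieve) the single entry point for Spark SQL.
val spark = SparkSession.builder()
  .appName("dataframes-datasets-rdds")   // hypothetical application name
  .master("local[*]")                    // assumes a local run; omit on a cluster
  .getOrCreate()

// The older entry points remain reachable from the session:
val sc  = spark.sparkContext     // the SparkContext
val sql = spark.sqlContext       // the pre-2.0 SQLContext, kept for legacy code
```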
Record type information is lost when working with DataFrames as RDDs.
Explanation
DataFrames can be implicitly converted to RDDs of Rows. However, since the Spark SQL Row object is not strongly typed (it can be created from sequences of any value type), the Scala compiler cannot "remember" the type of value used to create the row. Indexing a row will return a value of type Any, which must be cast to a more specific type, such as a String or an Int, to perform most calculations. The type information for the rows is stored in the schema; however, converting to an RDD throws away a DataFrame's schema information, so it is important to bind the DataFrame schema to a variable.
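A sketch of what this looks like in practice, with a made-up two-column DataFrame:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().appName("row-demo").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 30), ("Bob", 17)).toDF("name", "age")

// Bind the schema to a variable before dropping to an RDD;
// the RDD of Rows no longer carries it.
val schema: StructType = people.schema

val rows: RDD[Row] = people.rdd

// Indexing a Row returns Any, so a cast (or a typed getter) is required.
val names: RDD[String] = rows.map(row => row(0).asInstanceOf[String])
val ages: RDD[Int]     = rows.map(row => row.getInt(1))
```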
Record type information is preserved when working with the Dataset API: because Datasets are strongly typed, the values in each row retain their type information even after conversion to an RDD.
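For contrast, a sketch with the Dataset API (the Person case class here is purely illustrative):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for illustration.
case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("dataset-rdd-demo").master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Person("Alice", 30), Person("Bob", 17)).toDS()

// Converting a Dataset back to an RDD keeps the element type.
val people: RDD[Person] = ds.rdd
val names: RDD[String]  = people.map(_.name)   // no casting required
```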
Since Spark 2.0, a DataFrame is a special case of a Dataset: it is a Dataset containing Row objects.
From Fast Data Processing with Spark 2 - Third Edition:
Now let's tie together the modalities with the Spark abstractions and see how we can read and write data. Before 2.0.0, things were conceptually simpler: we only needed to read data into RDDs and use map() to transform the data as required. However, data wrangling was harder. With Dataset/DataFrame, we can read directly into a table with headings, associate data types with domain semantics, and start working with data more effectively.
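A rough sketch of the two styles side by side (the CSV path and column layout are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-demo").master("local[*]").getOrCreate()

// Pre-2.0 style: read raw lines into an RDD and shape them with map().
val rawRdd = spark.sparkContext
  .textFile("data/people.csv")          // hypothetical path
  .map(_.split(","))

// DataFrame style: read straight into a table with headings and inferred types.
val df = spark.read
  .option("header", "true")             // first line holds the column names
  .option("inferSchema", "true")        // associate data types with the columns
  .csv("data/people.csv")

df.printSchema()
```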
As a general rule of thumb, perform the following steps:
There are several key interfaces that you should understand when you go to use Spark.
The RDD (Resilient Distributed Dataset)
Apache Spark's first abstraction was the RDD or Resilient Distributed Dataset. Essentially it is an interface to a sequence of data objects that consist of one or more types that are located across a variety of machines in a cluster. RDDs can be created in a variety of ways and are the "lowest level" API available to the user. While this is the original data structure made available, new users should focus on Datasets as those will be supersets of the current RDD functionality.
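A minimal sketch of creating and transforming an RDD (assuming a local SparkSession):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-demo").master("local[*]").getOrCreate()

// Create an RDD from a local collection and transform it with ordinary functions.
val numbers = spark.sparkContext.parallelize(1 to 10)
val squares = numbers.map(n => n * n)

println(squares.collect().mkString(", "))
```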
The DataFrame
The DataFrame is a distributed collection of Row objects. These provide a flexible interface and are similar in concept to the DataFrames you may be familiar with in Python (pandas) as well as in the R language.
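A small sketch of building a DataFrame from a local collection (the column names are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dataframe-demo").master("local[*]").getOrCreate()
import spark.implicits._

// A DataFrame (a distributed collection of Rows) with named columns.
val df = Seq(("Alice", 30), ("Bob", 17)).toDF("name", "age")

df.select($"name").where($"age" > 21).show()
```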
The Dataset
The Dataset is Apache Spark's newest distributed collection and can be considered a combination of DataFrames and RDDs. It offers the typed interface available in RDDs while providing many of the conveniences of DataFrames. It will be the core abstraction going forward.
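A sketch of the typed interface, assuming an illustrative Person case class; as[Person] attaches a compile-time type to an otherwise untyped DataFrame:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type; any case class with an Encoder works.
case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("dataset-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Start from an untyped DataFrame and attach a type with as[Person].
val df = Seq(("Alice", 30), ("Bob", 17)).toDF("name", "age")
val ds: Dataset[Person] = df.as[Person]

// RDD-like functional operations, checked at compile time and still optimizable.
val adultNames = ds.filter(p => p.age > 21).map(_.name)
adultNames.show()
```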