Datasets

Datasets are an exciting extension of Spark SQL that provides additional compile-time type checking. As of Spark 2.0, a DataFrame is a specialized Dataset that operates on generic Row objects (in Scala, DataFrame is a type alias for Dataset[Row]) and therefore lacks the compile-time type checking that other Datasets offer.
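To make the difference in type checking concrete, here is a minimal Scala sketch. The Panda case class, the object name, and the sample values are hypothetical and not part of the original text; the sketch assumes Spark 2.0 or later running in local mode.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for illustration.
case class Panda(name: String, age: Int)

object TypedVsUntyped {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("typed-vs-untyped")
      .getOrCreate()
    import spark.implicits._

    // A typed Dataset: the element type is known to the Scala compiler.
    val ds: Dataset[Panda] = Seq(Panda("Mei", 3), Panda("Bao", 7)).toDS()

    // The same data as a DataFrame, that is, a Dataset of generic Row objects.
    val df: DataFrame = ds.toDF()

    // Typed access: a misspelled field such as _.agee would not compile.
    val typedAges: Array[Int] = ds.map(_.age).collect()

    // Untyped access: the column name is a string that Spark resolves when it
    // analyzes the query, so df.select("agee") would fail only at runtime.
    val untypedAges: Array[Int] = df.select("age").as[Int].collect()

    println(typedAges.mkString(", "))
    println(untypedAges.mkString(", "))
    spark.stop()
  }
}
```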

Datasets can be used when your data can be encoded for Spark SQL and you know the type information at compile time. A Dataset is a strongly typed collection whose API mixes relational (DataFrame) and functional (RDD) transformations. Like DataFrames, Datasets are represented by a logical plan that the Catalyst optimizer (see “Query Optimizer”) can work with, and when cached, the data is stored in Spark SQL’s internal encoding format.
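The mixture of functional and relational transformations might look like the following sketch. Reading, SensorAvg, and the averages helper are hypothetical names introduced for illustration, and the code again assumes Spark 2.0 or later in local mode.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.avg

// Hypothetical record types, used only for illustration.
case class Reading(sensor: String, value: Double)
case class SensorAvg(sensor: String, avgValue: Double)

object MixedTransformations {
  // Hypothetical helper: average of the non-negative readings per sensor.
  def averages(spark: SparkSession, readings: Dataset[Reading]): Dataset[SensorAvg] = {
    import spark.implicits._
    readings
      .filter(r => r.value >= 0.0)          // functional: a typed Scala lambda
      .groupBy($"sensor")                   // relational: column expressions
      .agg(avg($"value").as("avgValue"))
      .as[SensorAvg]                        // convert the result back to a typed Dataset
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("mixed-transformations")
      .getOrCreate()
    import spark.implicits._

    val readings = Seq(
      Reading("a", 1.0), Reading("a", 3.0),
      Reading("b", -1.0), Reading("b", 4.0)
    ).toDS()

    averages(spark, readings).collect().foreach(println)
    spark.stop()
  }
}
```

The conversion with as[SensorAvg] matches columns by name, so the alias given in agg has to agree with the case class field.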
