Transformations

Every transformation creates a new RDD. One important thing to remember about RDDs is that they are lazily evaluated: when a transformation is called on one, no actual work is done right away. Spark only stores the information about the source RDD and the transformation that has to be applied.

Transformations construct a new RDD from a previous one. Many of them, such as map and filter, are applied within each partition and do not transfer data across the nodes; transformations that regroup data by key, such as groupByKey, are the exception and do shuffle it.

RDDs are lazily evaluated data structures. In short, calling a transformation on an RDD does not trigger any processing by itself.
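The idea can be sketched without a cluster. The snippet below is a minimal pure-Python model of lazy evaluation, not Spark's implementation: LazyDataset, double, and the variable names are hypothetical and exist only for illustration. Its map, filter, and collect methods are meant to mirror the RDD methods of the same names, where map and filter are transformations and collect is an action.

```python
class LazyDataset:
    """Toy stand-in for an RDD: a source plus the transformations recorded so far."""

    def __init__(self, source, steps=()):
        self.source = source   # where the data comes from
        self.steps = steps     # pending transformations, in order

    def map(self, fn):
        # Transformation: returns a NEW dataset and runs nothing.
        return LazyDataset(self.source, self.steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return LazyDataset(self.source, self.steps + (("filter", pred),))

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        out = []
        for row in self.source:
            keep = True
            for kind, fn in self.steps:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                out.append(row)
        return out


calls = []  # records every time real work is done


def double(x):
    calls.append(x)
    return x * 2


numbers = LazyDataset(range(5))
doubled = numbers.map(double)            # recorded, not executed
big = doubled.filter(lambda x: x > 4)    # recorded, not executed
calls_before_action = len(calls)         # still 0: no work has happened
result = big.collect()                   # the action triggers the work
```

Note that numbers itself is never modified: each call returns a new object that remembers its source and its pending steps, which is the same bookkeeping the text describes for RDDs.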

There are a lot of benefits from the fact that transformations are lazily evaluated.

Some of them are that operations can be grouped together, which reduces the network traffic between the nodes processing the data, and that there are no multiple passes over the same data.
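The single-pass behavior can be illustrated with Python generators, used here only as an analogy for how chained steps are pipelined. This is a sketch under that assumption; read, square, keep_even, and trace are hypothetical names, not Spark APIs. Each element flows through every step before the next element is read, so the data is traversed once and no intermediate collection is built.

```python
trace = []  # order in which the steps touch the data


def read(source):
    for x in source:
        trace.append(f"read {x}")
        yield x


def square(rows):
    for x in rows:
        trace.append(f"map {x}")
        yield x * x


def keep_even(rows):
    for x in rows:
        trace.append(f"filter {x}")
        if x % 2 == 0:
            yield x


# Building the pipeline does no work; it only chains the steps together.
pipeline = keep_even(square(read([1, 2, 3])))
trace_before = list(trace)

# Consuming the pipeline runs all steps in one interleaved pass.
result = list(pipeline)
```

After the last line, trace starts with "read 1", "map 1", "filter 1" before any "read 2" entry appears, which shows the steps being applied element by element rather than as three separate passes.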

There is a pitfall associated with all this, however.

Spark performs the calculations only when the user requests an action. If the data between the steps is not cached, Spark will reevaluate the expressions for every subsequent action, because RDDs are not materialized by default. We can instruct Spark to materialize the computed results by calling cache or persist.
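The cost of this pitfall is easy to see in a small model. The following is a pure-Python sketch, not Spark code: Lazy, expensive_parse, and evaluations are hypothetical names. The cache method is modeled on the RDD method of the same name, which only marks the data for reuse and takes effect when the first action runs.

```python
class Lazy:
    """Toy dataset whose contents are recomputed on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Like a transformation, this only sets a flag; nothing is computed yet.
        self._use_cache = True
        return self

    def collect(self):
        # Action: reuse the materialized rows if caching was requested.
        if self._use_cache and self._cached is not None:
            return self._cached
        rows = self._compute()
        if self._use_cache:
            self._cached = rows
        return rows


evaluations = {"count": 0}  # how many times the expensive step ran


def expensive_parse():
    evaluations["count"] += 1
    return [len(word) for word in ["spark", "rdd", "cache"]]


# Two actions without caching: the expensive step runs twice.
uncached = Lazy(expensive_parse)
total = sum(uncached.collect())
longest = max(uncached.collect())
runs_without_cache = evaluations["count"]

# Two actions with caching: the expensive step runs once.
evaluations["count"] = 0
cached = Lazy(expensive_parse).cache()
total_cached = sum(cached.collect())
longest_cached = max(cached.collect())
runs_with_cache = evaluations["count"]
```

Both variants return the same results; the difference is only in how often the underlying computation is repeated, which is why caching pays off when the same intermediate data feeds more than one action.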

