Caching

It’s good practice to cache RDDs that serve as sources for generating many other RDDs.

The cache method is just shorthand for persist with the default storage level, MEMORY_ONLY.
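
A minimal sketch of that equivalence, assuming a local SparkSession built purely for illustration (e.g., in spark-shell):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Local session just for this sketch.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("caching-demo")
  .getOrCreate()

val rdd = spark.sparkContext.parallelize(1 to 1000000)

// cache() is exactly equivalent to persist(StorageLevel.MEMORY_ONLY):
rdd.cache()
// rdd.persist(StorageLevel.MEMORY_ONLY)  // same thing, written out
```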

persist

Method persist lets you choose a different storage level (for example MEMORY_AND_DISK, MEMORY_ONLY_SER, or DISK_ONLY). To get a feel for the relative cost of each medium, imagine a scaled-up clock where a single CPU cycle is the baseline, representing one second.

On that scale, a RAM access takes about six minutes and an access to an SSD two to six days. A rotational disk, by comparison, takes one to twelve months. Remember that fact when persisting your data to the disk.
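
Continuing the sketch above, persist lets you pick the level explicitly; note that a derived RDD gets its own storage level, independent of its parent:

```scala
// MEMORY_AND_DISK keeps the partitions that fit in memory
// and spills the rest to local disk rather than recomputing them.
val squares = rdd.map(x => x.toLong * x)
squares.persist(StorageLevel.MEMORY_AND_DISK)
```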

RDDs are evaluated lazily, so calling cache (or persist) by itself has no immediate effect; the data is actually stored only the first time an action materializes the RDD.
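
In the running sketch, the persist call above only recorded an intent; the first action does the work:

```scala
// No job has run yet: persist/cache merely mark the RDD.
val n = squares.count() // first action: computes the RDD and stores its partitions
val m = squares.count() // second action: read from the persisted copy, no recomputation
```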

If you want to stop caching an RDD in order to free up resources for further processing, simply call the **unpersist** method on it. Unlike cache, this method takes effect immediately; you don’t have to wait for the RDD to be recomputed.
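
Rounding out the sketch (unpersist also accepts an optional blocking flag that controls whether the call waits until all blocks are actually removed):

```scala
// Frees the stored partitions immediately; squares can still be used,
// it will simply be recomputed from its lineage on the next action.
squares.unpersist()

spark.stop()
```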

You can check the effects of the cache and unpersist calls on the Storage tab of Spark’s web UI.