Caching
It’s good practice to cache RDDs that serve as the source for many other RDDs, so that they are not recomputed for each of them.
The cache method is just shorthand for persist with the default storage level, MEMORY_ONLY.
persist
To put storage latencies in perspective, consider a scale on which one CPU cycle is the baseline and represents one second. On that scale, RAM access takes six minutes and access to an SSD takes two to six days. A rotational disk, by comparison, takes one to twelve months. Keep that in mind when persisting your data to disk.
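The arithmetic behind such a scale can be sketched in a few lines of Python. The raw latencies used here (about 0.3 ns per CPU cycle, 120 ns for RAM, 50 to 150 µs for an SSD, 1 to 10 ms for a rotational disk) are assumed, commonly quoted approximations and are not taken from this page; they were chosen because they roughly reproduce the figures above.

```python
# Assumed, approximate latencies in nanoseconds (not taken from the text above).
CPU_CYCLE_NS = 0.3
RAM_NS = 120
SSD_NS = (50_000, 150_000)          # low and high estimate
DISK_NS = (1_000_000, 10_000_000)   # low and high estimate

MINUTE = 60
DAY = 24 * 3600
MONTH = 30 * DAY  # a month is approximated as 30 days


def scaled_seconds(latency_ns):
    """Rescale a latency so that one CPU cycle corresponds to one second."""
    return latency_ns / CPU_CYCLE_NS


ram_minutes = scaled_seconds(RAM_NS) / MINUTE
ssd_days = tuple(scaled_seconds(ns) / DAY for ns in SSD_NS)
disk_months = tuple(scaled_seconds(ns) / MONTH for ns in DISK_NS)

print(f"RAM:  about {ram_minutes:.1f} minutes")
print(f"SSD:  about {ssd_days[0]:.1f} to {ssd_days[1]:.1f} days")
print(f"Disk: about {disk_months[0]:.1f} to {disk_months[1]:.1f} months")
```

Under these assumptions the gap between memory and disk spans several orders of magnitude, which is why a disk-backed storage level is best reserved for data that is expensive to recompute.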
RDDs are evaluated lazily, so just calling cache has no immediate effect: the data is stored only when an action first computes the RDD.
If you want to stop caching an RDD to free up resources for further processing, simply call the **unpersist method** on it. Unlike cache, this method takes effect immediately; you don’t have to wait until the RDD is recomputed.
You can check the effects of calling the cache and unpersist methods in Spark’s web UI, on the Storage tab.