Persistence

Extra: persistence

explicitly store RDDs in memory, on disk, or off-heap (`cache()`, `persist()`)
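
A minimal sketch of setting storage levels, assuming an existing `SparkContext` named `sc`; the input path and RDD names are placeholders:

```scala
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///data/input.txt")

// Equivalent to lines.cache(), which is shorthand for MEMORY_ONLY.
lines.persist(StorageLevel.MEMORY_ONLY)

// An RDD's storage level can only be set once; other levels would go
// on their own RDDs, e.g.:
//   rdd.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk when memory fills
//   rdd.persist(StorageLevel.OFF_HEAP)        // store serialized data off the JVM heap

lines.count() // the first action computes the RDD and caches it
```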

checkpointing for truncating RDD lineage
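
Checkpointing might look like the following sketch; the checkpoint directory, iteration count, and update formula are all illustrative. Checkpointing saves the RDD to reliable storage and truncates its lineage, which keeps long iterative computations from accumulating huge dependency chains:

```scala
sc.setCheckpointDir("hdfs:///checkpoints") // placeholder directory

var ranks = sc.parallelize(1 to 100000).map(id => (id, 1.0))
for (i <- 1 to 30) {
  ranks = ranks.mapValues(r => 0.15 + 0.85 * r)
  if (i % 10 == 0) {
    ranks.checkpoint() // mark for checkpointing before any action runs
    ranks.count()      // force a job so the checkpoint is actually written
  }
}
```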

The behavior of not persisting by default may again seem unusual, but it makes a lot of sense for big datasets: if you will not reuse the RDD, there’s no reason to waste storage space when Spark could instead stream through the data once and just compute the result.

In practice, you will often use persist() to load a subset of your data into memory and query it repeatedly.
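
A sketch of that pattern, with a placeholder log path and illustrative filters: the full file is read once, and every later query runs against the in-memory subset.

```scala
import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("hdfs:///logs/app.log")
  .filter(line => line.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_ONLY)

errors.count()                                          // computes and caches
errors.filter(_.contains("timeout")).count()            // served from memory
errors.filter(_.contains("OutOfMemoryError")).take(10)  // served from memory
```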

If you attempt to cache more data than fits in memory, Spark will automatically evict old partitions using a Least Recently Used (LRU) cache policy. For the memory-only storage levels, it will recompute these partitions the next time they are accessed, while for the memory-and-disk ones, it will write them out to disk. In either case, this means that you don't have to worry about your job breaking if you ask Spark to cache too much data. However, caching unnecessary data can evict useful data and force more recomputation.
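
A sketch of that trade-off, with placeholder paths: if memory pressure evicts a MEMORY_ONLY partition, Spark recomputes it from lineage on the next access; with MEMORY_AND_DISK, the partition is written to disk instead and simply read back.

```scala
import org.apache.spark.storage.StorageLevel

val cheap = sc.textFile("hdfs:///data/raw.txt")
cheap.persist(StorageLevel.MEMORY_ONLY) // recomputed if evicted

val expensive = sc.textFile("hdfs:///data/raw.txt")
  .map(line => (line.split(",")(0), line)) // costly to recompute
expensive.persist(StorageLevel.MEMORY_AND_DISK) // spilled to disk instead
```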
