Caching

It’s good practice to cache RDDs that serve as sources for generating many other RDDs.

The cache method is just shorthand for persist with the default storage level, MEMORY_ONLY.
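
A minimal sketch of that equivalence, assuming a local SparkSession built purely for illustration (e.g., in spark-shell):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Local session just for this sketch.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("caching-demo")
  .getOrCreate()

val rdd = spark.sparkContext.parallelize(1 to 1000000)

// cache() is exactly equivalent to persist(StorageLevel.MEMORY_ONLY):
rdd.cache()
// rdd.persist(StorageLevel.MEMORY_ONLY)  // same thing, written out
```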

persist

Method persist lets you choose a different storage level (for example MEMORY_AND_DISK, MEMORY_ONLY_SER, or DISK_ONLY). To get a feel for the relative cost of each medium, imagine a scaled-up clock where a single CPU cycle is the baseline, representing one second.

On that scale, a RAM access takes about six minutes and an access to an SSD two to six days. A rotational disk, by comparison, takes one to twelve months. Remember that fact when persisting your data to the disk.
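
Continuing the sketch above, persist lets you pick the level explicitly; note that a derived RDD gets its own storage level, independent of its parent:

```scala
// MEMORY_AND_DISK keeps the partitions that fit in memory
// and spills the rest to local disk rather than recomputing them.
val squares = rdd.map(x => x.toLong * x)
squares.persist(StorageLevel.MEMORY_AND_DISK)
```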

RDDs are evaluated lazily, so calling cache (or persist) by itself has no immediate effect; the data is actually stored only the first time an action materializes the RDD.
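
In the running sketch, the persist call above only recorded an intent; the first action does the work:

```scala
// No job has run yet: persist/cache merely mark the RDD.
val n = squares.count() // first action: computes the RDD and stores its partitions
val m = squares.count() // second action: read from the persisted copy, no recomputation
```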

If you want to stop caching an RDD in order to free up resources for further processing, simply call the **unpersist** method on it. Unlike cache, this method takes effect immediately; you don’t have to wait for the RDD to be recomputed.
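
Rounding out the sketch (unpersist also accepts an optional blocking flag that controls whether the call waits until all blocks are actually removed):

```scala
// Frees the stored partitions immediately; squares can still be used,
// it will simply be recomputed from its lineage on the next action.
squares.unpersist()

spark.stop()
```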

You can check the effects of the cache and unpersist calls on the Storage tab of Spark’s web UI.