Resilient Distributed Dataset (RDD)

An RDD is an immutable, distributed collection of elements that can be operated on in parallel. It is the basic data abstraction on which Spark is built. RDDs support a variety of transformations, each of which produces a new RDD from an existing one.
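A minimal sketch of creating and transforming an RDD in Scala, assuming a local Spark session; the application name and master setting are illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Start a local Spark session (illustrative settings).
val spark = SparkSession.builder()
  .appName("rdd-basics")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

val numbers = sc.parallelize(1 to 10)       // distribute a local collection
val squares = numbers.map(n => n * n)       // transformation: builds a new RDD
println(squares.collect().mkString(", "))   // action: gathers results to the driver
```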

Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala object, including user-defined classes.
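A short sketch of explicit partitioning and a user-defined class carried inside an RDD, reusing the SparkContext `sc` from the sketch above; the Sensor class and the partition count are illustrative:

```scala
// A user-defined class stored in an RDD (illustrative).
case class Sensor(id: String, reading: Double)

val readings = Seq(Sensor("a", 1.2), Sensor("b", 3.4), Sensor("c", 5.6))

// Ask for 3 partitions; each partition may be computed on a different node.
val sensorRdd = sc.parallelize(readings, numSlices = 3)

println(sensorRdd.getNumPartitions)                  // => 3
println(sensorRdd.filter(_.reading > 2.0).count())   // => 2
```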

One important thing to remember about RDDs is that they are lazily evaluated: when a transformation is called on an RDD, no actual work is done right away. Spark only records the RDD's source and the transformation to be applied, building up a lineage graph (DAG). The toDebugString method can be called on an RDD to get a text representation of this DAG.
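A sketch of lazy evaluation and toDebugString, again reusing `sc` from the first sketch; the input file name is illustrative:

```scala
// Nothing is read or computed while the transformations below are declared.
val lines  = sc.textFile("data.txt")            // illustrative input path
val words  = lines.flatMap(_.split("\\s+"))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

println(counts.toDebugString)   // text representation of the lineage (DAG)
println(counts.count())         // action: the whole chain executes only now
```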
