RDD lineage

Spark keeps track of each RDD's lineage: that is, the sequence of transformations that resulted in the RDD. This lineage is used to recover from process failure. Each RDD has a parent RDD and/or a child RDD, and Spark creates a DAG (directed acyclic graph) of the dependencies between RDDs. RDDs are processed in stages, which are sets of transformations. Dependencies between RDDs and between stages can be either narrow or wide.
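The idea of recomputing a lost RDD by replaying its transformation chain can be sketched in plain Python. This is an illustrative model only, not the Spark API: the `SketchRDD` class and its methods are invented here to show how recording each RDD's parent and transformation makes recovery possible.

```python
# A minimal sketch of lineage-based recovery (plain Python, not real Spark):
# each "RDD" records its parent and the transformation that produced it, so
# unmaterialized (or lost) data can be recomputed by replaying the chain.

class SketchRDD:
    def __init__(self, data=None, parent=None, transform=None):
        self.data = data            # materialized elements, or None
        self.parent = parent        # parent RDD in the lineage chain
        self.transform = transform  # function applied to the parent's output

    def map(self, f):
        return SketchRDD(parent=self, transform=lambda xs: [f(x) for x in xs])

    def filter(self, pred):
        return SketchRDD(parent=self, transform=lambda xs: [x for x in xs if pred(x)])

    def collect(self):
        # Recompute from lineage if this RDD's data is not materialized.
        if self.data is None:
            self.data = self.transform(self.parent.collect())
        return self.data

source = SketchRDD(data=[1, 2, 3, 4])
result = source.map(lambda x: x * 10).filter(lambda x: x > 15)
print(result.collect())  # prints [20, 30, 40], computed on demand from the chain
```

Nothing is computed until `collect()` is called; each step only remembers where it came from and what to do, which mirrors how lineage lets Spark rebuild a lost partition without checkpointing every intermediate result.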

Narrow dependencies, or narrow operations, are characterized by the following traits:

  • Operations can be collapsed into a single stage; for instance, a map() and filter() operation against elements in the same dataset can be processed in a single pass of each element in the dataset.

  • Only one child RDD depends on the parent RDD; for instance, an RDD is created from a text file (the parent RDD), with one child RDD to perform the set of transformations in one stage.

  • No shuffling of data between nodes is required.

Narrow operations are preferred because they maximize parallel execution and minimize shuffling, which can be a bottleneck and is quite expensive.
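The first trait above, collapsing a map() and filter() into a single pass, can be shown with a small plain-Python sketch (not the Spark API), with hypothetical data chosen for illustration:

```python
# Illustrative sketch: a map() followed by a filter() is narrow, so both
# can be applied in one pass over each element instead of materializing an
# intermediate dataset between the two operations.

data = [1, 2, 3, 4, 5, 6]

# Two passes, conceptually two separate steps:
mapped = [x * x for x in data]
two_pass = [x for x in mapped if x % 2 == 0]

# Collapsed into one pass, as a scheduler pipelining narrow operations would:
one_pass = [y for x in data if (y := x * x) % 2 == 0]

print(two_pass == one_pass)  # prints True: same result, one traversal
```

Because neither step needs data from any other partition, the two operations pipeline into one stage with no intermediate dataset and no network traffic.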

Wide dependencies, or wide operations, in contrast, have the following traits:

  • Operations define a new stage and often require a shuffle operation.

  • RDDs have multiple dependencies; for instance, join() produces an RDD that depends on two or more parent RDDs.

Wide operations are unavoidable when grouping, reducing, or joining datasets, but you should be aware of the impacts and overheads involved with these operations.
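Why grouping forces a shuffle can be sketched in plain Python (again, not the Spark API; the records and partition count are made up for illustration):

```python
# Illustrative sketch of a groupBy-style wide operation: records must be
# shuffled so that all values for a given key land in the same partition
# before grouping can happen locally, which is why wide operations define
# a stage boundary.

from collections import defaultdict

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
num_partitions = 2

# Shuffle phase: route each record to a partition by hashing its key.
partitions = defaultdict(list)
for key, value in records:
    partitions[hash(key) % num_partitions].append((key, value))

# Grouping phase: within each partition, all values for a key are now local.
grouped = {}
for part in partitions.values():
    for key, value in part:
        grouped.setdefault(key, []).append(value)

print(grouped)  # e.g. {'a': [1, 3], 'b': [2, 5], 'c': [4]}
```

The shuffle phase is the expensive part: in a real cluster it moves records between nodes over the network, whereas the grouping phase afterward is purely local to each partition.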

Lineage can be visualized using the DAG Visualization link on the Job or Stage detail pages in the Spark application UI.
