spark notes


RDD lineage


An RDD's lineage is the sequence of transformations that produced it. Spark plans, tracks, and manages this sequence for every RDD, and uses it to recover from process failures by recomputing lost partitions. Each RDD has a parent RDD and/or a child RDD, and Spark records these dependencies in a DAG (directed acyclic graph). RDDs are processed in stages, which are sets of transformations, and the dependencies between RDDs and stages can be narrow or wide.

Narrow dependencies, or narrow operations, are characterized by the following traits:

  • Operations can be collapsed into a single stage; for instance, a map() and filter() operation against elements in the same dataset can be processed in a single pass of each element in the dataset.

  • Only one child RDD depends on the parent RDD; for instance, an RDD is created from a text file (the parent RDD), with one child RDD to perform the set of transformations in one stage.

  • No shuffling of data between nodes is required.

Narrow operations are preferred because they maximize parallel execution and minimize shuffling, which can be a bottleneck and is quite expensive.

Wide dependencies, or wide operations, in contrast, have the following traits:

  • Operations define a new stage and often require a shuffle operation.

  • RDDs have multiple dependencies; for instance, join() produces an RDD that depends on two or more parent RDDs.

Wide operations are unavoidable when grouping, reducing, or joining datasets, but you should be aware of the impacts and overheads involved with these operations.

Lineage can be visualized using the DAG Visualization link on the job or stage detail pages in the Spark UI.


