Dependency Resolution

Any data processing workflow can be described as reading from a data source, applying a set of transformations, and materializing the result in some way. Transformations create dependencies between RDDs, and below we look at the different types of them.
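
As a minimal sketch of that read → transform → materialize pattern (run in spark-shell, where `sc` is already provided; the paths are hypothetical):

```scala
// In spark-shell, `sc` (a SparkContext) is already available.
// Input and output paths are placeholders for illustration.
val lines  = sc.textFile("hdfs:///data/events.log")   // read the data source
val errors = lines.filter(_.contains("ERROR"))        // apply a transformation
errors.saveAsTextFile("hdfs:///data/errors-only")     // materialize the result
```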

So what is a Spark dependency? A dependency is created when a transformation is executed and one RDD is produced from another RDD. The question is: does the transformation result in a shuffle? Recall the data shuffle discussed in the previous lecture, where data has to be moved between Spark nodes to execute the transformation. If the transformation creates a shuffle, it is called a wide dependency.

Remember that wide dependencies cause data to flow between worker nodes. With big data, this flow is large, time-consuming, and a hog on resources, so we of course want to minimize it. So which transformations do not create a shuffle? Some of the most common are map, filter, flatMap, and mapPartitions. Which transformations do create a shuffle? distinct, groupByKey, reduceByKey, and join are some examples of transformations that create a shuffle.
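
A small sketch of the distinction (a hypothetical word count in spark-shell): the map and filter steps stay narrow and run within their partitions, while reduceByKey forces a shuffle and therefore a wide dependency.

```scala
val words = sc.parallelize(Seq("spark", "rdd", "spark", "shuffle"), 4)

// Narrow: each output partition depends on exactly one parent partition.
val pairs = words.map(w => (w, 1)).filter { case (w, _) => w.nonEmpty }

// Wide: all values for a key must be brought together, so data is shuffled.
val counts = pairs.reduceByKey(_ + _)

counts.collect().foreach(println)
```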

So how do we optimize around these dependencies? First, chain as many narrow transformations together in your code as possible before resorting to a wide one. This keeps the processing within each node, which is very fast and makes the best use of Spark. When you do have to use wide dependencies, minimize their number and try to group such operations into a single wide transformation.

That way they are executed at once, and Spark can apply some internal optimizations to them.
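
As a rough illustration of that advice, using the hypothetical `pairs` RDD of (word, count) tuples from the sketch above: pushing narrow work like filter before the shuffle and preferring reduceByKey over groupByKey reduces how much data crosses the network.

```scala
// Less efficient: groupByKey ships every (key, value) pair across the network
// before the values are summed on the reducer side.
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// Better: narrow steps first, then a single wide step. reduceByKey combines
// values within each partition (map-side) before shuffling, so less data moves.
val viaReduce = pairs
  .filter { case (_, v) => v > 0 }   // narrow: evaluated partition-locally
  .reduceByKey(_ + _)                // wide: one shuffle
```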

The dependencies are usually classified as "narrow" and "wide":

Narrow Dependency Resolution

Narrow (pipelineable)

  • each partition of the parent RDD is used by at most one partition of the child RDD
  • allows for pipelined execution on one cluster node
  • failure recovery is more efficient, as only the lost parent partitions need to be recomputed
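
One way to see this is to inspect an RDD's dependencies directly (a spark-shell sketch with arbitrary numbers): a map produces a OneToOneDependency, i.e. a narrow dependency.

```scala
val mapped = sc.parallelize(1 to 100, 4).map(_ * 2)

// Each child partition depends on exactly one parent partition.
println(mapped.dependencies)
// e.g. List(org.apache.spark.OneToOneDependency@...)
```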

Wide Dependency Resolution

Wide (shuffle)

  • multiple child partitions may depend on one parent partition
  • requires data from all parent partitions to be available and shuffled across the nodes
  • if some partition is lost, a complete recomputation of all the ancestor partitions is needed
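
The same inspection on a shuffled RDD (continuing the spark-shell sketch) shows a ShuffleDependency instead:

```scala
val reduced = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .reduceByKey(_ + _)

// Child partitions depend on data scattered across all parent partitions.
println(reduced.dependencies)
// e.g. List(org.apache.spark.ShuffleDependency@...)
```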

How do we determine whether a transformation is wide or narrow?

Wide transformations are those that require a shuffle, while narrow transformations are those that do not.
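
When in doubt, toDebugString prints an RDD's lineage with stages split at shuffle boundaries, so a wide transformation is easy to spot (continuing the hypothetical word-count sketch; the output below is only indicative):

```scala
println(counts.toDebugString)
// An indentation change marks a shuffle boundary, roughly:
// (4) ShuffledRDD[3] at reduceByKey ...
//  +-(4) MapPartitionsRDD[2] at filter ...
//     |  MapPartitionsRDD[1] at map ...
//     |  ParallelCollectionRDD[0] at parallelize ...
```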