Dependency Resolution
Any data processing workflow can be described as reading a data source, applying a set of transformations, and materializing the result in different ways. Transformations create dependencies between RDDs, and these dependencies come in different types.
So what is a Spark dependency? A Spark dependency arises when a transformation is executed and one RDD is created from another RDD. The question is, does the transformation result in a shuffle? Remember the data shuffle we discussed in the previous lecture, where data needs to be moved between Spark nodes to execute the transformation? If the transformation creates a shuffle, it's called a wide dependency.
Remember that wide dependencies cause data to flow between worker nodes. In big data, this flow is big, time-consuming, and a hog on resources. We of course want to minimize such flows. So which transformations do not create a shuffle? Some of the most popular ones are map, filter, flatMap, and mapPartitions. Which transformations do create a shuffle? distinct, groupByKey, reduceByKey, and join are some examples of transformations that create a shuffle.
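To make the difference concrete, here is a minimal sketch in plain Python. It is a toy model and not the Spark API: an "RDD" is just a list of partitions, and the function names `narrow_map`, `narrow_filter`, and `wide_reduce_by_key` are made up for illustration. The partitioner `key % n` is also an assumption that only works for integer keys. The point is that the narrow operations read one input partition per output partition, while the reduce-by-key step has to route records to other partitions.

```python
from functools import reduce

# Toy model, not the Spark API: an "RDD" is a list of partitions,
# and each partition is a list of records.

def narrow_map(partitions, fn):
    # Output partition i reads only input partition i, so nothing moves.
    return [[fn(rec) for rec in part] for part in partitions]

def narrow_filter(partitions, pred):
    # Same property as narrow_map: each partition is handled on its own.
    return [[rec for rec in part if pred(rec)] for part in partitions]

def wide_reduce_by_key(partitions, fn):
    # Every (key, value) record is routed to partition key % n
    # (a stand-in partitioner that assumes integer keys).
    n = len(partitions)
    buckets = [{} for _ in range(n)]
    moved = 0
    for src, part in enumerate(partitions):
        for key, value in part:
            dst = key % n
            if dst != src:
                moved += 1  # this record would cross the network
            buckets[dst].setdefault(key, []).append(value)
    reduced = [sorted((k, reduce(fn, vs)) for k, vs in b.items()) for b in buckets]
    return reduced, moved

numbers = [[1, 2, 3, 4], [5, 6, 7, 8]]
odds = narrow_filter(numbers, lambda x: x % 2 == 1)
keyed = narrow_map(numbers, lambda x: (x % 3, x))
sums, moved = wide_reduce_by_key(keyed, lambda a, b: a + b)

print(odds)
print(sums)
print("records moved between partitions:", moved)
```

In this model the map and filter steps never move a record, while the reduce-by-key step moves 5 of the 8 records to a different partition.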
So how do we optimize around these dependencies? First, do as many narrow dependencies together in your code as possible before having to use a wide dependency. This keeps the processing within the node, so no data has to cross the network, and it will be very fast. It makes the most of Spark. When you have to use wide dependencies, minimize their use and try to group as many such operations together in a single wide transformation function. This way they get executed at once, and Spark will be able to apply some internal optimizations to them.
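The effect of doing narrow work first can be sketched with a small plain-Python model. Again, this is not Spark code: `shuffle_by_key` and `narrow_filter` are hypothetical helpers, the partitioner `key % num_partitions` assumes integer keys, and the record counts only illustrate the idea. The sketch runs the same job twice, once filtering after the shuffle and once before it, and counts how many records change partition.

```python
# Toy model, not the Spark API: partitions are lists of (key, value) records.

def narrow_filter(partitions, pred):
    # Narrow step: each partition is filtered in place, nothing moves.
    return [[rec for rec in part if pred(rec)] for part in partitions]

def shuffle_by_key(partitions, num_partitions):
    # Wide step: route each record to partition key % num_partitions
    # and count the records that land on a different partition.
    out = [[] for _ in range(num_partitions)]
    moved = 0
    for src, part in enumerate(partitions):
        for key, value in part:
            dst = key % num_partitions
            if dst != src:
                moved += 1
            out[dst].append((key, value))
    return out, moved

data = [
    [(k, k * 10) for k in range(0, 6)],
    [(k, k * 10) for k in range(6, 12)],
]
keep = lambda rec: rec[1] >= 80

# Plan A: shuffle everything, then filter.
shuffled, moved_late = shuffle_by_key(data, 2)
result_late = narrow_filter(shuffled, keep)

# Plan B: filter first (narrow), then shuffle only what is left.
result_early, moved_early = shuffle_by_key(narrow_filter(data, keep), 2)

print("filter after shuffle: ", moved_late, "records moved")
print("filter before shuffle:", moved_early, "records moved")
```

Both plans produce the same result, but in this model filtering first moves 2 records instead of 6.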
The dependencies are usually classified as "narrow" and "wide":
Narrow Dependency Resolution
Narrow (pipelineable)
Each partition of the parent RDD is used by at most one partition of the child RDD. Narrow dependencies allow for pipelined execution on one cluster node. Failure recovery is more efficient, as only the lost parent partitions need to be recomputed.
Wide Dependency Resolution
Wide (shuffle)
Multiple child partitions may depend on one parent partition. Wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes. If a failure loses some partition from all the ancestors, a complete recomputation is needed.
In short, wide transformations are those that require a shuffle, while narrow transformations are those that do not.
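The two definitions can be turned into a small check. The following is a sketch in plain Python, not part of Spark: `classify_dependency` is a hypothetical helper that takes a mapping from each child partition to the parent partitions it reads and applies the rule stated above (a dependency is narrow if every parent partition is used by at most one child partition). The three example mappings are invented to resemble a map, a merge of partitions, and a shuffle.

```python
from collections import Counter

def classify_dependency(child_to_parents):
    # Count how many child partitions read each parent partition.
    usage = Counter(
        parent
        for parents in child_to_parents.values()
        for parent in set(parents)
    )
    # Narrow: every parent partition feeds at most one child partition.
    return "narrow" if all(count <= 1 for count in usage.values()) else "wide"

# One-to-one, as with a map or filter.
map_like = {0: [0], 1: [1], 2: [2]}
# Several parents merged into one child; each parent is still used only once.
merge_like = {0: [0, 1], 1: [2, 3]}
# Every child reads every parent, as with a shuffle.
shuffle_like = {0: [0, 1], 1: [0, 1]}

for name, deps in [("map_like", map_like),
                   ("merge_like", merge_like),
                   ("shuffle_like", shuffle_like)]:
    print(name, "->", classify_dependency(deps))
```

Note that the merge example is still narrow under this definition: a child may read several parents, as long as no parent partition is shared between children.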