Spark Execution Flow
High Performance Spark, 1st Edition
What happens when we start a SparkContext?
First, the driver program contacts the cluster manager to request resources for the application.
The cluster manager launches a number of Spark executors (JVMs shown as black boxes in the diagram) on the worker nodes of the cluster (shown as blue circles).
One node can have multiple Spark executors, but an executor cannot span multiple nodes.
An RDD will be evaluated across the executors in partitions (shown as red rectangles).
Each executor can have multiple partitions, but a partition cannot be spread across multiple executors.
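As a rough sketch of this setup, the snippet below starts a SparkSession (the driver program), which asks the cluster manager for a fixed number of executor JVMs and then evaluates an RDD across them in partitions. The master, resource settings, and data are illustrative assumptions (a YARN cluster manager is assumed here), not values from the text.

```scala
import org.apache.spark.sql.SparkSession

object ExecutorLayoutSketch {
  def main(args: Array[String]): Unit = {
    // Starting a SparkSession creates the driver program, which asks the
    // cluster manager for executors. The master and resource settings below
    // are illustrative placeholders (assuming a YARN cluster manager).
    val spark = SparkSession.builder()
      .appName("executor-layout-sketch")
      .master("yarn")
      .config("spark.executor.instances", "4") // request 4 executor JVMs
      .config("spark.executor.cores", "2")     // 2 cores per executor
      .config("spark.executor.memory", "4g")   // 4 GB heap per executor
      .getOrCreate()

    val sc = spark.sparkContext

    // An RDD is evaluated across those executors in partitions; here we ask
    // for 8 partitions, so each executor may hold several partitions, but no
    // single partition is split across executors.
    val rdd = sc.parallelize(1 to 1000, numSlices = 8)
    println(s"Partitions: ${rdd.getNumPartitions}")

    spark.stop()
  }
}
```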
In the Spark lazy evaluation paradigm, a Spark application doesn't "do anything" until the driver program calls an action. With each action, the Spark scheduler builds an execution graph and launches a Spark job. Each job consists of stages, which are steps in the transformation of the data needed to materialize the final RDD. Each stage consists of a collection of tasks that represent each parallel computation and are performed on the executors.
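To make the lazy evaluation concrete, here is a minimal sketch (assuming a local master and made-up numeric data purely for illustration): the transformations only record how the data should be computed, and nothing runs on the executors until the action at the end is called.

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch of lazy evaluation; the local master and data are assumptions.
val spark = SparkSession.builder()
  .appName("lazy-evaluation-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// These transformations only build up the execution graph;
// no work is sent to the executors yet.
val numbers = sc.parallelize(1 to 1000000)
val evens   = numbers.filter(_ % 2 == 0)
val squared = evens.map(n => n.toLong * n)

// Only this action makes the scheduler build the execution graph,
// launch a job, and run its stages as tasks on the executors.
val result = squared.reduce(_ + _)
println(s"Sum of squared evens: $result")

spark.stop()
```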
The components of a Spark application form a tree that corresponds to the API calls. An application corresponds to starting a SparkContext/SparkSession. Each application may contain many jobs, each corresponding to one RDD action. Each job may contain several stages, each corresponding to a wide transformation. Each stage is composed of one or many tasks that correspond to a parallelizable unit of computation done in that stage; there is one task for each partition in the resulting RDD of that stage.
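This hierarchy can be seen in a small sketch like the one below (a local master and toy data are assumptions for illustration): the application is the SparkSession, each action launches a job, the wide reduceByKey introduces a stage boundary, and each stage runs one task per partition.

```scala
import org.apache.spark.sql.SparkSession

// A sketch of the application -> job -> stage -> task hierarchy;
// the local master and the data below are made up for illustration.
val spark = SparkSession.builder()
  .appName("job-stage-task-sketch")
  .master("local[4]")
  .getOrCreate()
val sc = spark.sparkContext

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)

// Narrow transformation: stays within the same stage.
val doubled = pairs.mapValues(_ * 2)

// Wide transformation: forces a shuffle, so a job that uses it has two stages.
val summed = doubled.reduceByKey(_ + _)

// Each action launches one job, so this application runs two jobs.
summed.count()   // job 1: two stages (map side + reduce side), one task per partition
summed.collect() // job 2: the shuffle output can be reused, so its first stage may be skipped

spark.stop()
```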