Spark Execution Flow


High Performance Spark, 1st Edition

What happens when we start a SparkContext?


First, the driver program pings the cluster manager.

The cluster manager launches a number of Spark executors (JVMs shown as black boxes in the diagram) on the worker nodes of the cluster (shown as blue circles).

One node can have multiple Spark executors, but an executor cannot span multiple nodes.
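As a rough sketch of this startup step, the snippet below builds a SparkSession that asks the cluster manager for a handful of executors. The master URL, executor count, cores, and memory values are illustrative assumptions, not settings taken from the text above.

```scala
import org.apache.spark.sql.SparkSession

// Building a SparkSession (which wraps a SparkContext) registers the application
// with the cluster manager, which then launches executor JVMs on the worker nodes.
val spark = SparkSession.builder()
  .appName("execution-flow-demo")           // illustrative application name
  .master("spark://master-host:7077")       // hypothetical standalone master URL
  .config("spark.executor.instances", "4")  // ask for 4 executor JVMs (illustrative)
  .config("spark.executor.cores", "2")      // 2 cores per executor (illustrative)
  .config("spark.executor.memory", "2g")    // 2 GB heap per executor JVM (illustrative)
  .getOrCreate()

val sc = spark.sparkContext                 // the underlying SparkContext
println(s"started application ${sc.applicationId}")
```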

An RDD will be evaluated across the executors in partitions (shown as red rectangles).

Each executor can have multiple partitions, but a partition cannot be spread across multiple executors.
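Continuing with the hypothetical `spark` session above, one way to see this partitioning is to create an RDD with an explicit partition count and look at how its elements are grouped; the numbers are purely illustrative.

```scala
// Create an RDD with an explicit number of partitions.
val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)

println(rdd.getNumPartitions)  // 8 -- each partition is processed by exactly one executor

// Count how many elements landed in each partition.
rdd.mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
   .collect()
   .foreach { case (idx, n) => println(s"partition $idx holds $n elements") }
```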

In the Spark lazy evaluation paradigm, a Spark application doesn’t “do anything” until the driver program calls an action. With each action, the Spark scheduler builds an execution graph and launches a Spark job. Each job consists of _stages_, which are steps in the transformation of the data needed to materialize the final RDD. Each stage consists of a collection of _tasks_ that represent each parallel computation and are performed on the executors.
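A minimal sketch of this lazy behavior, again assuming the hypothetical session above and an input path that is purely illustrative: the transformations return immediately, and only the action launches a job.

```scala
// Transformations only record lineage; nothing runs on the cluster yet.
val words = spark.sparkContext
  .textFile("hdfs:///tmp/input.txt")        // hypothetical input path
  .flatMap(_.split("\\s+"))
val longWords = words.filter(_.length > 5)  // still no job launched

// The action triggers the scheduler: it builds the execution graph,
// splits it into stages, and runs the stages' tasks on the executors.
println(longWords.count())
```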

The figure below (Figure 2-5) shows a tree of the different components of a Spark application and how they correspond to the API calls. An application corresponds to starting a SparkContext/SparkSession. Each application may contain many jobs, each corresponding to one RDD action. Each job may contain several stages, each corresponding to a wide transformation. Each stage is composed of one or more tasks, each a parallelizable unit of computation performed on the executors; there is one task for each partition in the resulting RDD of that stage.

Figure 2-5: tree of the components of a Spark application (see https://mapr.com/blog/getting-started-spark-web-ui/)
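To make the application → job → stage → task hierarchy concrete, here is a hedged sketch (still using the hypothetical session above) in which a wide transformation, reduceByKey, forces a shuffle; toDebugString prints the lineage with its stage boundary, and the single collect action launches one job with two stages.

```scala
// reduceByKey is a wide transformation: it forces a shuffle and therefore a new stage.
val pairs  = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
val counts = pairs.reduceByKey(_ + _)

// The lineage shows the shuffle (stage) boundary as an indentation change.
println(counts.toDebugString)

// One action => one job; here the job has two stages, and each stage runs
// one task per partition of the RDD it produces.
counts.collect().foreach(println)
```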