spark notes
  • Introduction
  • Databricks
  • Concepts
  • Spark Execution Flow
    • SparkContext and SparkSession
  • Resilient Distributed Dataset (RDD)
    • Caching
    • Pair RDDs
    • Transformations
      • Dependency Resolution
    • Actions
    • Persistence
    • RDD lineage
    • Types of RDDs
    • Loading Data into RDDs
    • Data Locality with RDDs
    • How Many Partitions Does An RDD Have
  • Spark job submission breakdown
  • Why Cluster Manager
  • SparkContext and its components
  • Spark Architecture
    • Stages
    • Tasks
    • Executors
    • RDD
    • DAG
    • Jobs
    • Partitions
  • Spark Deployment Modes
  • Running Modes
  • Spark Execution Flow
  • DataFrames, Datasets, RDDs
  • SparkSQL
    • Architecture
    • Spark Session
  • Where MapReduce Does Not Fit
  • Actions
    • reduceByKey
    • count
    • collect, take, top, and first Actions
    • take
    • top
    • first
    • The reduce and fold Actions
  • DataSets
  • Spark Application Garbage Collector
  • How MapReduce Works in Spark
  • Notes
  • Scala
  • Spark 2.0
  • Types of RDDs
    • MapPartitionsRDD
  • Spark UI
  • Optimization
    • Tungsten
  • Spark Streaming
    • Notes
    • Flow
  • FlatMap - Different Variations
  • Examples
  • Testing Spark
  • Passing functions to Spark
  • Configuration, Monitoring, and Tuning
  • References
Spark Architecture

When you execute something on a cluster, the processing of your job is split up into stages, and each stage is split into tasks. Each task is scheduled separately. You can think of each executor JVM as a pool of task execution slots: each executor gives you spark.executor.cores / spark.task.cpus execution slots for your tasks, and you have spark.executor.instances executors in total.
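As a quick sketch of that arithmetic (the object name, the fallback defaults, and the concrete values below are illustrative assumptions, not part of the original notes), the slot count could be derived from a SparkConf like this:

    import org.apache.spark.SparkConf

    // Illustrative sketch: compute the total number of task execution slots
    // from the three settings discussed above. The fallback values passed to
    // getInt are assumptions for this sketch, not Spark's documented defaults.
    object SlotMath {
      def totalTaskSlots(conf: SparkConf): Int = {
        val coresPerExecutor = conf.getInt("spark.executor.cores", 1)
        val cpusPerTask      = conf.getInt("spark.task.cpus", 1)
        val numExecutors     = conf.getInt("spark.executor.instances", 1)
        (coresPerExecutor / cpusPerTask) * numExecutors
      }

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .set("spark.executor.cores", "12")
          .set("spark.task.cpus", "1")
          .set("spark.executor.instances", "24")
        println(s"Total task slots: ${totalTaskSlots(conf)}") // 12 / 1 * 24 = 288
      }
    }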

Here's an example. Consider a cluster of 12 nodes running YARN NodeManagers, each with 64 GB of RAM and 32 CPU cores (16 physical cores with hyperthreading). On each node you can start 2 executors with 26 GB of RAM each (leaving some RAM for system processes, the YARN NodeManager, and the DataNode), and give each executor 12 cores for tasks (again leaving some cores for system processes, the YARN NodeManager, and the DataNode). In total the cluster provides 12 machines * 2 executors per machine * 12 cores per executor / 1 core per task = 288 task slots, which means your Spark cluster can run up to 288 tasks in parallel, utilizing almost all of the resources available. The amount of RAM you can use for caching data on this cluster is 0.9 (spark.storage.safetyFraction) * 0.6 (spark.storage.memoryFraction) * 12 machines * 2 executors per machine * 26 GB per executor = 336.96 GB.

Not that much, but in most cases it is enough.
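For the cluster in the example, here is a hedged, spark-shell-style sketch of how that sizing might be expressed as executor settings (the application name is an illustrative assumption; the fractions mirror the spark.storage.* values quoted above):

    import org.apache.spark.SparkConf

    // Hypothetical settings for the 12-node cluster sized above:
    // 2 executors per node * 12 nodes = 24 executors, each with 12 cores and 26 GB.
    val conf = new SparkConf()
      .setAppName("cluster-sizing-sketch") // app name is an assumption for the sketch
      .set("spark.executor.instances", "24")
      .set("spark.executor.cores", "12")
      .set("spark.executor.memory", "26g")
      .set("spark.task.cpus", "1")

    // Reproducing the arithmetic from the example:
    val taskSlots = 24 * 12 / 1           // 288 tasks in parallel
    val cacheGB   = 0.9 * 0.6 * 24 * 26.0 // ≈ 336.96 GB available for caching
    println(s"task slots = $taskSlots, cache ≈ $cacheGB GB")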