SparkContext and SparkSession

SparkContext

SparkContext is the first object that a Spark program must create to access the cluster. In the spark-shell it is already available, either directly as sc or via spark.sparkContext.

Here's how you can programmatically create a SparkContext in your Scala code:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

// Build the configuration, then create the context explicitly.
val conf = new SparkConf().setAppName("my app").setMaster("master url")
val sc = new SparkContext(conf)
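
For example, here is a minimal sketch of what you can do once the context exists (the "local[*]" master and the sample numbers are placeholder assumptions, not part of the notes above):

import org.apache.spark.{SparkConf, SparkContext}

// Assumed setup: run locally using all available cores.
val conf = new SparkConf().setAppName("my app").setMaster("local[*]")
val sc = new SparkContext(conf)

// Use the context to build an RDD from an in-memory collection and run an action.
val numbers = sc.parallelize(1 to 10)
println(numbers.map(_ * 2).sum())   // 110.0

sc.stop()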

SparkSession - Starting from Spark 2.x

SparkContext, though still supported, was more relevant when working directly with RDDs. As you will see, different libraries provide different wrappers around SparkContext, for example HiveContext/SQLContext for Spark SQL, StreamingContext for Spark Streaming, and so on.

Since all of these libraries are moving toward Dataset/DataFrame, it makes sense to have a single unified entry point for them, and that is SparkSession. SparkSession is available as spark in the spark-shell.

Here's how you do it:

import org.apache.spark.sql.SparkSession

// SparkSession wraps SparkConf and SparkContext; the builder handles both.
val sparkSession = SparkSession.builder.master("master url").appName("my app").getOrCreate()
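
As a small sketch of why the single entry point is convenient (the local master, view name, and sample data below are assumptions for illustration), the same SparkSession serves the DataFrame API, SQL, and, through sparkContext, the RDD API:

import org.apache.spark.sql.SparkSession

// Assumed setup: local master instead of a real cluster URL.
val spark = SparkSession.builder.master("local[*]").appName("my app").getOrCreate()

// DataFrame entry point.
val df = spark.range(5).toDF("id")
df.show()

// SQL entry point.
df.createOrReplaceTempView("ids")
spark.sql("SELECT count(*) AS n FROM ids").show()

// The underlying SparkContext is still reachable when an RDD is needed.
println(spark.sparkContext.parallelize(Seq(1, 2, 3)).count())   // 3

spark.stop()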
