Spark and Scala (Application Development) Workshop
This Spark and Scala workshop gives you a practical, complete and, more importantly, hands-on introduction to the architecture of Apache Spark and to using Spark's Scala API (developers) and infrastructure (administrators, devops) effectively in your Big Data projects.
NOTE: Should you want a workshop about administration, monitoring, troubleshooting and fine-tuning of Apache Spark, check out .
The agenda is the result of the workshops I hosted in the following cities and a few online classes:
Toronto (3 classes)
Mississauga
Plymouth Meeting
Montreal
London
The workshop uses an intense code-first approach in which the modules start with just enough knowledge to get you going (mostly using scaladoc and live coding) and quickly move on to applying the concepts in programming assignments. There are a lot of them.
It comes with many practical sessions that should meet (and even exceed) the expectations of software developers (and perhaps administrators, operators, devops, and other technical roles like system architects or technical leads).
The workshop provides participants with practical skills to use the features of Apache Spark with Scala.
CAUTION: The Spark and Scala workshop is very hands-on and practical, i.e. not for the faint-hearted. Seriously! After 5 days your mind, eyes, and hands will all be trained to recognise the patterns of where and how to use Spark and Scala in your Big Data projects.
CAUTION: I have already trained people who expressed their concern that there were too many exercises. Your dear drill sergeant, Jacek.
5 days
Software developers who know Scala and want to explore the Spark space
Software developers with programming experience in a similar general-purpose programming language (Java, Python, Ruby)
Non-programming IT professionals like administrators, devops, system architects or technical leads who want to learn about Spark and Scala through their APIs.
After completing the workshop participants should be able to:
Use functional programming concepts in Scala
Describe Spark and the use cases for Apache Spark
Explain the RDD distributed data abstraction
Explore large datasets using interactive Spark Shell
Develop Spark applications using Scala API
Assemble and submit Spark applications to Spark clusters
Use Spark in local mode as well as Spark Standalone clusters
Install Spark development and runtime environments
Understand the concepts of Spark SQL (DataFrame, Dataset, UDF)
Understand the concepts of Spark Streaming (DStream, ReceiverInputDStream)
Understand the concepts of Spark MLlib (Pipeline API)
Understand the concepts of Spark GraphX (RDD-based Graph API)
Build processing pipelines using Spark's RDD and the higher-level abstractions in Spark SQL (DataFrame), Spark Streaming (DStream), and Spark MLlib (Pipeline API)
Explain the internals of Spark and its execution model
This module aims at introducing Scala and the tools, i.e. sbt and the Scala REPL, needed to complete the other Spark modules.
This module covers:
Scala REPL
Literals and Values
Basic Types (Numerics, Strings)
Type System
Imports
More Advanced Types (Tuples, Options)
Parameterized Types
Expressions and Conditions
Methods, Functions (and Procedures)
Using placeholders (_) in functions
Scala Collection API and Common Collection Types
Seqs, Lists, Sets, Maps
filter, map, flatMap, zipWithIndex, count
Implicits and Multiple Parameter Lists
Understanding method signatures
Case Classes, Objects, and Traits
Command-line Applications
Packages
Pattern Matching
Partial Functions
Using case to destructure input parameters
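The following minimal REPL-style sketch (with made-up data) pulls a few of the topics above together: case classes, the common collection operators, placeholders, and pattern matching with case:

```scala
// Case classes, collections, placeholders and pattern matching in one go
case class User(name: String, age: Int)

val users = Seq(User("Ada", 36), User("Linus", 28), User("Grace", 45))

// filter and map with the placeholder syntax (_)
val names = users.filter(_.age > 30).map(_.name)            // List(Ada, Grace)

// flatMap and zipWithIndex on a plain collection
val words   = Seq("spark and scala", "scala").flatMap(_.split("\\s+"))
val indexed = words.zipWithIndex                             // (spark,0), (and,1), ...

// Pattern matching with case to destructure input parameters
val described = users.map {
  case User(name, age) if age < 30 => s"$name is young"
  case User(name, _)               => s"$name is experienced"
}
```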
sbt - the build tool for Scala projects
Agenda:
Using sbt
The tasks: help, compile, test, package, update, ~, set, show, console
Tab experience
Configuration files and directories, i.e. the build.sbt file and the project directory
Adding new tasks to sbt through plugins
Global vs project plugins
Using sbt behind a proxy server
Proxy Repositories for sbt
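A minimal build.sbt for the workshop exercises could look like the sketch below; the Scala and Spark versions are assumptions and should match the binaries used in class:

```scala
// build.sbt - a minimal build definition for the Spark exercises
name := "spark-workshop"

version := "1.0"

scalaVersion := "2.10.6"  // assumption: matches the Spark 1.6 binaries used in class

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-sql"  % "1.6.1" % "provided"
```

Running sbt package then produces the jar that spark-submit expects.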
DataFrames
Exercise: Creating DataFrames
Seqs and toDF
SQLContext.createDataFrame and Explicit Schema using StructType
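A sketch of the two ways of creating DataFrames covered in the exercise, assuming a spark-shell session where sc, sqlContext and its implicits are available:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import sqlContext.implicits._

// 1. From a local Seq using toDF
val people = Seq(("Ada", 36), ("Linus", 28)).toDF("name", "age")

// 2. From an RDD of Rows with an explicit schema (StructType)
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)))
val rows = sc.parallelize(Seq(Row("Ada", 36), Row("Linus", 28)))
val peopleWithSchema = sqlContext.createDataFrame(rows, schema)
```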
DataFrames and Query DSL
Column References: col, $, ', dfName
Exercise: Using Query DSL to select columns
where
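The column references above are interchangeable; a short sketch, assuming the people DataFrame from the previous sketch:

```scala
import org.apache.spark.sql.functions.col
import sqlContext.implicits._   // enables the $"..." and 'symbol column syntax

people.select(col("name"), $"age", 'age + 1)
  .where($"age" > 30)
  .show()

// The dfName(...) form refers to a column through its DataFrame
people.select(people("name")).where(people("age") > 30).show()
```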
User-Defined Functions (UDFs)
functions object
Exercise: Manipulating DataFrames using functions
withColumn
UDFs: split and explode
Creating new UDFs
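A sketch of withColumn, the built-in split and explode functions, and a hand-rolled UDF (the sample data is made up):

```scala
import org.apache.spark.sql.functions.{explode, split, udf}
import sqlContext.implicits._

val lines = Seq("spark and scala", "hands-on workshop").toDF("line")

// Built-in functions: split a line into an array and explode it into rows
val words = lines
  .withColumn("words", split($"line", "\\s+"))
  .withColumn("word", explode($"words"))

// A custom UDF usable in the Query DSL
val toUpper = udf { s: String => s.toUpperCase }
words.select($"word", toUpper($"word") as "upper").show()
```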
DataFrameWriter and DataFrameReader
SQLContext.read and load
DataFrame.write and save
Exercise: WordCount using DataFrames (words per line)
SQLContext.read.text
SQLContext.read.format("text")
Exercise: Manipulating data from CSV using DataFrames
spark-submit --packages com.databricks:spark-csv_2.10:1.4.0
SQLContext.read.csv vs SQLContext.read.format("csv") or format("com.databricks.spark.csv")
count
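A sketch of DataFrameReader/DataFrameWriter usage, assuming a Spark 1.6-era spark-shell with the spark-csv package on the classpath (as in the spark-submit command above); the file names are made up:

```scala
// Plain text: a single string column (same as read.format("text").load("README.md"))
val readme = sqlContext.read.text("README.md")

// CSV via the spark-csv package (the csv format is built into Spark as of 2.0)
val people = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("people.csv")

people.count()

// Writing back out in another format
people.write.format("json").save("people.json")
```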
Aggregating
Exercise: Using groupBy and agg
Exercise: WordCount using DataFrames (words per file)
Windowed Aggregates (Windows)
Exercise: Top N per Group
Exercise: Revenue Difference per Category
Exercise: Running Totals
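A sketch of groupBy/agg and a windowed aggregate (a running total), with made-up data and column names:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, sum}
import sqlContext.implicits._

val orders = Seq(
  ("books", "2016-01-01", 10.0),
  ("books", "2016-01-02", 20.0),
  ("games", "2016-01-01", 15.0)).toDF("category", "day", "revenue")

// groupBy and agg
orders.groupBy($"category")
  .agg(sum($"revenue") as "total_revenue", max($"revenue") as "biggest_order")
  .show()

// Windowed aggregate: a running total of revenue per category, ordered by day
val byCategory = Window.partitionBy($"category").orderBy($"day")
orders.withColumn("running_total", sum($"revenue") over byCategory).show()
```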
Datasets
Exercise: WordCount using SQLContext.read.text
Exercise: Compute Aggregates using mapGroups
Word Count using Datasets
Caching
Exercise: Measuring Query Times using web UI
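A sketch of a Dataset-based word count using mapGroups, assuming the Spark 1.6-era Dataset API (the grouping method was renamed to groupByKey in Spark 2.x):

```scala
import sqlContext.implicits._

val lines = sqlContext.read.text("README.md").as[String]    // Dataset[String]

val wordCounts = lines
  .flatMap(_.split("\\s+"))
  .filter(_.nonEmpty)
  .groupBy(_.toLowerCase)                                    // grouped by word
  .mapGroups { (word, occurrences) => (word, occurrences.size) }

wordCounts.cache()   // cache and re-run the query to compare times in web UI
wordCounts.show()
```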
Accessing Structured Data using JDBC
Modern / New-Age Approach
Exercise: Reading Data from and Writing to PostgreSQL
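A sketch of the DataFrame-based ("modern") JDBC path; the URL, table names and credentials are placeholders, and the PostgreSQL driver has to be on the classpath (e.g. via --jars or --packages):

```scala
val props = new java.util.Properties()
props.setProperty("user", "postgres")          // placeholder credentials
props.setProperty("password", "secret")
props.setProperty("driver", "org.postgresql.Driver")

val url = "jdbc:postgresql://localhost:5432/sparkdb"   // placeholder URL

// Reading a table into a DataFrame and writing a DataFrame back
val projects = sqlContext.read.jdbc(url, "projects", props)
projects.write.jdbc(url, "projects_backup", props)
```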
Integration with Hive
Queries over DataFrames
sql
Registering UDFs
Temporary and permanent tables
registerTempTable
DataFrame.write and saveAsTable
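A sketch of SQL over DataFrames with a temporary table, a SQL UDF and saveAsTable (sample data made up; saveAsTable needs Hive support for a real metastore):

```scala
import sqlContext.implicits._

val people = Seq(("Ada", 36), ("Linus", 28)).toDF("name", "age")

// Temporary table - visible only to this SQLContext
people.registerTempTable("people")
sqlContext.sql("SELECT name, age FROM people WHERE age > 30").show()

// Registering a UDF for use in SQL queries
sqlContext.udf.register("initial", (s: String) => s.take(1))
sqlContext.sql("SELECT initial(name) AS initial FROM people").show()

// Permanent table via saveAsTable
people.write.saveAsTable("people_saved")
```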
DataFrame performance optimizations
Tungsten
Catalyst
Spark MLlib vs Spark ML
(old-fashioned) RDD-based API vs (the latest and greatest) DataFrame-based API
Transformers
Exercise: Using Tokenizer, RegexTokenizer, and HashingTF
Estimators and Models
Exercise: Using KMeans
Fitting a model and checking for spam
Exercise: Using LogisticRegression
Fitting a model and checking for spam
Pipelines
Exercise: Using Pipelines of Transformers and Estimators
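A sketch of a spam-detection-style Pipeline built from the Transformers and Estimators above; the training data is made up and the label/text column names follow spark.ml conventions:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import sqlContext.implicits._

val training = Seq(
  (0.0, "meeting at noon about the spark workshop"),
  (1.0, "you won a free prize click here")).toDF("label", "text")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model    = pipeline.fit(training)       // an Estimator produces a Model

model.transform(Seq((0.0, "free spark prize")).toDF("label", "text"))
  .select("text", "prediction")
  .show()
```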
Spark Streaming
Exercise: ConstantInputDStream in motion in Standalone Streaming Application
Input DStreams (with and without Receivers)
Exercise: Processing Files Using File Receiver
Word Count
Exercise: Using Text Socket Receiver
Exercise: Processing vmstat
Using Apache Kafka
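A minimal standalone streaming application for the text socket receiver exercise; the hostname and port are assumptions (feed it with e.g. nc -lk 9999):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SocketWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // ReceiverInputDStream reading lines from a TCP socket
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```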
Monitoring Streaming Applications using web UI (Streaming tab)
Exercise: Monitoring and Tuning Streaming Applications
"Sleeping on Purpose" in map
to slow down processing
Spark Streaming and Checkpointing (for fault tolerance and exactly-once delivery)
Exercise: Start StreamingContext from Checkpoint
State Management in Spark Streaming (Stateful Operators)
Exercise: Use mapWithState for stateful computation
Split lines into username and message to collect messages per user
Spark Streaming and Windowed Operators
Exercise: ???
Spark "Installation" and Your First Spark Application (using spark-shell)
Exercise: Counting Elements in Distributed Collection
SparkContext.parallelize
SparkContext.range
SparkContext.textFile
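A first spark-shell session for the counting exercise (the README.md path is relative to where spark-shell was started):

```scala
// Distributed collections created in three ways
val ints  = sc.parallelize(1 to 100)
val longs = sc.range(0, 100)
val lines = sc.textFile("README.md")

// count is an action - it triggers a Spark job you can inspect in web UI
ints.count()
longs.count()
lines.count()
```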
Using Spark’s Core APIs in Scala - Transformations and Actions
Exercise: Processing lines in README.md
filter, map, flatMap, foreach
Exercise: Gotchas with Transformations like zipWithIndex or sortBy
It may or may not submit a Spark job
Apply to RDDs of different number of partitions
Use web UI to see completed jobs
Using key-value pair operators
Exercise: Key-value pair operators
cogroup
flatMapValues
aggregateByKey
Exercise: Word Counter = Counting words in README.md
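A sketch of the key-value pair operators above plus the word counter, with made-up input data:

```scala
val scores = sc.parallelize(Seq(("ada", 3), ("linus", 5), ("ada", 7)))
val teams  = sc.parallelize(Seq(("ada", "red"), ("grace", "blue")))

// cogroup: groups values from both RDDs by key (keeps non-matching keys)
scores.cogroup(teams).collect()

// flatMapValues: expand every value, keeping the key
scores.flatMapValues(n => 1 to n).take(5)

// aggregateByKey: per-key (sum, count), e.g. to compute an average later
val sumAndCount = scores.aggregateByKey((0, 0))(
  (acc, n) => (acc._1 + n, acc._2 + 1),
  (a, b)   => (a._1 + b._1, a._2 + b._2))

// Word counter over README.md
sc.textFile("README.md")
  .flatMap(_.split("\\s+"))
  .filter(_.nonEmpty)
  .map((_, 1))
  .reduceByKey(_ + _)
  .take(10)
```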
Building, Deploying and Monitoring Spark Applications (using sbt, spark-submit, and web UI)
Exercise: A Complete Development Cycle of Spark Application
Processing Structured Data using RDDs
Traditional / Old-Fashioned Approach
Exercise: Accessing Data in CSV
Partitions
mapPartitionsWithIndex and foreachPartition
Example: FIXME
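A minimal sketch of the two partition-level operators, with made-up data:

```scala
val nums = sc.parallelize(1 to 10, 4)   // 4 partitions

// Tag every element with the index of the partition it lives in
nums.mapPartitionsWithIndex { (idx, it) => it.map(n => (idx, n)) }.collect()

// Run a side effect once per partition (e.g. one connection per partition)
nums.foreachPartition { it => println(s"partition with ${it.size} elements") }
```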
Accumulators
Exercise: Distributed Counter
Exercise: Using Accumulators and cogroup to Count Non-Matching Records as in leftOuterJoin
Ensure exactly-once processing despite task failures
Use TaskContext to track tasks
Exercise: Custom Accumulators
AccumulatorParam
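A sketch of a built-in accumulator used as a distributed counter plus a custom AccumulatorParam (the Spark 1.x API; Spark 2.x replaces it with AccumulatorV2):

```scala
import org.apache.spark.AccumulatorParam

// Built-in Int accumulator as a distributed counter
val badRecords = sc.accumulator(0, "bad records")
sc.textFile("README.md").foreach { line =>
  if (line.trim.isEmpty) badRecords += 1
}
println(badRecords.value)

// Custom accumulator collecting a set of distinct keys
object StringSetParam extends AccumulatorParam[Set[String]] {
  def zero(initial: Set[String]): Set[String] = Set.empty
  def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2
}
val seenKeys = sc.accumulator(Set.empty[String])(StringSetParam)
```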
Broadcast Variables
Submitting Spark Applications
run-example
spark-submit
Specifying memory requirements et al.
Exercise: Executing Spark Examples using run-example
Exercise: Executing Spark Example using spark-submit
Application Log Configuration
conf/log4j.properties
RDD-based Graph API
spark-shell --packages graphframes:graphframes:0.1.0-spark1.6
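A small sketch of the RDD-based Graph API with made-up vertices and edges (the graphframes package above is the DataFrame-based alternative):

```scala
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "Ada"), (2L, "Linus"), (3L, "Grace")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows")))

val graph = Graph(vertices, edges)
graph.inDegrees.collect()                                     // followers per vertex
graph.triplets.map(t => s"${t.srcAttr} follows ${t.dstAttr}").collect()
```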
Exercise: Stream Processing using Spark Streaming, Spark SQL and Spark MLlib (Pipeline API).
Training classes are best for groups up to 8 participants.
Experience in software development using a modern programming language (Scala, Java, Python, Ruby) is recommended. The workshop introduces Scala only to the extent needed to develop Spark applications using the Scala API.
Participants should have decent computers, preferably with a Linux or Mac OS operating system
Participants have to download the following packages to their computers before the class:
Optional downloads (have them ready):
The programming language used during the course is Scala. There is a one-day "crash course" on the language during the workshop. It is optional for skilled Scala developers who are familiar with the fundamentals of Scala and sbt.
This module requires Internet access to download sbt and plugins (unless you git clone the repository - see ).
Community Packages for Apache Spark
Exercise: Accessing Data in Apache Cassandra using the Spark Cassandra Connector
GraphFrames: DataFrame-based Graphs
(mostly with Spark SQL / Hive).
(or equivalent development environment)
Install
- download
by executing the following command: $SPARK_HOME/bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.0-M1-s_2.10
(optional) (or later) and (or later)
Participants are requested to git clone this project and follow .