spark notes


The reduce and fold Actions

The reduce and fold actions are aggregate actions, each of which executes a commutative and associative operation (such as summing a list of values) against an RDD.

Commutative and associative are the operative terms here: these properties make the result independent of the order in which elements are combined, which is essential for distributed processing because execution order across partitions cannot be guaranteed.

The commutative and associative characteristics are summarized here in general form:

commutative ⇒ x + y = y + x
associative ⇒ (x + y) + z = x + (y + z)
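To see why these properties matter in practice, here is a minimal plain-Python sketch (not Spark itself) of how a commutative and associative operation lets per-partition partial results be merged in any arrival order; the three-way split and the shuffle are illustrative assumptions:

```python
from functools import reduce
import random

data = list(range(1, 10))                       # 1..9, total 45
partitions = [data[0:3], data[3:6], data[6:9]]  # simulated partitions

# Each partition is reduced independently, as executors would do...
partials = [reduce(lambda x, y: x + y, p) for p in partitions]

# ...and because addition is commutative and associative, the partial
# results can be merged in whatever order they arrive.
random.shuffle(partials)
total = reduce(lambda x, y: x + y, partials)
print(total)  # 45 regardless of merge order
```

A non-commutative operation such as subtraction would break this: the total would depend on which partition's result arrived first.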

Take a look at the main Spark actions that perform aggregations.

reduce()

Syntax:

RDD.reduce(<function>)

The reduce action reduces the elements of an RDD using a specified commutative and associative operator. The function supplied specifies two inputs (lambda x, y: ...) that represent values in a sequence from the specified RDD.

numbers = sc.parallelize([1,2,3,4,5,6,7,8,9])

numbers.reduce(lambda x, y: x + y)

# 45
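The section title also mentions fold, which behaves like reduce but additionally takes a zeroValue that seeds each partition's aggregation and the final merge. A plain-Python sketch of those semantics (the three-partition split is an illustrative assumption):

```python
from functools import reduce

def simulated_fold(partitions, zero, op):
    # Per-partition aggregation, each seeded with the zero value...
    partials = [reduce(op, part, zero) for part in partitions]
    # ...then one final merge, also seeded with the zero value.
    return reduce(op, partials, zero)

partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(simulated_fold(partitions, 0, lambda x, y: x + y))  # 45
```

With a neutral zeroValue (0 for addition), fold and reduce agree; in PySpark this corresponds to numbers.fold(0, lambda x, y: x + y). A non-neutral zeroValue is applied once per partition plus once in the final merge, so the result would then depend on the partition count.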

reduceByKey is a transformation that returns an RDD, whereas reduce is an action that returns a value.
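The distinction can be sketched in plain Python (the sample pairs are illustrative): reduceByKey merges values sharing a key and yields a new keyed collection (in Spark, an RDD), while reduce collapses the whole dataset to a single value.

```python
from collections import defaultdict
from functools import reduce

pairs = [("a", 1), ("b", 2), ("a", 3)]

# reduceByKey-style: combine values per key -> a keyed result
by_key = defaultdict(int)
for k, v in pairs:
    by_key[k] += v
print(dict(by_key))  # {'a': 4, 'b': 2}

# reduce-style: a single value for the entire dataset
print(reduce(lambda x, y: x + y, (v for _, v in pairs)))  # 6
```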


Last updated 5 years ago
