Stages

Recall that Spark lazily evaluates transformations; transformations are not executed until an action is called. As mentioned previously, a job is defined by calling an action. The action may include one or several transformations, and wide transformations define the breakdown of jobs into stages.

Each stage corresponds to a shuffle dependency created by a wide transformation in the Spark program. At a high level, one stage can be thought of as the set of computations (tasks) that can each be computed on one executor without communication with other executors or with the driver. In other words, a new stage begins whenever network communication between workers is required; for instance, in a shuffle.

These dependencies that create stage boundaries are called ShuffleDependencies. As we discussed in “Wide Versus Narrow Dependencies”, shuffles are caused by those wide transformations, such as sort or groupByKey, which require the data to be redistributed across the partitions. Several transformations with narrow dependencies can be grouped into one stage.
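To make the stage boundary rule concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's scheduler: the function name split_into_stages and the WIDE set are hypothetical, and the set lists only the two wide transformations named above. It assumes a simple linear chain of transformations, whereas a real Spark lineage is a graph.

```python
# Toy model only: this is not Spark's scheduler, just an illustration of how
# wide transformations split a linear chain of operations into stages.

# Assumed classification for this sketch (the wide transformations named in the text).
WIDE = {"sort", "groupByKey"}


def split_into_stages(transformations):
    """Group a linear chain of transformation names into stages.

    A wide transformation needs shuffled input, so it opens a new stage.
    Narrow transformations are appended to whichever stage is current.
    """
    stages = [[]]
    for name in transformations:
        # Open a new stage at a shuffle, unless the current stage is still empty.
        if name in WIDE and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


lineage = ["flatMap", "map", "filter", "groupByKey", "map", "sort"]
stages = split_into_stages(lineage)
for i, stage in enumerate(stages):
    print(f"stage {i}: {stage}")
```

In this model the three narrow steps at the start share one stage, groupByKey opens a second stage that also absorbs the narrow map after it, and sort opens a third.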

As we saw in the word count example where we filtered stop words (Example 2-2), Spark can combine the flatMap, map, and filter steps into one stage since none of those transformations requires a shuffle. Thus, each executor can apply the flatMap, map, and filter steps consecutively in one pass of the data.
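The idea of applying several narrow steps in one pass can be sketched with Python generators. This is an illustration under assumptions, not the code from Example 2-2 and not how Spark is implemented: run_narrow_stage, STOP_WORDS, and the sample partition are all hypothetical, and a single list stands in for one partition handled by one task.

```python
# Toy model of one task processing one partition; plain Python, not Spark.
STOP_WORDS = {"the", "a", "of"}  # hypothetical stop-word list


def run_narrow_stage(partition):
    """Chain flatMap, map, and filter lazily, one record at a time."""
    words = (w for line in partition for w in line.split())  # flatMap: line -> words
    lowered = (w.lower() for w in words)                     # map: normalize case
    kept = (w for w in lowered if w not in STOP_WORDS)       # filter: drop stop words
    # Nothing has run yet; consuming the last generator pulls each record
    # through all three steps in a single pass over the partition.
    return list(kept)


partition = ["The quick fox", "a tale of the fox"]
result = run_narrow_stage(partition)
print(result)  # ['quick', 'fox', 'tale', 'fox']
```

Because each step only needs the record handed to it by the previous step, no intermediate collection is built between them and no data from other partitions is required, which is why such steps can share a stage.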
