Tasks
A stage consists of tasks. The task is the smallest unit in the execution hierarchy, and each one can represent one local computation. All of the tasks in one stage execute the same code, each on a different piece of the data. One task cannot be executed on more than one executor. However, each executor has a dynamically allocated number of slots for running tasks and may run many tasks concurrently throughout its lifetime. The number of tasks per stage corresponds to the number of partitions in the output RDD of that stage.
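A minimal sketch of this relationship, assuming a local SparkSession, shows that the partition count of an RDD is what determines how many tasks its stage launches:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TasksPerStage")
  .master("local[4]")
  .getOrCreate()

// Create an RDD with 8 partitions; the stage that computes it runs 8 tasks.
val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)
println(rdd.getNumPartitions) // 8 -> 8 tasks for this stage

// repartition introduces a shuffle; the following stage has 3 partitions -> 3 tasks.
val repartitioned = rdd.repartition(3)
println(repartitioned.getNumPartitions) // 3
```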
A cluster cannot necessarily run every task in parallel for each stage. Each executor has a number of cores.
The number of cores per executor is configured at the application level, but it likely corresponds to the physical cores on the cluster.
Spark can run no more tasks at once than the total number of executor cores allocated for the application. We can calculate this limit from the SparkConf settings as total number of executor cores = number of cores per executor × number of executors. If there are more partitions (and thus more tasks) than the number of available task slots, the extra tasks are allocated to the executors as the first round of tasks finishes and resources become available. In most cases, all the tasks for one stage must be completed before the next stage can start.
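The arithmetic can be read directly from the configuration. The sketch below uses the standard properties spark.executor.cores and spark.executor.instances; the particular values and fallback defaults are illustrative assumptions:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.instances", "10") // number of executors (illustrative value)
  .set("spark.executor.cores", "4")      // cores, i.e. task slots, per executor

val coresPerExecutor = conf.get("spark.executor.cores", "1").toInt
val numExecutors     = conf.get("spark.executor.instances", "1").toInt

// Total executor cores = cores per executor × number of executors.
// This is the most tasks the application can run at any one time.
val totalTaskSlots = coresPerExecutor * numExecutors // 4 × 10 = 40
println(s"At most $totalTaskSlots tasks can run concurrently")
```

With 40 slots and, say, 200 partitions in a stage, the stage runs in roughly five waves of tasks rather than all at once.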
The process of distributing these tasks is done by the TaskScheduler and varies depending on whether the fair scheduler or the FIFO scheduler is used.
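The scheduling policy is selected with the spark.scheduler.mode property, which accepts FIFO (the default) or FAIR. A minimal sketch, assuming a SparkSession is being built for the application:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("FairScheduling")
  .config("spark.scheduler.mode", "FAIR") // use the fair scheduler instead of FIFO
  // Optionally point at a pool-definition file (the path here is a placeholder):
  // .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  .getOrCreate()
```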