Pair RDDs

Pair RDDs behave much like basic RDDs. The main difference is that their elements are key-value pairs, which are a natural fit for distributed computing problems.

Pair RDDs are heavily used for various kinds of aggregations and for the initial steps of Extract, Transform, Load (ETL) procedures.
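To make the idea of a per-key aggregation concrete, here is a minimal sketch in plain Python (no Spark required) of what an operation such as Spark's `reduceByKey` computes over key-value pairs. The helper `reduce_by_key` and the sample data are hypothetical and only illustrate the semantics; in Spark the work would be distributed across partitions rather than done in a single loop.

```python
def reduce_by_key(pairs, func):
    """Combine all values that share a key using func.

    Illustrative stand-in for what Spark's reduceByKey computes;
    this runs locally in one process, not on a cluster.
    """
    result = {}
    for key, value in pairs:
        # First value seen for a key is stored as-is; later ones are merged.
        result[key] = func(result[key], value) if key in result else value
    return result


# Hypothetical sample data: (page, visits) key-value pairs.
page_visits = [("home", 3), ("about", 1), ("home", 5), ("docs", 2), ("about", 4)]

# Sum the visits per page.
totals = reduce_by_key(page_visits, lambda a, b: a + b)
print(totals)
```

With PySpark, the comparable call on an existing `SparkContext` named `sc` would be roughly `sc.parallelize(page_visits).reduceByKey(lambda a, b: a + b)`.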

Performance-wise, there’s one important, rarely mentioned fact about them: Pair RDDs don’t spill to disk. Only basic RDDs can spill to disk. This means that a single Pair RDD must fit into memory. If the content of a Pair RDD is larger than the RAM of the node with the least memory in the cluster, it can’t be processed.
