Data Locality with RDDs
By default, Spark tries to read data into an RDD from nodes that are close to that data. Because Spark usually works with distributed, partitioned data (such as data in HDFS or S3), it creates RDD partitions that map to the underlying blocks of the distributed file system, which lets it optimize transformation operations by scheduling tasks close to the data they read.
Figure 6.1 depicts how blocks from a file in a distributed file system (like HDFS) are used to create RDD partitions on workers, which are co-located with the data.
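The short Scala sketch below illustrates this mapping. It is a minimal example, not code from the book: the HDFS path is hypothetical, and it assumes a running Spark application with access to that file. It prints the number of partitions created for the file (one per block) and the preferred locations Spark records for each partition, which are the hosts holding the corresponding block.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DataLocalityExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-data-locality")
    val sc = new SparkContext(conf)

    // Hypothetical HDFS path; each block of the file becomes one RDD partition.
    val lines = sc.textFile("hdfs:///data/events/events.log")

    // One partition per underlying block of the distributed file.
    println(s"Number of partitions: ${lines.getNumPartitions}")

    // For each partition, Spark records the hosts that store the backing block,
    // so the scheduler can run tasks on (or near) those nodes.
    lines.partitions.foreach { p =>
      println(s"Partition ${p.index} preferred locations: ${lines.preferredLocations(p)}")
    }

    sc.stop()
  }
}
```

Running this against a multi-block file typically shows several partitions, each reporting the worker hosts that are co-located with its block, which is the behavior Figure 6.1 depicts.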