Data Locality with RDDs
By default, Spark tries to read data into an RDD from nodes that are close to that data. Because Spark usually works with distributed, partitioned data (such as data in HDFS or S3), it creates RDD partitions that map to the underlying blocks of the distributed file system, which lets it optimize transformation operations by scheduling tasks close to the data they read.
Figure 6.1 depicts how blocks from a file in a distributed file system (like HDFS) are used to create RDD partitions on workers, which are co-located with the data.
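The short Scala sketch below illustrates this mapping. It is a minimal example, not code from the book: the HDFS path is hypothetical, and it assumes a running Spark application with access to that file. It prints the number of partitions created for the file (one per block) and the preferred locations Spark records for each partition, which are the hosts holding the corresponding block.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DataLocalityExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-data-locality")
    val sc = new SparkContext(conf)

    // Hypothetical HDFS path; each block of the file becomes one RDD partition.
    val lines = sc.textFile("hdfs:///data/events/events.log")

    // One partition per underlying block of the distributed file.
    println(s"Number of partitions: ${lines.getNumPartitions}")

    // For each partition, Spark records the hosts that store the backing block,
    // so the scheduler can run tasks on (or near) those nodes.
    lines.partitions.foreach { p =>
      println(s"Partition ${p.index} preferred locations: ${lines.preferredLocations(p)}")
    }

    sc.stop()
  }
}
```

Running this against a multi-block file typically shows several partitions, each reporting the worker hosts that are co-located with its block, which is the behavior Figure 6.1 depicts.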