Loading Data into RDDs
Loading data from a file or files
Loading data from a datasource (such as a SQL or NoSQL datastore)
Loading data programatically
Loading data from a stream
Creating an RDD Programatically
It’s also possible to create an RDD in process from data in your program(lists, arrays, or collections). The data from your collection is partitioned and distributed in much the same way as the previous methods discussed.
However, creating RDDs this way can be limiting because it requires all of the dataset to exist or be created in memory on one system.
The following methods allow you to create RDDs from lists in your program:
parallelize
range
parallelize()
Syntax:
sc.parallelize(c, numSlices=None)
The parallelize() method assumes that you have a list created already and you supply this as the c (for collection) argument. The numSlices argument specifies the desired number of partitions to create. An example of the parallelize() method is shown
parallelrdd = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8])
parallelrdd
# notice the type of RDD created
# ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:423
parallelrdd.min()
# will return 0 as this was the min value in our list
parallelrdd.max()
# will return 8 as this was the max value in our list
parallelrdd.collect()
# will return the parallel collection as a list
# [0, 1, 2, 3, 4, 5, 6, 7, 8]
range()
Syntax:
sc.range(start, end=None, step=1, numSlices=None)
The range() method will generate a list for you and create and distribute the RDD. The start, end, and step arguments define the sequence of values and the numSlices specifies the desired number of partitions.
# create an RDD using the range() function
# with 1000 integers starting at 0 in increments of 1
# across 2 partitions
rangerdd = sc.range(0, 1000, 1, 2)
rangerdd
# note the PythonRDD type, as range is a native Python
function
# PythonRDD[1] at RDD at PythonRDD.scala:43
rangerdd.getNumPartitions()
# should return 2 as we requested numSlices=2
rangerdd.min()
# should return 0 as this was out start argument
rangerdd.max()
# should return 999 as this is 1000 increments of 1 starting from 0
rangerdd.take(5)
# should return [0, 1, 2, 3, 4]
Last updated
Was this helpful?