Loading Data into RDDs

  • Loading data from a file or files

  • Loading data from a datasource (such as a SQL or NoSQL datastore)

  • Loading data programmatically

  • Loading data from a stream

Creating an RDD Programmatically

It’s also possible to create an RDD in process from data in your program (lists, arrays, or collections). The data from your collection is partitioned and distributed in much the same way as with the methods discussed previously.

However, creating RDDs this way can be limiting because it requires the entire dataset to exist, or be created, in memory on a single system.

The following methods allow you to create RDDs from lists in your program:

  • parallelize

  • range

parallelize()

Syntax:

sc.parallelize(c, numSlices=None)

The parallelize() method assumes that you have already created a list, which you supply as the c (for collection) argument. The numSlices argument specifies the desired number of partitions to create. An example of the parallelize() method is shown below.

parallelrdd = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8])

parallelrdd

# notice the type of RDD created

# ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:423

parallelrdd.min()

# will return 0 as this was the min value in our list

parallelrdd.max()

# will return 8 as this was the max value in our list

parallelrdd.collect()

# will return the parallel collection as a list

# [0, 1, 2, 3, 4, 5, 6, 7, 8]
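
The effect of the numSlices argument can be pictured with a plain-Python model that does not call Spark at all. In the sketch below, slice_collection is a hypothetical helper (not part of the Spark API), and the evenly spaced index boundaries it uses are an assumption made for illustration rather than a statement of how Spark lays out partitions internally.

```python
def slice_collection(c, num_slices):
    """Split a collection into num_slices contiguous chunks (illustrative model only)."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    items = list(c)
    n = len(items)
    # Boundary i sits at floor(i * n / num_slices), so chunk sizes differ by at most one.
    bounds = [(i * n) // num_slices for i in range(num_slices + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_slices)]


# The same nine values used in the parallelize() example, cut into 3 chunks.
partitions = slice_collection([0, 1, 2, 3, 4, 5, 6, 7, 8], 3)
print(partitions)
# [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```

In PySpark itself, the contents of each partition can be inspected with the glom() method, for example sc.parallelize(data, 3).glom().collect(), which returns one nested list per partition.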

range()

Syntax:

sc.range(start, end=None, step=1, numSlices=None)

The range() method generates the sequence of values for you, then creates and distributes the RDD. The start, end, and step arguments define the sequence of values, and numSlices specifies the desired number of partitions.
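
The start, end, and step arguments follow the same conventions as Python's built-in range() function, so the end value is exclusive. The short sketch below uses only the built-in function as a stand-in for the values an RDD would hold; the variable names are illustrative and no Spark call is made.

```python
# Values described by start=0, end=1000, step=1: end itself is not included.
values = list(range(0, 1000, 1))
print(len(values))               # 1000
print(min(values), max(values))  # 0 999

# A step of 2 keeps every second value, still stopping before end.
evens = list(range(0, 10, 2))
print(evens)                     # [0, 2, 4, 6, 8]

# A single argument is treated as end, with start defaulting to 0.
first_five = list(range(5))
print(first_five)                # [0, 1, 2, 3, 4]
```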

# create an RDD using the range() function

# with 1000 integers starting at 0 in increments of 1

# across 2 partitions

rangerdd = sc.range(0, 1000, 1, 2)

rangerdd

# note the PythonRDD type, as range is a native Python function

# PythonRDD[1] at RDD at PythonRDD.scala:43

rangerdd.getNumPartitions()

# should return 2 as we requested numSlices=2

rangerdd.min()

# should return 0 as this was our start argument

rangerdd.max()

# should return 999, the last of the 1000 values starting from 0 (end is exclusive)

rangerdd.take(5)

# should return [0, 1, 2, 3, 4]
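
To picture how the 1000 values in the example above might be spread over the two requested partitions, the sketch below models the split in plain Python. The range_slices helper is hypothetical, and the boundary rule it uses (evenly spaced positions in the sequence) is an assumption for illustration; the boundaries Spark actually chooses may differ.

```python
def range_slices(start, end, step, num_slices):
    """Cut range(start, end, step) into num_slices contiguous sub-ranges (model only)."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    size = len(range(start, end, step))

    def slice_start(i):
        # First value of slice i: skip floor(i * size / num_slices) elements from start.
        return start + (i * size // num_slices) * step

    # Each slice is a lazy range object, so no values are materialized here.
    return [range(slice_start(i), slice_start(i + 1), step) for i in range(num_slices)]


slices = range_slices(0, 1000, 1, 2)
for index, chunk in enumerate(slices):
    # Print the slice index, its first value, its last value, and its length.
    print(index, chunk[0], chunk[-1], len(chunk))
# 0 0 499 500
# 1 500 999 500
```

Because each slice is described by its own start, end, and step, a sequence of this kind can be split without first building the whole list in one place.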
