Flow

If you’d like to follow along, now is the time to start up your Spark shell. You can start a local cluster in the spark-in-action VM or connect to a Spark standalone, YARN, or Mesos cluster, if one is available to you (for details, see chapters 10, 11, and 12).

In any case, make sure to have more than one core available to your executors, because each Spark Streaming receiver has to use one core (technically, it’s a thread) for reading the incoming data, and at least one more core needs to be available for performing the calculations of your program.

  1. Open the Spark shell: $ spark-shell --master local[4]

  2. Once your shell is up, the first thing you need to do is create an instance of StreamingContext. From your Spark shell, you instantiate it using the SparkContext object (available as the variable sc) and a Duration object, which specifies the time interval at which Spark Streaming should split the input stream data and create mini-batch RDDs.
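
     For example, from the Spark shell this can look like the following sketch (the 5-second mini-batch interval is only an illustrative choice):

       // run inside spark-shell, where sc is already available
       import org.apache.spark.streaming._

       // mini-batch RDDs will be created from the input stream every 5 seconds
       val ssc = new StreamingContext(sc, Seconds(5))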

  3. For streaming textual data directly from files, StreamingContext provides the textFileStream method, which monitors a directory (any Hadoop-compliant directory, such as HDFS, S3, GlusterFS, or a local directory) and reads each newly created file in that directory. The method takes only one argument: the name of the directory to be monitored.

    1. “Newly created” means it won’t process files that already exist in the folder when the streaming context starts, nor will it react to data that’s appended to an existing file. It will process only the files copied or moved into the folder after processing starts.
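
     For example, once the streaming context has been started (step 5), one way to make a file show up as newly created is to write it somewhere outside the monitored folder first and then move it in, so Spark Streaming sees one complete new file. A sketch under that assumption (the paths, file name, and content are made up):

       import java.nio.file.{Files, Paths}
       import java.nio.charset.StandardCharsets

       // write the data to a temporary file outside the monitored directory...
       val tmp = Paths.get("/home/spark/orders-batch.txt.tmp")
       Files.write(tmp, "a line of input\n".getBytes(StandardCharsets.UTF_8))

       // ...then move (rename) it into the monitored directory, where the next mini-batch picks it up
       Files.move(tmp, Paths.get("/home/spark/ch06input/orders-batch.txt"))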

  4. Create a DStream object with val fileDstream = ssc.textFileStream("/home/spark/ch06input"). The resulting fileDstream object is an instance of the class DStream.

    1. DStream (which stands for “discretized stream”) is the basic abstraction in Spark Streaming, representing a sequence of RDDs, periodically created from the input stream.

    2. Needless to say, DStreams are lazily evaluated, just like RDDs. So when you create a DStream object, nothing happens yet. The RDDs will start coming in only after you start the streaming context.
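
     Before starting the streaming context, you would normally attach transformations and at least one output operation to the DStream, so Spark Streaming has something to execute for each mini-batch. A minimal sketch (the word count is purely illustrative):

       // count the occurrences of each word in every mini-batch
       val counts = fileDstream.flatMap(_.split(" "))
                               .map(word => (word, 1))
                               .reduceByKey(_ + _)

       // an output operation: print the first few elements of each batch's RDD on the driver
       counts.print()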

  5. Finally, you get to see the fruit of your labor. Start the streaming computation by issuing the following command: ssc.start().

    1. This starts the streaming context, which evaluates the DStreams it was used to create, starts their receivers, and starts running the programs the DStreams represent. In the Spark shell, this is all you need to do to run the streaming computation of your application. Receivers are started in separate threads, and you can still use your Spark shell to enter and run other lines of code in parallel with the streaming computation.

    2. Although you can construct several StreamingContext instances using the same SparkContext object, you can’t start more than one StreamingContext in the same JVM at a time.
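
     In the shell, starting the computation and then eventually stopping it without shutting down the underlying SparkContext might look like the following sketch:

       ssc.start()   // starts the receivers and the streaming computation in background threads

       // in a standalone application you would typically also call ssc.awaitTermination()

       // ...later, when you're done (or before starting another StreamingContext in this JVM):
       ssc.stop(stopSparkContext = false)   // stop streaming but keep sc so the shell stays usable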
