Spark Streaming

Spark Streaming takes blocks of data that arrive during fixed time intervals and packages them as RDDs.

Data can come into a Spark Streaming job from various external systems. These include filesystems and TCP/IP socket connections, but also other distributed systems such as Kafka, Flume, Twitter, and Amazon Kinesis.
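
A minimal sketch of such a streaming job, assuming a local run and a receiver-based TCP socket source; the host, port, and 5-second batch interval are placeholder values chosen for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // Local run with 2 threads: one for the receiver, one for processing.
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")

    // The batch interval (5 seconds here) controls how incoming data is
    // split into mini-batch RDDs: one RDD per interval.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Receiver-based TCP socket source; host and port are placeholders.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Ordinary RDD-style transformations, applied to every mini-batch.
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The batch interval passed to the StreamingContext is what determines how the incoming stream is sliced into mini-batch RDDs.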

Different Spark Streaming receiver implementations exist for different sources (data from some sources is read without receivers, but let's not complicate things too early).

Receivers know how to connect to the source, read the data, and forward it into Spark Streaming. Spark Streaming then splits the incoming data into mini-batch RDDs, one mini-batch RDD per time period, and the application processes them according to the logic built into it. During mini-batch processing you are free to use other parts of the Spark API, such as machine learning and SQL. The results of the computations can be written to filesystems, relational databases, or other distributed systems.
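
As a hedged example of mixing other parts of the Spark API into mini-batch processing, the sketch below applies a Spark SQL query to each mini-batch of word counts and writes the result to a filesystem sink; the counts DStream, outputDir, column names, and per-batch directory layout are all assumptions made for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

// Process each mini-batch of (word, count) pairs with Spark SQL and
// write the result to a filesystem sink (one directory per batch).
def processAndSave(counts: DStream[(String, Int)], outputDir: String): Unit = {
  counts.foreachRDD { (rdd, time) =>
    // foreachRDD runs on the driver once per batch interval.
    val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._

    val df = rdd.toDF("word", "cnt")
    df.createOrReplaceTempView("word_counts")

    // Any Spark SQL query can be applied to the current mini-batch.
    val top = spark.sql("SELECT word, cnt FROM word_counts ORDER BY cnt DESC LIMIT 10")

    // outputDir is a placeholder path; the results could just as well go to a
    // relational database or another distributed system.
    top.write.mode("overwrite").json(s"$outputDir/batch-${time.milliseconds}")
  }
}
```

Obtaining the SparkSession inside foreachRDD (rather than inside worker-side closures) is the usual pattern, since foreachRDD executes on the driver once per batch.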