
Spark Streaming APIs

 

Streaming applications and some common designs for such applications were discussed in the last two articles: article1 and article2. Spark, a powerful processing engine for batch data, also provides support for streaming data. This support comes in the form of two related streaming APIs.

The earliest stream processing API in Spark is the DStream API, and it takes the micro-batch processing approach rather than record-by-record processing. The second design choice the DStream API makes is that it follows the declarative programming approach rather than the imperative one. Finally, the DStream API only supports processing-time processing; for event-time use cases, we have to implement our own handling depending upon the use case. One important distinction of the DStream API is that it works on Java or Python objects instead of the familiar table-like abstractions such as DataFrame or Dataset. Since these objects are tied to their language, the amount of optimization Spark can apply is smaller than what it can achieve with batch DataFrames or Datasets.
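To make the object-based, micro-batch nature of the API concrete, here is a minimal DStream word-count sketch in PySpark; the socket source on localhost:9999 and the 5-second batch interval are assumptions for illustration, not part of the original article.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Local SparkContext and a StreamingContext with 5-second micro-batches
# (the batch interval is an assumption for this sketch).
sc = SparkContext("local[2]", "DStreamWordCount")
ssc = StreamingContext(sc, 5)

# Hypothetical source: lines of text arriving on a local socket.
lines = ssc.socketTextStream("localhost", 9999)

# The DStream API operates on plain Python objects (strings, tuples),
# not on DataFrames or Datasets.
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()

ssc.start()
ssc.awaitTermination()
```

Each micro-batch of records is processed as a small RDD of Python objects, which is why Spark cannot apply the same optimizations it uses for DataFrames and Datasets.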


The latest stream processing API in Spark is called Structured Streaming. The name comes from the fact that the API is built on the familiar Structured APIs, and it is available in every language in which structured processing is supported. Because Structured Streaming is based on the Structured APIs, the two share many advantages, including the declarative style of operations and cost-based optimizations. Structured Streaming supports event-time processing out of the box. Until Spark 2.2 only micro-batch execution was supported; from Spark 2.3 there is also support for continuous processing. Because Spark supports multiple stream processing patterns, managing a streaming application becomes simpler.

One main characteristic of Structured Streaming is that it is unified. Unified, in this context, means that the operations we want to perform can be expressed once and applied to both stream and batch processing. Because of this characteristic, we can reuse the same code written for batch in streaming, which allows us to build end-to-end continuous applications with Spark that combine streaming, batch and interactive queries. What all of the above means when we actually write code is that we write a normal DataFrame (or SQL) computation and launch it on a batch or stream source. For a stream, Structured Streaming automatically updates the result of this computation incrementally as data arrives. This is a major help when writing end-to-end data applications: developers do not need to maintain separate streaming versions of their batch code, and they do not have to worry about keeping the two versions in sync.
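As a rough sketch of what "write once, run on batch or stream" looks like, the example below applies the same DataFrame transformation to a static read and to a streaming read; the input path, the JSON format and the action column are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UnifiedExample").getOrCreate()

# Hypothetical input directory of JSON event files.
batch_df = spark.read.json("/data/events/")

# Streaming version of the same source, read as an unbounded table.
stream_df = spark.readStream.schema(batch_df.schema).json("/data/events/")

# The transformation is expressed once and applied to both.
def counts_by_action(df):
    return df.groupBy("action").count()

# Batch: runs once and returns a finished result.
counts_by_action(batch_df).show()

# Stream: Structured Streaming incrementally updates the same result
# as new files arrive in the directory.
query = (counts_by_action(stream_df)
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```

The only difference between the two paths is how the source is read and how the result is written; the computation itself is shared.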


By now it should be evident that Structured Streaming is an upgrade over the DStream API, but that does not mean it has stopped evolving. Structured Streaming keeps changing, so some of the facts stated in this article might not hold true in the future.


There are many more features that make Structured Streaming a strong candidate for a stream processing solution. Structured Streaming can read data from, and write data to, the standard sources and sinks supported by the Structured APIs. One advantage of this is that it opens the door to using these sinks for storing state. A conspicuous similarity between Structured Streaming and the Structured APIs is that incoming data is treated as a table. To be more specific, Structured Streaming treats a stream of data as a table to which data is continuously appended. The job periodically checks for new data; once new data is received, it is processed, any required updates to the internal state are applied, and then the result itself is updated. So we can say that Structured Streaming takes care of the "incrementalization" of our query, and remember that the query is the same one we would write for batch. There are some limitations on which queries can be run on streaming data, which will be covered in detail in another article. There are also some considerations to keep in mind when designing a streaming application, such as dealing with out-of-order data.
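To illustrate the incremental model and one common way of handling out-of-order data, here is a minimal event-time sketch; the built-in rate source, the window and watermark durations, and the checkpoint path are all assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("WatermarkSketch").getOrCreate()

# Hypothetical source: the built-in rate source generates rows with
# a "timestamp" and a "value" column.
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# Event-time windowed count; the watermark tells Spark how long to wait
# for late, out-of-order data before finalizing a window and dropping
# its state.
windowed = (events
            .withWatermark("timestamp", "10 minutes")
            .groupBy(F.window("timestamp", "5 minutes"))
            .count())

# The checkpoint location persists internal state across restarts
# (the path here is hypothetical).
query = (windowed.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/windowed_counts")
         .start())
query.awaitTermination()
```

As new micro-batches arrive, only the affected windows are recomputed and emitted, which is the "incrementalization" described above.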


Because Spark provides solutions for a variety of big-data problems, we can build applications that leverage all of them and end up using a single framework to satisfy different needs, be it batch processing, stream processing or machine learning workloads.


Conclusion


I hope this article was a good place to start. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article on this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check it out whenever you need a reference.



Happy Learning! 😀

