
Stream Processing

Stream processing involves reading from at least one unbounded source. An unbounded source has no defined beginning or end, so data that keeps arriving is continuously folded into the computation of the result. Normal batch processing, in contrast, reads a fixed-size dataset completely and only then performs the computations, so the results are generated once for every operation in a job. With an unbounded source, since the data has no determined end, the processing system cannot wait for the entire dataset to arrive before processing it. Stream processing systems provide multiple approaches to tackle this problem, and those approaches will be much of the focus of this article.


One common processing pattern is to logically group elements into fixed-interval windows based on the time they were created (in the source system) or received (in the processing system). The computation is then performed once per window, so the same computation produces many outputs over time. Stream processing can also involve joining the unbounded data with a batch data source. With that in mind, it is worth understanding some stream processing use cases before diving into the different types of solutions.
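Before moving on to those use cases, here is a minimal sketch of what fixed-interval windowing can look like in Spark Structured Streaming (one possible engine, used here purely for illustration); the rate source and the 1-minute window size are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.appName("windowing-sketch").getOrCreate()

# The rate source emits rows with a "timestamp" and a "value" column.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Group incoming rows into 1-minute windows based on the time they were created,
# so the count is produced once per window rather than once per job.
counts = events.groupBy(window(events.timestamp, "1 minute")).agg(count("*").alias("events"))

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()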


Use Cases: 

Stream processing is applicable in situations where time is of the essence. Wherever processed data is needed within a short time of the source data arriving, an implementation of stream processing is required. One major use case is notifications and alerting. In the domain of banking, transactions exceeding a certain amount warrant a notification to the account holder as soon as the transaction occurs. The time between the transaction and the notification is crucial, because it determines how quickly the account holder can take further action in the case of an unauthorized transaction. Real-time reporting after a product launch requires the data to be processed as soon as it arrives, and real-time reporting enables real-time decision making, especially in the case of credit card transactions: the data can be fed to an application or ML model which determines whether the activity seems fraudulent and, based on that, the card can be blocked. Real-time reporting is also central to most online games, where organizations analyze user actions and behavior.


Advantages:

Before getting into the advantages of stream processing, it is important to know two widely used terms in the domain of stream processing.


Latency: the delay between the time data arrives and the time it actually gets processed.


Throughput: the number of records or messages that the system can process in a given amount of time.


Stream processing in general enables lower latency than traditional batch processing, so applications that need to respond quickly can benefit from this advantage. Batch processing, on the other hand, provides higher throughput than many streaming systems, so there is a tradeoff between latency and throughput along the extremes of stream and batch processing. Stream processing can also be more efficient at updating a result than repeated batch jobs. Because the streaming job runs continuously, it “incrementalizes” the computation by storing state, which substitutes for re-reading the entire dataset every time new data arrives, as a batch job would have to. The efficiency is conspicuous when the data source has a high-velocity characteristic. Implementing incremental logic in stream processing is almost effortless; to achieve the same performance with batch, we would have to implement the incremental logic by hand and from scratch.
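As a toy illustration of that difference (plain Python rather than any particular engine, and the numbers are made up), the batch approach re-reads the full history on every arrival, while the streaming approach only folds the new records into a small piece of state:

# "batches" simulates data arriving over time.
batches = [[3, 5], [2], [7, 1, 4]]

# Batch style: every time new data arrives, re-read everything seen so far.
seen = []
for batch in batches:
    seen.extend(batch)
    batch_total = sum(seen)      # recomputed over the full history each time

# Streaming style: keep a running total as state and update it incrementally.
state = 0
for batch in batches:
    state += sum(batch)          # touches only the newly arrived records

assert batch_total == state == 22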


All these advantages do not come without challenges in achieving them. Stream processing is challenging due to many factors, and the following sections scratch the surface of those challenges.


Challenges:


The majority of stream processing use cases are concerned with the time when a record or event was generated, rather than when it is actually processed. This is mostly the case with batch data as well. But dealing with this time, which can be called event time, can become difficult depending on our source and the delivery of the event. Since none of the systems involved, for instance the system that generates the event or the systems that transmit it, are ideal, events or records are bound to be delivered with a delay from the time they were generated. It would all be simple if the problem ended there, but unfortunately there are other issues as well. One such problem is messages arriving in a different order than the one in which they were created. So processing late and out-of-order data based on application timestamps (event time) becomes a problem that the application developer has to address. The problem is worsened when our business logic depends on the order of the data. For example, if we want to order the records by the event timestamp column, then the order becomes crucial, and we also have to decide how long the processing should wait before ordering a subset of the data. There should be a mechanism to define late data and a threshold based on which records can be accepted as late data.
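Spark Structured Streaming, for instance, expresses such a threshold as a watermark. The sketch below is a minimal illustration, and the source, the column names and the 10-minute threshold are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, sum as sum_

spark = SparkSession.builder.appName("watermark-sketch").getOrCreate()

# Assume a streaming DataFrame with an "event_time" timestamp and an "amount" column.
events = (spark.readStream.format("rate").load()
          .withColumnRenamed("timestamp", "event_time")
          .withColumnRenamed("value", "amount"))

# Accept records that arrive up to 10 minutes after the latest event time seen so far;
# anything later than that is considered too late and is dropped.
windowed = (events
            .withWatermark("event_time", "10 minutes")
            .groupBy(window("event_time", "5 minutes"))
            .agg(sum_("amount").alias("total_amount")))

windowed.writeStream.outputMode("update").format("console").start()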


Before we get into the next challenge, it is important to understand what a stateful operation is. A stateful operation is an operation in which one row of the result requires multiple rows to be processed. One such example is the sum() operation: the sum of a column over a group/partition/window returns one row for each group/partition/window, and each group/partition/window contains a set of records. Windows, which were briefly discussed above, apply to stream processing, and the sum is recalculated as each record for a window arrives. Each intermediate sum value generated is called the state for that window. In other words, state is intermediate data stored based on the records or events that have already arrived. Maintaining state becomes necessary in many operations, not just aggregations. As the records accumulate, depending mainly on the amount of data received and the system, maintaining state can become difficult as time passes.
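As a toy picture of this (plain Python, with made-up records), the state for a windowed sum is one running total per window, updated as each record arrives, even when records arrive out of order:

from collections import defaultdict

# (event_time_in_seconds, amount) pairs arriving one at a time, possibly out of order.
records = [(12, 5.0), (75, 2.0), (30, 1.5), (130, 4.0)]

WINDOW_SECONDS = 60
state = defaultdict(float)   # window start -> running sum, i.e. the state for that window

for event_time, amount in records:
    window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
    state[window_start] += amount    # only the affected window's state is updated

print(dict(state))   # {0: 6.5, 60: 2.0, 120: 4.0}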


When the throughput of incoming data is high, especially because the rate at which the source system generates events is high, our stream processing application and the infrastructure it runs on should be resilient enough to deal with it. Handling and supporting high-throughput data is one of the biggest challenges posed to stream processing, and it is very common in gaming organizations. One common solution is to process the data in shorter intervals, but the efficiency will largely depend on our infrastructure. When the processing does not take the happy path, i.e. it is not able to deal with the load, we also have to determine how to handle workers that might be falling behind.
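In Spark Structured Streaming, for example, that interval can be set through a trigger; this is a minimal sketch, and the 10-second micro-batch interval and the rate source are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-sketch").getOrCreate()
events = spark.readStream.format("rate").option("rowsPerSecond", 1000).load()

(events.writeStream
    .format("console")
    .trigger(processingTime="10 seconds")   # start a new micro-batch every 10 seconds
    .start())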


Some other generic challenges include determining how to update or write to output sinks transactionally as new events arrive, and trying to process each event exactly once despite machine failures.
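As one concrete illustration, Spark Structured Streaming tackles these with checkpointing combined with replayable sources and idempotent sinks. The sketch below is a minimal, assumption-heavy example (the paths and the write logic are made up), not a complete exactly-once recipe:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sink-sketch").getOrCreate()
events = spark.readStream.format("rate").load()

def write_batch(batch_df, batch_id):
    # A real sink would use batch_id to commit transactionally or to skip batches
    # that were already written before a failure, so replays do not duplicate data.
    batch_df.write.mode("overwrite").parquet(f"/tmp/output/batch={batch_id}")

(events.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/checkpoints/sink-sketch")  # progress is tracked here
    .start())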


Conclusion:


This article introduced and discussed many aspects of stream processing. In the follow-up article I will introduce different stream processing models that can be considered as solutions to different requirements. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check it out whenever you need a reference.


Happy Learning! 😀



