The last article introduced some common stream processing challenges. In this article, I describe some stream processing designs that can serve as solutions to those challenges.
Programming:
How a stream processing system or application deals with the different aspects of processing is important, and it is dictated by the program the developers write. There are two programming paradigms to consider for stream processing solutions: imperative and declarative programming. In both paradigms, we define what should be done; in imperative programming, we make the extra effort of also describing how each operation should be carried out. How each of these paradigms fits into stream processing is described below.
i.) Imperative Programming
This approach expects us to develop a custom application that tackles all the challenges discussed in the preceding article. The application developer is given full control, and with it total responsibility. Since both rest with the developer, this solution requires expertise: we have to define all the tasks and operations the streaming application needs, and also how they should be performed. This was the approach followed in earlier streaming systems, and it is still used when an application needs full control over the processing of data. The downside and the upside of this approach are the same: crucial operations, such as maintaining state, are governed solely by the application. With an imperative API, we are responsible for tracking state over longer time periods, dropping it after some time to clear up space, and responding differently to duplicate events.
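To make this concrete, here is a minimal imperative sketch in Python. It assumes a hypothetical `read_events()` source that yields events as dicts with `id`, `key`, and `timestamp` fields; the point is that de-duplication, state tracking, and state expiry are all hand-written by the developer.

```python
import time

STATE_TTL_SECONDS = 3600  # assumption: drop per-key state after an hour

state = {}        # key -> (running_count, last_seen_epoch_seconds)
seen_ids = set()  # event ids already processed, for de-duplication

def process(event):
    """Handle one event: de-duplicate, update state, expire old keys."""
    # Respond to duplicate events: here we simply skip them.
    if event["id"] in seen_ids:
        return
    seen_ids.add(event["id"])

    # Track state over time: maintain a running count per key.
    count, _ = state.get(event["key"], (0, None))
    state[event["key"]] = (count + 1, event["timestamp"])

    # Drop stale state after some time to clear up space.
    now = time.time()
    for key in [k for k, (_, ts) in state.items()
                if ts is not None and now - ts > STATE_TTL_SECONDS]:
        del state[key]

for event in read_events():  # hypothetical blocking event source
    process(event)
```

Even this toy version has to make policy decisions (how long to keep state, what "duplicate" means), which is exactly the control, and the burden, that the imperative approach gives you.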
ii.) Declarative Programming
Because a typical stream processing developer might not need all the control and responsibility that the imperative approach provides, there is a need for an alternative. Another reason for a more generic and simpler approach is that many stream processing operations are common across applications, so it is better to standardize how they are carried out and reduce the code that has to be rewritten for the same functionality. As a result, many newer streaming systems provide declarative APIs: our application specifies what to compute but not how to compute it, and the system's processing engine takes care of the "how" part of the computation.
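As an illustration, here is a short declarative sketch using PySpark's Structured Streaming (I cover Spark's streaming solutions in the next article, so take this as a preview rather than a full tour). We only declare the computation, a streaming word count over a local socket source, and the engine decides how to run it, manage state, and recover from failures.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("DeclarativeWordCount").getOrCreate()

# Declare *what* to compute: word counts over a text stream.
# The engine handles the *how*: state, incremental execution, recovery.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")  # assumption: a local test source
         .option("port", 9999)
         .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```

Notice that nothing here says how the running counts are stored, partitioned, or recovered after a crash; that is precisely the "how" the engine takes off our hands.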
Whether to opt for the imperative or the declarative approach depends on the use case. If all requirements can be satisfied declaratively, it is better to do so, with time and expertise as the major deciding factors. If the requirements cannot be met within the means of the declarative approach, one can opt for imperative programming. Imperative programming is the common choice when the functionality a use case demands is not covered by the generic declarative APIs, or when it requires custom operations whose control must stay with the application developer. In my opinion, the main factors in deciding on a programming style should be time, expertise, and the level of control needed. A hybrid approach combining both styles is also possible.
Time-based Processing:
In stream processing systems, there are two notions of time. The first is event time, which denotes the time at which a record was created at the source. The second is processing time, which denotes the time, usually measured by the system clock, at which a record was processed. Based on which of these is used to process records, there are two processing patterns, intuitively known as event-time and processing-time processing.
Event-Time and Processing-Time Processing
For many streaming applications, one of the first concerns is whether the streaming system natively supports event-time processing. Because the majority of stream processing use cases care about when events actually happened and which events followed which, event-time based processing is effectively mandatory for them. With processing-time based processing, such patterns of events might not be recognized, since records that were created sequentially might be processed in a different order. Irrespective of whether we use event time or processing time, there is a chance that records do not arrive in the order they were created (judged by their event timestamps). Under processing-time processing this does not matter, so handling out-of-order data is redundant there; under event-time processing, dealing with out-of-order data becomes essential and requires careful state tracking. There must also be a standard way to determine when all the elements for a given time interval have probably been received; in many streaming systems this mechanism is called a watermark, and based on it the corresponding state is cleared.
Once the system assumes that all elements have arrived, the time interval is closed: any item that arrives later with a timestamp falling into the closed interval is considered late data, and what happens to this late data can be decided by the application programmer. By default, these late records are discarded and are not part of any computation. This part intentionally focused on event-time processing, as it is more prevalent and chances are high that in your case, too, you will opt for event-time processing.
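In Structured Streaming, for example, this mechanism is exposed through `withWatermark`. The sketch below (assuming a streaming DataFrame `events` with an event-time column named `timestamp` and a `key` column) tells the engine how long to wait for out-of-order records before closing a window's state; records arriving later than that are dropped by default, matching the behaviour described above.

```python
from pyspark.sql.functions import window

# Event-time windowed counts with a watermark. State for a 5-minute
# window is kept until the watermark (max event time seen - 10 minutes)
# passes the end of the window; later records for it count as late data.
windowed_counts = (events
    .withWatermark("timestamp", "10 minutes")  # tolerate 10 min lateness
    .groupBy(window(events.timestamp, "5 minutes"), events.key)
    .count())
```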
Execution
The final processing pattern revolves around when the data/events that arrive are processed. There are two options: process events as they arrive, or group them over a small time period and process them as a "micro-batch". The first approach is called continuous execution; the second is called micro-batch execution.
Continuous Execution
In a stream processing system that opts for continuous execution, each node in the system continually listens to messages from other nodes and outputs new updates to its child nodes. To give an example in the traditional MapReduce setting, assume we are performing a map-reduce over an input stream. The node that performs the map reads records one by one from the input source, computation is performed on each record as it arrives, and the result is sent to the node that performs the reduce. The reducer then updates its state whenever it receives a new record. This sequence of steps happens repeatedly for each record. As you can see, this approach gives the lowest possible latency, because each message is processed immediately without waiting for any other record to arrive. The disadvantage is that the overall throughput is low, as the entire workflow is executed for every single record. So, on average, the system spends more time per element with continuous execution than with the micro-batch approach.
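A stripped-down, plain-Python sketch of this record-at-a-time map-reduce might look like the following; `read_records()` is a hypothetical never-ending input source, and the word-count logic stands in for whatever per-record computation the map and reduce nodes perform.

```python
from collections import defaultdict

# Reducer state: running word counts, updated on every single record.
counts = defaultdict(int)

def map_fn(record):
    # "Map" node: computation is performed on each record as it arrives.
    for word in record.split():
        yield (word, 1)

def reduce_fn(word, value):
    # "Reduce" node: updates its state whenever it receives a new pair.
    counts[word] += value

for record in read_records():  # hypothetical never-ending input stream
    for word, value in map_fn(record):
        reduce_fn(word, value)  # forwarded immediately: minimal latency
```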
Micro-Batch Execution
The throughput issue described above could be avoided if records were batched and the workflow performed on an entire batch rather than on every record, and that is the stream processing design that micro-batch systems follow. Micro-batch systems wait for a small interval of time (which differs between workloads) to collect small batches of input data, then process each batch in parallel, similar to the execution of a batch job in Spark. This sequence of operations repeats for successive batches. Because the same optimizations that apply to batch systems can be applied here, micro-batch systems achieve high throughput: nodes process whole batches, so the number of records processed per node per unit of time is higher. The problem with the micro-batch approach is its latency, as each record has to wait for some time before being processed; the record that arrives first in a batch experiences the highest latency. Large-scale streaming applications might still prefer micro-batches because their input arrives at high throughput, so processing it at high throughput is preferable; if we opted for record-by-record processing there, the backlog of records could grow without bound. The micro-batch approach can deliver latencies in the range of 100 ms to 1 s, depending on the application configuration. To achieve the same throughput as the record-by-record approach, micro-batches require fewer nodes, and hence lower cost. The main considerations when choosing the type of execution are therefore the cost of operation and the maximum tolerable latency.
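In Structured Streaming, the batch interval is set with a trigger. A minimal sketch, reusing the `counts` streaming DataFrame from the earlier word-count example, might configure one-second micro-batches like this:

```python
# Micro-batch execution: collect input for a fixed interval, then run
# the whole batch like a small Spark job.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="1 second")  # batch interval: tune per workload
         .start())
query.awaitTermination()
```

The interval is the knob trading latency against throughput and cost: a longer interval means bigger batches, better amortization, and higher latency.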
Conclusion:
This article introduced different stream processing models that can be considered as solutions to different processing requirements. In the next article, I will cover Spark's streaming solutions. If you have any suggestions or questions, please post them in the comment box. This article, like every article on this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check back whenever you need a reference.
Happy Learning! 😀