
Apache Spark - An introduction

 

Need for Spark:

With the amount of data forever increasing, the need to process it, so that it becomes information that can be acted upon, increases as well. The computers we are familiar with, for example the one I am typing this article on, are not really suitable for processing huge amounts of data. Even if a single computer can tolerate the load, it might not meet the time constraints of the processing. That is where clusters come into the picture.


Clusters are nothing but multiple computers grouped together. Each computer that is part of the cluster is called a node. A cluster can, in some ways, be compared to a company that is ready to serve society (analogous to processing data). A company is primarily a collection of employees, and projects of huge size are taken up and completed by groups of employees. This is because we know:


Alone we can do so little; together we can do so much.

                                                                -Helen Keller


Similarly, in the realm of big data processing, clusters take advantage of the combined work done by each computer. Now that the required hardware, a group of computers, is ready to do the work, something is needed to assign and coordinate the work between those machines. That is where Spark comes in, primarily designed for big data analytical workloads.


What is Spark?


Spark is nothing but a unified compute engine and a set of libraries for assigning and coordinating parallel data processing tasks.


Let us dissect the definition given above, starting with “unified”, which is one of Spark’s design principles. In the realm of big data, analytical workloads are diverse; some of the common ones are:


  1. Batch Processing

  2. Stream Processing

  3. Machine Learning


Spark is a compute engine that services all of these workloads through consistent APIs. Since it acts like a one-stop shop for different needs, it can be described as unified.



Fig.1

Just like your local supermarket, which sells not only groceries but also stationery, crockery and more, Spark is a unified shop that addresses the different needs of its customers.
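
To make the “unified” idea a little more concrete, here is a minimal PySpark sketch (assuming a local Spark installation); the file name and column names are hypothetical. The same DataFrame abstraction serves plain batch processing and SQL queries through Spark SQL.

```python
# A minimal sketch of Spark's unified API (PySpark assumed).
# "sales.csv", "region" and "amount" are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-demo").getOrCreate()

# Batch processing: load a CSV file into a DataFrame.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Spark SQL: the very same DataFrame can be queried with SQL.
sales.createOrReplaceTempView("sales")
spark.sql(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
).show()

spark.stop()
```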


The next part of the definition was “compute engine”. As the name suggests, Spark is just a compute engine: it does not persist data after the work is done. This might not seem intuitive to people who are not familiar with earlier big data platforms such as Hadoop, which shipped as a close coupling of storage and compute systems. Spark’s design both results from and enables the decoupling of storage and compute: load the data when there is a need, work on it, and then persist the results elsewhere. Since Spark is just a compute engine, a powerful one obviously, it comes with support for many storage systems, including Hadoop’s, and many more are supported either built-in or through extensions from the open source community.
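
As a rough illustration of this load, work and persist pattern, the sketch below reads from one storage system and writes the results to another; the HDFS and S3 paths, and the “timestamp” field, are purely hypothetical placeholders.

```python
# A sketch of Spark as a pure compute engine: storage is external on both ends.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("compute-only-demo").getOrCreate()

# Load: pull raw JSON data from one storage system (a hypothetical HDFS path).
events = spark.read.json("hdfs:///raw/events/")

# Work: intermediate data lives only with the executors while the job runs.
# Assumes the JSON records carry a "timestamp" field.
daily_counts = events.groupBy(F.to_date("timestamp").alias("day")).count()

# Persist: write the results to a different storage system (a hypothetical S3 bucket).
daily_counts.write.mode("overwrite").parquet("s3a://analytics-bucket/daily_counts/")

spark.stop()
```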


Now let’s dig into the last part of the definition, “libraries”. If you are familiar with programming, then you are probably aware of libraries. Spark’s libraries are built on its design principles, a major one of which was discussed above. They are not limited to the ones that come pre-packaged; the Spark ecosystem benefits every day from its open source community. Out of the box, Spark provides libraries for:

Library                                     | Workload
--------------------------------------------|--------------------------
SparkSQL                                    | SQL and structured data
MLlib                                       | Machine learning
Spark Streaming / Structured Streaming API  | Stream processing
GraphX                                      | Graph analytics
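
As one example of these libraries in action, here is a minimal Structured Streaming sketch, the classic running word count over a socket; the host and port are arbitrary values chosen for the example.

```python
# A minimal Structured Streaming sketch: running word count over a socket.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Treat the incoming stream as an unbounded table of text lines.
lines = (
    spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# Split each line into words and maintain a running count per word.
words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously write the updated counts to the console sink.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```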


Conclusion:


This article should have given you an idea of the need for Spark and what exactly Spark is. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check it out whenever you need a reference.


Happy Learning! 😀






