
Apache Spark - An introduction

 

Need for Spark:

With the amount of data forever increasing, the need to process it, so that it becomes information that can be acted upon, increases as well. The computers we are familiar with, for example the one I am typing this article on, are not really suited to processing huge amounts of data. Even if a single computer can tolerate the load, it might not meet the time constraints for the processing. That is where clusters come into the picture.


Clusters are nothing but multiple computers grouped together; each computer that is part of the cluster is called a node. A cluster can in some ways be compared to a company that is ready to serve something to society (analogous to processing data). A company is primarily a collection of employees, and projects of huge size are taken up and completed by groups of employees. This is because we know:


Alone we can do so little; together we can do so much.

                                                                -Helen Keller


Similarly, in the realm of big data processing, clusters take advantage of the combined work done by each computer. Now that the required hardware, a group of computers, is ready to do the work, something is needed to assign and coordinate work between the machines. That is where Spark, designed primarily for big data analytical workloads, comes in.


What is Spark?


Spark is nothing but a unified compute engine and a set of libraries for assigning and coordinating parallel data processing tasks.


Let us dissect the definition given above, starting with “unified”, which is one of Spark’s design principles. In the realm of big data, analytical workloads are diverse, and some of the common ones are:


  1. Batch Processing

  2. Stream Processing

  3. Machine Learning


Spark is a compute engine that services these different workloads through consistent APIs. Since it acts as a one-stop shop for different needs, it can be described as unified.



Fig.1

Just like your local supermarket: it sells not only groceries but also stationery, crockery and more. It is a unified market that addresses the different needs of its customers.


The next part of the definition was “compute engine”. As the name suggests, Spark is just a compute engine: it does not persist data after the work is done. This might not seem intuitive to people familiar with earlier big data software platforms such as Hadoop, which came as a close coupling of storage and compute systems. Spark, by contrast, decouples storage and compute: load the data when there is a need, work on it, and then persist the results elsewhere. Since Spark is just a compute engine, a powerful one of course, it comes with support for many storage systems, including Hadoop's. Many more storage systems are supported, either built in or extended by the open source community.


Now let’s dig into the last part of the definition: “libraries”. If you are familiar with programming, you are likely aware of libraries. Spark's libraries are built on its design principles, a major one of which was discussed above. They are not limited to the ones that come pre-packaged; the Spark ecosystem benefits every day from its open source community. Out of the box, Spark provides libraries for:

Library                                   Workload
SparkSQL                                  SQL and structured data
MLlib                                     Machine learning
Spark Streaming / Structured Streaming    Stream processing
GraphX                                    Graph analytics


Conclusion:


This article should have given you an idea of the need for Spark and what exactly Spark is. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check it out whenever you need a reference.


Happy Learning! 😀






