
Apache Spark - An introduction

 

Need for Spark:

With the amount of data forever increasing, the need to process it, so that it becomes information that can be acted upon, increases as well. The computers we are familiar with, for example the one I am typing this article on, are not really suited to processing huge amounts of data. Even if a single computer can tolerate the load, it might not meet the time constraints for the processing. That is where clusters come into the picture.


Clusters are nothing but multiple computers grouped together; each computer that is part of the cluster is called a node. A cluster can in some ways be compared to a company that is ready to serve something to society (analogous to processing data). A company is primarily a collection of employees, and projects of huge size are taken up and completed by groups of employees. This is because we know:


Alone we can do so little; together we can do so much.

                                                                -Helen Keller


Similarly, in the realm of big data processing, clusters take advantage of the combined work done by each computer. Now that the required hardware, a group of computers, is ready to do the work, something is needed to assign and coordinate work between the machines. That is where Spark, designed primarily for big data analytical workloads, comes in.


What is Spark?


Spark is nothing but a unified compute engine and a set of libraries for assigning and coordinating parallel data processing tasks.


Let us dissect the definition given above, starting with “unified”, which is one of Spark’s design principles. In the realm of big data, analytical workloads are diverse, and some of the common ones are:


  1. Batch Processing

  2. Stream Processing

  3. Machine Learning


Spark is a compute engine that services these different workloads through consistent APIs. Since it acts as a one-stop shop for different needs, it can be described as unified.



Fig.1

Just like your local supermarket: it sells not only groceries but also stationery, crockery and more. It is a unified market that addresses the different needs of its customers.


The next part of the definition was “compute engine”. As the name suggests, Spark is just a compute engine: it does not persist data after the work is done. This might not seem intuitive to people familiar with earlier big data software platforms such as Hadoop, which came as a close coupling of storage and compute systems. Spark, by contrast, decouples storage and compute: load the data when there is a need, work on it, and then persist the results elsewhere. Since Spark is just a compute engine, a powerful one of course, it comes with support for many storage systems, including Hadoop's. Many more storage systems are supported, either built in or extended by the open source community.


Now let’s dig into the last part of the definition: “libraries”. If you are familiar with programming, you are likely aware of libraries. Spark's libraries are built on its design principles, a major one of which was discussed above. They are not limited to the ones that come pre-packaged; the Spark ecosystem benefits every day from its open source community. Out of the box, Spark provides libraries for:

Library                                   Workload
SparkSQL                                  SQL and structured data
MLlib                                     Machine learning
Spark Streaming / Structured Streaming    Stream processing
GraphX                                    Graph analytics


Conclusion:


This article should have given you an idea of the need for Spark and what exactly Spark is. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check it out whenever you need a reference.


Happy Learning! 😀






