Skip to main content

Major APIs in Spark


 Introduction:

In this article, we would be uncovering the internal architecture of Spark. If you are not aware of Spark, I would recommend the preceding article as a good place to start.

We know that in a cluster of computers, the nodes can do some work, but how do we say:


  1. What work should be done?

  2. How should it be done?


The answer to the first question varies for different workloads, because the data that each workload deals with will differ and also what we need as a result of the work differs. We specify the work that needs to be done using code, written in one of the languages which support Spark, which has the answers of:


  1. What data needs to be operated on?

  2. What operations are to be performed?

  3. Where do we persist the data/model(in case of ML)? 

  4. What format data should be persisted in?

  5. What are the other miscellaneous tasks the application should perform?


 However, the answer to the second question is twofold. Spark’s ability is not just bridled to assign and coordinate work but also it can control how the work is to be performed. But if we wish, we have the option to specify how exactly a task should be done. These two options are made possible because of two major APIs that Spark provides. Which are:


  1. Structured APIs (high level)

  2. Unstructured APIs (low level)


Each API not only provides abstraction for the data that is stored but also for the operations that can be performed on them.


Before going into the details, it is important to note two points: 

  1. Both these APIs allow us to store distributed immutable collection of data. 

  2. They differ in the control that the developer has over the operations on the data and placement of the data.(Difference in the abstraction) 


Structured APIs:


Structured APIs can be considered high level due to the huge amount of abstraction that is done and the auto-optimization that goes behind the scenes. We are left to operate with simple, verbose commands that describe what to do but not how to do the work. Spark takes care of the “how” in an optimized way. One limitation is that the operations and functionality provided are limited until there is a new addition or alternative that can be found in the newer versions of Spark or from the open source community. Structured APIs can be compared to every other service provider we know. For example a cook, to whom we only say what we want and the cook takes care of the “how”, making the food as best as he/she can with his/her experience and knowledge. It is important to know that the Structured API is built on top of low level Unstructured APIs.


Structured APIs refer to three core types of distributed collections API:

  • Datasets API

  • Dataframes API

  • SQL tables and views


Where Datasets and Dataframes are table-like distributed collections, meaning they have rows and columns with a known structure. The data is split based on rows and distributed within the cluster. SQL tables and views are exactly as the name suggests. They are basically the same as DataFrames, we just execute SQL against them instead of Dataframe code. Structured APIs can be used to manipulate structured, semi-structured and unstructured data. The majority of the structured APIs can be leveraged for both batch and stream processing, with minimum to no effort needed to switch between the two. 



Unstructured APIs:


Unstructured APIs are considered to be low level because it provides extensive power to manipulate our data and as well its placement. Uncle Ben once said “With great power comes great responsibility.”, and he was spot on. Spark will not auto-optimize, we have the responsibility of manually performing any type of optimization. 


Unstructured APIs were the only available APIs till Spark 1.x. Later came the Structured APIs, as mentioned before were built on top of Unstructured APIs. The best practice is to stick to Structured APIs and only drop down to Unstructured APIs for operations/control that is not found in the former. Obvious use of Unstructured APIs would be if we want to maintain and add to an existing Spark legacy code that works on Unstructured APIs. 


Core distributed collections offered by Unstructured APIs are:

  • RDDs

  • Distributed Variables


RDDs expand as Resilient Distributed Datasets, they are immutable collections of distributed data. Meaning once an RDD is created it cannot be modified. We will cover the immutability property of Structured and Unstructured APIs collections in the Transformation and Actions article. 

Distributed variables are read only variables that can be cached on each node which might be used across multiple tasks. So instead of shipping the variable to each node when needed, which creates communication overheads, we can cache it in each node so that the mentioned communications overhead is minimized.


Conclusion:


This article should have given you some idea of the different API that Spark provides. If you have any

suggestions or questions please post it in the comment box. This article, very much like every article

in this blog, will be updated based on comments and as I find better ways of explaining things.

So kindly bookmark this page and checkout whenever you need some reference. 


Happy Learning! 😀



Comments

Popular posts from this blog

Rows in Spark 101

  Every row in a Dataframe is of type Row and this object comes with many methods with which we can interact with the data that it consists of. In this article I will dive deeper into the Row object Row: The objects of Row class can hold multiple values that correspond to columns. We can create Row objects separately or access Row objects from the existing dataframe. The Row class can be imported as follows: from pyspark.sql import Row  There are several methods that can help retrieve row objects from the dataframe. Some of them are listed below: df.collect() : This method returns a list of all rows that are present in the dataframe. The result is returned to the driver, so we should be careful while running this method as the driver might not have sufficient memory to hold all the Row objects.  df.take(num) : This method returns a list of a specific number of records, specified by the argument num that we are allowed to pass. The results are returned to the driver, so ...

Introduction to Structured Streaming

  Structured Streaming is one of the APIs provided by Spark to address stream processing needs. The API is based on the familiar Structured APIs, so the data are treated as tables with only one important distinction which is that it is a table to which data is continuously appended. As seen in the previous article, because of the base, which is the Structured API, we can perform SQL queries over our data. So the amount of technical capability required to construct operations on data is low. Much of the complexity that revolves around the stream processing is abstracted and we are left with configuring and expressing operations at a high level. But still there are concepts of Structured Streaming that one has to be familiar with before starting to work with it. In this article, I would like to provide a basic description of the landscape which will further be explored in the coming articles. To ones who are familiar with Structured APIs and its concepts, there are not many new conce...

Structured API Execution

  Spark Structured API execution can be split up into 3 high level steps: The syntax of the code is checked (common for any code) If the code is found to be valid, Spark converts this to a Logical Plan This Logical plan is then converted to Physical plan. In this process, any optimizations that are possible are applied. Spark executes this Physical Plan(RDD manipulations) on the cluster The first step does not just apply to Spark but to all types of programs. Python checks for syntax of the code written, and fails the job if there is any violation of syntax. The second step is what I would describe as Spark-specific. In this step, a sequence of three types of Logical plans are created, with the most optimal one at the end. Logical plans are abstract, with no clarity on low-level execution details. We could imagine it to be similar to a flow chart that gives a high-level idea about how operations are to be performed and in what order. As the high-level is a good start, a detailed pl...