Skip to main content

Execution in Apache Spark


Introduction:

In this article I would like to elaborate on what goes behind the scenes when we submit a Spark job. 

Whenever we submit a Spark job, there are four major entities involved:

  1. Driver

  2. Executors

  3. Spark Session

  4. Cluster Manager

Th way in which these entities are arranged and communicate constitute the Spark internal architecture.

The Driver and Executors together can be called a Spark Application. Let’s take a look at what these components are and then talk about what communication goes on between these components.


Cluster Manager:

Cluster Manager is a pluggable component, meaning there are multiple options from which we can choose based on our needs.Cluster Manager is responsible for allocating resources(CPU) for our job.


Spark Session:

Spark Session is an multifaceted object that we instantiate in our code, that acts as an entry point to our program. Spark session object is the handle with which we define what is to be done. With Spark Session we can do a variety of operations from reading, performing SQL to configuring Spark. 


Driver Program:

The code that we submit becomes a standalone program, when the Spark Session is instantiated in our code we can call the program as Driver program. Driver program has three roles:

  1. Maintain information about our spark application

  2. Respond to user program/input.

  3. Devise plan for execution

  4. Schedule work on the Executors


Executors:

Executors are Java processes that run on worker nodes. They are responsible for actually doing work/operations. Each executor has resources allocated to it. Usually executors live till the end of our Spark Application. But the Driver can ask for the Executor to be shut down based on some conditions(Dynamic allocation).


Fig 1


How do these components interact?:

On a high-level:

  1. When our code is submitted it creates a standalone application and the main method of our code is executed.

  2. Once the SparkSession is instantiated in our code, the program becomes the “Driver”.

  3. The Driver will contact the cluster manager(with the help of SparkSession), asking for resources to run your operations.

  4. Cluster manager launches executor on behalf of the Driver program.

  5. Once the launch is successful, executors register themselves to the driver to enable  direct communication.

  6. Based on the operations specified in the user program(operations), the driver devises various plans and schedules work on the executors as tasks.

  7. The executors compute and save/send the results to the Driver.

  8. The driver program exits if the main method exits or the spark session is explicitly stopped in the user program.

  9. The executors are shut down and the resources are released.

Conclusion:

This article should have given you some idea of execution in Spark. If you have any suggestions or questions please post it in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and checkout whenever you need some reference. 

Happy Learning! 😀




Comments

Popular posts from this blog

Rows in Spark 101

  Every row in a Dataframe is of type Row and this object comes with many methods with which we can interact with the data that it consists of. In this article I will dive deeper into the Row object Row: The objects of Row class can hold multiple values that correspond to columns. We can create Row objects separately or access Row objects from the existing dataframe. The Row class can be imported as follows: from pyspark.sql import Row  There are several methods that can help retrieve row objects from the dataframe. Some of them are listed below: df.collect() : This method returns a list of all rows that are present in the dataframe. The result is returned to the driver, so we should be careful while running this method as the driver might not have sufficient memory to hold all the Row objects.  df.take(num) : This method returns a list of a specific number of records, specified by the argument num that we are allowed to pass. The results are returned to the driver, so ...

Introduction to Structured Streaming

  Structured Streaming is one of the APIs provided by Spark to address stream processing needs. The API is based on the familiar Structured APIs, so the data are treated as tables with only one important distinction which is that it is a table to which data is continuously appended. As seen in the previous article, because of the base, which is the Structured API, we can perform SQL queries over our data. So the amount of technical capability required to construct operations on data is low. Much of the complexity that revolves around the stream processing is abstracted and we are left with configuring and expressing operations at a high level. But still there are concepts of Structured Streaming that one has to be familiar with before starting to work with it. In this article, I would like to provide a basic description of the landscape which will further be explored in the coming articles. To ones who are familiar with Structured APIs and its concepts, there are not many new conce...

Structured API Execution

  Spark Structured API execution can be split up into 3 high level steps: The syntax of the code is checked (common for any code) If the code is found to be valid, Spark converts this to a Logical Plan This Logical plan is then converted to Physical plan. In this process, any optimizations that are possible are applied. Spark executes this Physical Plan(RDD manipulations) on the cluster The first step does not just apply to Spark but to all types of programs. Python checks for syntax of the code written, and fails the job if there is any violation of syntax. The second step is what I would describe as Spark-specific. In this step, a sequence of three types of Logical plans are created, with the most optimal one at the end. Logical plans are abstract, with no clarity on low-level execution details. We could imagine it to be similar to a flow chart that gives a high-level idea about how operations are to be performed and in what order. As the high-level is a good start, a detailed pl...