
Execution in Apache Spark


Introduction:

In this article, I would like to elaborate on what goes on behind the scenes when we submit a Spark job.

Whenever we submit a Spark job, there are four major entities involved:

  1. Driver

  2. Executors

  3. Spark Session

  4. Cluster Manager

The way in which these entities are arranged and communicate constitutes the Spark internal architecture.

The Driver and Executors together can be called a Spark Application. Let’s take a look at what these components are and then discuss how they communicate with one another.


Cluster Manager:

The Cluster Manager is a pluggable component, meaning there are multiple options (such as Spark’s standalone manager, YARN, Mesos, or Kubernetes) from which we can choose based on our needs. The Cluster Manager is responsible for allocating resources (CPU cores and memory) for our job.
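As a rough illustration, the resources we ask the cluster manager for can be expressed as configuration when we build the session. The property names below are standard Spark settings; the values and application name are placeholders chosen for the example.

from pyspark.sql import SparkSession

# A minimal sketch of requesting resources from the cluster manager.
# The numbers are illustrative, not recommendations.
spark = (
    SparkSession.builder
    .appName("resource-request-example")        # placeholder application name
    .config("spark.executor.instances", "4")    # how many executors to ask for
    .config("spark.executor.cores", "2")        # CPU cores per executor
    .config("spark.executor.memory", "4g")      # memory per executor
    .getOrCreate()
)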


Spark Session:

Spark Session is a multifaceted object that we instantiate in our code and that acts as the entry point to our program. The SparkSession object is the handle with which we define what is to be done. With the SparkSession we can do a variety of operations, from reading data and running SQL to configuring Spark.
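Here is a minimal sketch of instantiating a SparkSession in PySpark; the application name, master URL, and configuration value are placeholders for illustration.

from pyspark.sql import SparkSession

# Entry point to the program: build (or reuse) a SparkSession.
spark = (
    SparkSession.builder
    .appName("example-app")                         # placeholder name shown in the Spark UI
    .master("local[*]")                             # run locally on all cores, for illustration
    .config("spark.sql.shuffle.partitions", "8")    # example of configuring Spark
    .getOrCreate()
)

# The session handle lets us read data and run SQL.
df = spark.range(10)
df.createOrReplaceTempView("numbers")
spark.sql("SELECT COUNT(*) AS n FROM numbers").show()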


Driver Program:

The code that we submit runs as a standalone program; once the Spark Session is instantiated in it, we can call that program the Driver program. The Driver program has four roles:

  1. Maintain information about our Spark application.

  2. Respond to the user program/input.

  3. Devise a plan for execution (illustrated in the sketch after this list).

  4. Schedule work on the Executors.
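As a small, hedged illustration of role 3, df.explain() prints the plan the driver has devised before any task is scheduled; the DataFrame here is an arbitrary example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-example").getOrCreate()

# Define some transformations; nothing executes yet.
df = spark.range(100).filter("id % 2 = 0").groupBy().count()

# Print the driver's devised physical plan without running a job.
df.explain()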


Executors:

Executors are Java processes that run on worker nodes. They are responsible for actually doing the work/operations. Each executor has resources allocated to it. Usually executors live until the end of our Spark Application, but the Driver can ask for an Executor to be shut down under certain conditions (dynamic allocation).
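As a hedged sketch, dynamic allocation can be enabled through configuration when the session is built. The property names below are Spark’s standard dynamic-allocation settings, while the values are placeholders.

from pyspark.sql import SparkSession

# Illustrative sketch: let Spark release idle executors and request new ones as needed.
# Shuffle tracking (or an external shuffle service) is required for this to work.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-example")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")   # idle executors are released after this
    .getOrCreate()
)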




How do these components interact?:

At a high level (a minimal end-to-end sketch follows the list):

  1. When our code is submitted, it runs as a standalone application and its main method is executed.

  2. Once the SparkSession is instantiated in our code, the program becomes the “Driver”.

  3. The Driver contacts the cluster manager (with the help of the SparkSession), asking for resources to run our operations.

  4. The cluster manager launches executors on behalf of the Driver program.

  5. Once the launch is successful, the executors register themselves with the Driver to enable direct communication.

  6. Based on the operations specified in the user program, the Driver devises various plans and schedules work on the executors as tasks.

  7. The executors compute and save/send the results to the Driver.

  8. The Driver program exits when the main method exits or the SparkSession is explicitly stopped in the user program.

  9. The executors are shut down and the resources are released.
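The following is a minimal, hedged PySpark sketch of this lifecycle; the names and numbers are placeholders, and the comments map the lines to the steps above.

from pyspark.sql import SparkSession

# Steps 1-5: the program starts, instantiating the SparkSession makes it the Driver,
# and executors are requested from the cluster manager and register with the Driver.
spark = SparkSession.builder.appName("lifecycle-example").getOrCreate()

df = spark.range(1_000_000)   # defining a transformation; nothing runs yet
total = df.count()            # step 6: an action makes the Driver plan the job and schedule tasks
print(total)                  # step 7: the result comes back to the Driver

# Steps 8-9: stop the session; executors shut down and resources are released.
spark.stop()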

Conclusion:

This article should have given you some idea of how execution works in Spark. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check it out whenever you need a reference.

Happy Learning! 😀



