
Execution in Apache Spark


Introduction:

In this article, I would like to elaborate on what goes on behind the scenes when we submit a Spark job.

Whenever we submit a Spark job, there are four major entities involved:

  1. Driver

  2. Executors

  3. Spark Session

  4. Cluster Manager

The way in which these entities are arranged and communicate constitutes the Spark internal architecture.

The Driver and Executors together can be called a Spark Application. Let’s take a look at what these components are and then discuss how they communicate with one another.


Cluster Manager:

The Cluster Manager is a pluggable component, meaning there are multiple options (such as Spark’s standalone manager, YARN, Mesos, or Kubernetes) from which we can choose based on our needs. The Cluster Manager is responsible for allocating resources (CPU cores and memory) for our job.
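As a rough illustration, the resources we ask the cluster manager for can be expressed as configuration when we build the session. The property names below are standard Spark settings; the values and application name are placeholders chosen for the example.

from pyspark.sql import SparkSession

# A minimal sketch of requesting resources from the cluster manager.
# The numbers are illustrative, not recommendations.
spark = (
    SparkSession.builder
    .appName("resource-request-example")        # placeholder application name
    .config("spark.executor.instances", "4")    # how many executors to ask for
    .config("spark.executor.cores", "2")        # CPU cores per executor
    .config("spark.executor.memory", "4g")      # memory per executor
    .getOrCreate()
)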


Spark Session:

Spark Session is a multifaceted object that we instantiate in our code and that acts as the entry point to our program. The SparkSession object is the handle with which we define what is to be done. With the SparkSession we can do a variety of operations, from reading data and running SQL to configuring Spark.
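Here is a minimal sketch of instantiating a SparkSession in PySpark; the application name, master URL, and configuration value are placeholders for illustration.

from pyspark.sql import SparkSession

# Entry point to the program: build (or reuse) a SparkSession.
spark = (
    SparkSession.builder
    .appName("example-app")                         # placeholder name shown in the Spark UI
    .master("local[*]")                             # run locally on all cores, for illustration
    .config("spark.sql.shuffle.partitions", "8")    # example of configuring Spark
    .getOrCreate()
)

# The session handle lets us read data and run SQL.
df = spark.range(10)
df.createOrReplaceTempView("numbers")
spark.sql("SELECT COUNT(*) AS n FROM numbers").show()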


Driver Program:

The code that we submit runs as a standalone program; once the Spark Session is instantiated in it, we can call that program the Driver program. The Driver program has four roles:

  1. Maintain information about our Spark application.

  2. Respond to the user program/input.

  3. Devise a plan for execution (illustrated in the sketch after this list).

  4. Schedule work on the Executors.
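As a small, hedged illustration of role 3, df.explain() prints the plan the driver has devised before any task is scheduled; the DataFrame here is an arbitrary example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-example").getOrCreate()

# Define some transformations; nothing executes yet.
df = spark.range(100).filter("id % 2 = 0").groupBy().count()

# Print the driver's devised physical plan without running a job.
df.explain()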


Executors:

Executors are Java processes that run on worker nodes. They are responsible for actually doing the work/operations. Each executor has resources allocated to it. Usually executors live until the end of our Spark Application, but the Driver can ask for an Executor to be shut down under certain conditions (dynamic allocation).
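As a hedged sketch, dynamic allocation can be enabled through configuration when the session is built. The property names below are Spark’s standard dynamic-allocation settings, while the values are placeholders.

from pyspark.sql import SparkSession

# Illustrative sketch: let Spark release idle executors and request new ones as needed.
# Shuffle tracking (or an external shuffle service) is required for this to work.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-example")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")   # idle executors are released after this
    .getOrCreate()
)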




How do these components interact?:

At a high level (a minimal end-to-end sketch follows the list):

  1. When our code is submitted, it runs as a standalone application and its main method is executed.

  2. Once the SparkSession is instantiated in our code, the program becomes the “Driver”.

  3. The Driver contacts the cluster manager (with the help of the SparkSession), asking for resources to run our operations.

  4. The cluster manager launches executors on behalf of the Driver program.

  5. Once the launch is successful, the executors register themselves with the Driver to enable direct communication.

  6. Based on the operations specified in the user program, the Driver devises various plans and schedules work on the executors as tasks.

  7. The executors compute and save/send the results to the Driver.

  8. The Driver program exits when the main method exits or the SparkSession is explicitly stopped in the user program.

  9. The executors are shut down and the resources are released.
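The following is a minimal, hedged PySpark sketch of this lifecycle; the names and numbers are placeholders, and the comments map the lines to the steps above.

from pyspark.sql import SparkSession

# Steps 1-5: the program starts, instantiating the SparkSession makes it the Driver,
# and executors are requested from the cluster manager and register with the Driver.
spark = SparkSession.builder.appName("lifecycle-example").getOrCreate()

df = spark.range(1_000_000)   # defining a transformation; nothing runs yet
total = df.count()            # step 6: an action makes the Driver plan the job and schedule tasks
print(total)                  # step 7: the result comes back to the Driver

# Steps 8-9: stop the session; executors shut down and resources are released.
spark.stop()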

Conclusion:

This article should have given you some idea of how execution works in Spark. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check it out whenever you need a reference.

Happy Learning! 😀



