
Spark Execution Modes

This article covers execution modes, which give us control over where the different components of our Spark application are physically located. Spark provides three options:

  1. Cluster mode

  2. Client mode

  3. Local mode


Cluster Mode

A cluster is nothing but a collection of computers, aka nodes, that are networked together. Clusters in the big data ecosystem are used to perform parallel processing of huge data sets. These nodes are not efficient on their own; a cluster manager is needed to bring out the best of the available resources. The cluster manager is responsible for maintaining the nodes, receiving job requests, allocating nodes to jobs, monitoring resources, and many other tasks. A cluster manager daemon (background process) runs on each node of the cluster, so the details and status of each node are collected and decisions can be made based on them. The cluster manager has its own master/worker abstraction: one or more nodes act as masters and the rest are workers. Based on the execution mode configuration, the driver and/or executors are started within the cluster by the cluster manager. Spark currently supports three cluster managers: a simple built-in standalone cluster manager, Apache Mesos, and Hadoop YARN. The number of supported cluster managers continues to grow.
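The cluster manager is chosen through the --master option of spark-submit. The commands below are a sketch of the standard master URL forms for the three cluster managers mentioned above; the host names, ports, and script name are placeholders, not a real deployment:

```shell
# Host names/ports below are illustrative placeholders.
spark-submit --master spark://host:7077 my_job.py   # built-in standalone cluster manager
spark-submit --master mesos://host:5050 my_job.py   # Apache Mesos
spark-submit --master yarn my_job.py                # Hadoop YARN (cluster location read from Hadoop config)
```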


Cluster mode is the most common way of running Spark applications. We submit our Python script to a cluster manager, and the driver process and executor processes are launched within the cluster. The cluster manager then has sole responsibility for maintaining all the processes related to the Spark application. Since the driver is also inside the cluster, data can be transmitted easily. We can retrieve job logs using cluster-manager-specific commands. Remember that each Spark application has only one driver process. Cluster mode is usually used in production applications.
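A cluster-mode submission might look like the following sketch; the master and script name are illustrative, and the log-retrieval command shown is the YARN-specific one:

```shell
# Driver and executors both run inside the cluster; the submitting
# process can exit once the application has been handed over.
spark-submit --master yarn --deploy-mode cluster my_job.py

# Retrieving logs afterwards is cluster-manager specific, e.g. on YARN:
yarn logs -applicationId <application-id>
```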


Client Mode

While using client mode, the client machine that submits the application hosts the driver. The executor processes are started by the cluster manager on the worker nodes and managed by it. So the driver and executors are not colocated on the same machine; the executors communicate with the driver over the network. The client machine that submits the job is usually called an edge node. Client mode is usually used for interactive and debugging workloads, and it is the default deploy mode. If the client running spark-submit (the utility used to run and submit Spark jobs) terminates while the job is in progress, the Spark application terminates as well, resulting in a failed job. The driver logs are accessible on the local machine itself.
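A client-mode submission is a sketch like the one below (script name illustrative); note that because client is the default, the --deploy-mode flag could be omitted:

```shell
# Client mode: the driver runs inside this spark-submit process on the
# edge node, so terminating this process terminates the application.
spark-submit --master yarn --deploy-mode client my_job.py
```

Driver output and logs appear directly on the console of the client machine.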


As the client machine is not part of the cluster, the network connection between the machines plays a big part in the observed performance, because data may have to be moved between these machines over that network.


Local Mode


Local mode is the simplest execution mode that Spark provides. All the Spark application components, the driver and the executors, run on a single machine. You might be wondering how parallel data processing is possible if Spark uses only one machine: it is achieved by using multiple threads on that single machine. This mode is common for local development and is not used in production applications.
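The idea of getting parallelism from threads on one machine can be illustrated with a plain Python sketch. This is an analogy only, not Spark code: the thread pool plays the role of the executor threads that `--master local[4]` would give you, and the final combining step plays the role of the driver.

```python
from concurrent.futures import ThreadPoolExecutor

# Analogy only: in local mode one process hosts both driver and executors,
# and parallelism comes from threads rather than separate machines.
data = list(range(100))
partitions = [data[i::4] for i in range(4)]  # split into 4 "partitions"

def process(partition):
    # each "executor thread" sums its own partition
    return sum(partition)

with ThreadPoolExecutor(max_workers=4) as pool:  # like --master local[4]
    partial_sums = list(pool.map(process, partitions))

total = sum(partial_sums)  # the "driver" combines the partial results
print(total)  # 4950
```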



Execution Mode | Driver                                         | Executors
cluster        | Cluster worker node                            | Cluster worker nodes
client         | Client machine that is not part of the cluster | Cluster worker nodes
local          | Runs on the local JVM                          | Run on the same local JVM as the driver


We can configure the execution mode during submission of our job using the --deploy-mode flag.


spark-submit --deploy-mode cluster ...
spark-submit --deploy-mode client ...

(or)

spark-submit --master local ...


The first form can be used to specify the cluster or client execution mode, and the second form is used for local execution mode.


The Life Cycle of a Spark Application (outside view):


This topic provides an outside view of the life cycle of a Spark application. Assume that a cluster of four nodes, one master node and three worker nodes, is available to run our job. The job is submitted using spark-submit, which will be covered in a separate article. At a high level, spark-submit is a utility to submit and start Spark jobs, just like we type python script_name.py on the command line when we want to run Python scripts.

1. Client Request


The first step in the life cycle is to submit an actual application. In the case of PySpark jobs, this will be a Python script or library. While we submit our job by running spark-submit on a client machine, only resources for the driver are requested from the cluster manager at first. Once the cluster manager accepts the request, it starts the driver process on a worker node. The client process (spark-submit) that submitted the job then exits, and our Spark application is now active in the cluster. If we have specific requirements for the driver, e.g. the amount of memory allocated to it, we can configure them using options of the spark-submit utility.


spark-submit \
  --master <master-url> \
  --deploy-mode cluster \
  --conf <key>=<value> \
  --driver-memory 2g \
  ... # other options
  <application> [application-arguments]


2. Launch

After the driver process starts, the code that we submitted runs. This code must create a SparkSession object, and this object in turn communicates with the cluster manager, requesting resources for the executors. Executor-related configurations can be set in the spark-submit call, similar to driver-related configurations, and the resources are assigned and executor processes started accordingly. Once the executors are up and running, the cluster manager makes sure to send information about the executors' locations to the driver process.
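Executor resources can be requested at submission time. The flags below are standard spark-submit options (--num-executors is specific to YARN), while the values and script name are only illustrative:

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 3 \
  --executor-memory 4g \
  --executor-cores 2 \
  my_job.py
```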


3. Execution

Now that all the resources required for the job are in place, the code that we submitted continues executing. Execution plans are created from our instructions, and the driver schedules "tasks" onto each of the executors. Data is read and moved around within the application (driver + executors) as required. Each executor reports the status of its tasks, and finally success or failure.


4. Completion

Once the driver reaches the end of its execution, it exits with either success or failure. The cluster manager notices this and shuts down the executors related to the exited driver.


Conclusion:


If you have any suggestions or questions, please post them in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check it out whenever you need a reference.


Happy Learning! 😀


