
Spark Execution Modes

This article covers execution modes, which give us control over where the different components of our Spark application are physically located. Spark provides three options:

  1. Cluster mode

  2. Client mode

  3. Local mode


Cluster Mode

A cluster is nothing but a collection of computers, aka nodes, that are networked together. Clusters in the big data ecosystem are used to perform parallel processing of huge data sets. These nodes are not efficient on their own; a cluster manager is needed to bring out the best of the available resources. The cluster manager is responsible for maintaining the nodes, receiving job requests, allocating nodes to jobs, monitoring resources, and many other tasks. A cluster manager daemon (background process) runs on each node of the cluster, so the details and status of each node are collected and decisions can be made based on them. The cluster manager has its own master/worker abstraction: one or more nodes act as masters and the rest are workers. Based on the execution mode configuration, the driver and/or executors are started within the cluster by the cluster manager. Spark currently supports three cluster managers: a simple built-in standalone cluster manager, Apache Mesos, and Hadoop YARN. The number of supported cluster managers continues to grow.
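The cluster manager is chosen through the --master option of spark-submit. The commands below are a sketch of the standard master URL forms for the three cluster managers mentioned above; the host names, ports, and script name are placeholders, not a real deployment:

```shell
# Host names/ports below are illustrative placeholders.
spark-submit --master spark://host:7077 my_job.py   # built-in standalone cluster manager
spark-submit --master mesos://host:5050 my_job.py   # Apache Mesos
spark-submit --master yarn my_job.py                # Hadoop YARN (cluster location read from Hadoop config)
```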


Cluster mode is the most common way of running Spark applications. We submit our Python script to a cluster manager, and the driver process and executor processes are launched within the cluster. The cluster manager then has sole responsibility for maintaining all the processes related to the Spark application. Since the driver is also inside the cluster, data can be transmitted easily. We can retrieve job logs using cluster-manager-specific commands. Remember that each Spark application has only one driver process. Cluster mode is usually used in production applications.
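A cluster-mode submission might look like the following sketch; the master and script name are illustrative, and the log-retrieval command shown is the YARN-specific one:

```shell
# Driver and executors both run inside the cluster; the submitting
# process can exit once the application has been handed over.
spark-submit --master yarn --deploy-mode cluster my_job.py

# Retrieving logs afterwards is cluster-manager specific, e.g. on YARN:
yarn logs -applicationId <application-id>
```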


Client Mode

While using client mode, the client machine that submits the application hosts the driver. The executor processes are started by the cluster manager on the worker nodes and managed by it. So the driver and executors are not colocated on the same machine; the executors communicate with the driver over the network. The client machine that submits the job is usually called an edge node. Client mode is usually used for interactive and debugging workloads, and it is the default deploy mode. If the client running spark-submit (the utility used to run and submit Spark jobs) terminates while the job is in progress, the Spark application terminates as well, resulting in a failed job. The driver logs are accessible on the local machine itself.
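A client-mode submission is a sketch like the one below (script name illustrative); note that because client is the default, the --deploy-mode flag could be omitted:

```shell
# Client mode: the driver runs inside this spark-submit process on the
# edge node, so terminating this process terminates the application.
spark-submit --master yarn --deploy-mode client my_job.py
```

Driver output and logs appear directly on the console of the client machine.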


As the client machine is not part of the cluster, the network connection between the machines plays a big part in the observed performance, because data may have to be moved between these machines over that network.


Local Mode


Local mode is the simplest execution mode that Spark provides. All the Spark application components, the driver and the executors, run on a single machine. You might be wondering how parallel data processing is possible if Spark uses only one machine: it is achieved by using multiple threads on that single machine. This mode is common for local development and is not used in production applications.
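The idea of getting parallelism from threads on one machine can be illustrated with a plain Python sketch. This is an analogy only, not Spark code: the thread pool plays the role of the executor threads that `--master local[4]` would give you, and the final combining step plays the role of the driver.

```python
from concurrent.futures import ThreadPoolExecutor

# Analogy only: in local mode one process hosts both driver and executors,
# and parallelism comes from threads rather than separate machines.
data = list(range(100))
partitions = [data[i::4] for i in range(4)]  # split into 4 "partitions"

def process(partition):
    # each "executor thread" sums its own partition
    return sum(partition)

with ThreadPoolExecutor(max_workers=4) as pool:  # like --master local[4]
    partial_sums = list(pool.map(process, partitions))

total = sum(partial_sums)  # the "driver" combines the partial results
print(total)  # 4950
```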



Execution Mode | Driver                                         | Executors
cluster        | Cluster worker node                            | Cluster worker nodes
client         | Client machine that is not part of the cluster | Cluster worker nodes
local          | Runs on the local JVM                          | Run on the same local JVM as the driver


We can configure the execution mode during submission of our job using the --deploy-mode flag.


spark-submit --deploy-mode cluster ...
spark-submit --deploy-mode client ...

(or)

spark-submit --master local ...


The first form can be used to specify the cluster or client execution mode, and the second form is used for local execution mode.


The Life Cycle of a Spark Application (outside view):


This topic provides an outside view of the life cycle of a Spark application. Assume that a cluster of four nodes, one master node and three worker nodes, is available to run our job. The job is submitted using spark-submit, which will be covered in a separate article. At a high level, spark-submit is a utility to submit and start Spark jobs, just like we type python script_name.py on the command line when we want to run Python scripts.

1. Client Request


The first step in the life cycle is to submit an actual application. In the case of PySpark jobs, this will be a Python script or library. While we submit our job by running spark-submit on a client machine, only resources for the driver are requested from the cluster manager at first. Once the cluster manager accepts the request, it starts the driver process on a worker node. The client process (spark-submit) that submitted the job then exits, and our Spark application is now active in the cluster. If we have specific requirements for the driver, e.g. the amount of memory allocated to it, we can configure them using options of the spark-submit utility.


spark-submit \
  --master <master-url> \
  --deploy-mode cluster \
  --conf <key>=<value> \
  --driver-memory 2g \
  ... # other options
  <application> [application-arguments]


2. Launch

After the driver process starts, the code that we submitted runs. This code must create a SparkSession object, and this object in turn communicates with the cluster manager, requesting resources for the executors. Executor-related configurations can be set in the spark-submit call, similar to driver-related configurations, and the resources are assigned and executor processes started accordingly. Once the executors are up and running, the cluster manager makes sure to send information about the executors' locations to the driver process.
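Executor resources can be requested at submission time. The flags below are standard spark-submit options (--num-executors is specific to YARN), while the values and script name are only illustrative:

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 3 \
  --executor-memory 4g \
  --executor-cores 2 \
  my_job.py
```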


3. Execution

Now that all the resources required for the job are in place, the code that we submitted continues executing. Execution plans are created from our instructions, and the driver schedules "tasks" onto each of the executors. Data is read and moved around within the application (driver + executors) as required. Each executor reports the status of its tasks, and finally success or failure.


4. Completion

Once the driver reaches the end of its execution, it exits with either success or failure. The cluster manager notices this and shuts down the executors related to the exited driver.


Conclusion:


If you have any suggestions or questions, please post them in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check it out whenever you need a reference.


Happy Learning! 😀


