This article covers execution modes, which give us control over where the different components of our Spark application are physically located when it runs. Spark provides three options:
Cluster mode
Client mode
Local mode
Cluster Mode
A cluster is simply a collection of computers, aka nodes, that are networked together. In the big data ecosystem, clusters are used to process huge data sets in parallel. The nodes are not effective on their own; a cluster manager is needed to make the best use of the available resources. The cluster manager is responsible for maintaining the nodes, receiving job requests, allocating nodes to jobs, monitoring resources and many other tasks. It runs a daemon (background process) on each node of the cluster, so the details and status of every node are collected and decisions can be made based on them. The cluster manager has its own master/worker abstraction: one or more nodes act as masters and the rest are workers. Based on the execution mode configuration, the driver and/or executors are started within the cluster by the cluster manager. Spark currently supports three cluster managers: a simple built-in standalone cluster manager, Apache Mesos and Hadoop YARN, and the number of supported cluster managers continues to grow.
Cluster mode is the most common way of running Spark applications. We submit our Python script to a cluster manager, and both the driver process and the executor processes are launched within the cluster. The cluster manager has sole responsibility for maintaining all the processes related to the Spark application. Since the driver is also inside the cluster, data can be transmitted easily. Job logs can be retrieved using cluster-manager-specific commands. Remember that each Spark application has exactly one driver process. Cluster mode is the usual choice for production applications.
Client Mode
In client mode, the client machine that submits the application hosts the driver, while the executor processes are started and managed by the cluster manager on the worker nodes. The driver and executors are therefore not colocated on the same machine: the executors communicate with a driver running elsewhere. The client machine that submits the job is usually called an edge node. Client mode is typically used for interactive debugging sessions, and it is the default deployment mode. If the client running spark-submit (the utility to run and submit Spark jobs) terminates while the job is in progress, the Spark application terminates as well, resulting in a failed job. The driver logs are accessible on the local machine itself.
Because the client machine is not part of the cluster, the network connection between it and the cluster plays a big part in the observed performance: data may have to move between these machines, and every such transfer goes over that network.
Local Mode
Local mode is the simplest execution mode that Spark provides. All the components of the Spark application, the driver and the executors, run on a single machine. You might wonder how parallel data processing is possible if Spark uses only one machine: it is achieved with multiple threads on that machine. This mode is popular for local development and is not used in production applications.
We can configure the execution mode when submitting a job, using the --deploy-mode flag:
spark-submit --deploy-mode cluster/client ...
(or)
spark-submit --master local
The first form selects cluster or client execution mode; the second selects local execution mode.
The Life Cycle of a Spark Application (outside view):
This topic provides an outside view of the life cycle of a Spark application. Assume that a cluster of four nodes, one driver node and three worker nodes, is available to run our job. The job is submitted using spark-submit, which will be covered in a separate article. At a high level, spark-submit is a utility to submit and start Spark jobs, just like typing python script_name.py on the command line to run a Python script.
1. Client Request
The first step in the life cycle is submitting an actual application; for PySpark jobs this is a Python script or library. When we submit the job by running spark-submit on a client machine, only resources for the driver are requested from the cluster manager. Once the cluster manager accepts the request, it starts the driver process on a worker node. The client process (spark-submit) that submitted the job then exits, and our Spark application is now live on the cluster. If the driver has specific requirements, e.g. the amount of memory allocated to the driver, we can configure them using options of the spark-submit utility.
spark-submit \
  --master <master-url> \
  --deploy-mode cluster \
  --conf <key>=<value> \
  ... # other options
  <application> [application-arguments]
2. Launch
After the driver process starts, the code that we submitted runs. This code must create a SparkSession object, which in turn communicates with the cluster manager to request resources for the executors. Executor-related configuration can be supplied in the spark-submit call, just like driver-related configuration, and resources are assigned and executor processes started accordingly. Once the executors are up and running, the cluster manager sends the executors' locations to the driver process.
3. Execution
Now that all the resources the job needs are in place, the code that we submitted continues executing. Execution plans are created from its instructions, and the driver schedules "tasks" onto each of the executors. Data is read and moved around within the application (driver + executors) as required. Each executor reports the status of its tasks and, finally, success or failure.
4. Completion
Once the driver reaches the end of execution, it exits with either success or failure. The cluster manager notices this and shuts down the executors belonging to the exited driver.
Conclusion:
If you have any suggestions or questions, please post them in the comment box. This article, very much like every article on this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check it whenever you need a reference.
Happy Learning! 😀