
Partitions in Spark (Basics)

What are partitions?

Partitions are subsets of the rows/records that make up your data. Each partition is assigned to a core of an executor for processing.

Why do we need partitions/partitioning?

Spark is a data analytics engine designed to process big data quickly. It achieves this by splitting the data into partitions and, as mentioned above, assigning each partition to an executor core for processing. A partition cannot be mapped to more than one executor core, and all partitions are processed in parallel.
It is important to note that a partition is assigned to exactly one executor at any point in time. The inverse is not true: an executor can be associated with more than one partition at a time.
 
To illustrate this, let's say a group of four kids is given an assignment: find the number of occurrences of the word 'is' in a textbook of more than two hundred pages (a pretty lame assignment indeed).

The kids, say A, B, C and D, decide among themselves that each will read and count the word occurrences in only fifty pages, and they will add their counts up at the end: A takes pages 1 to 50, B takes 51 to 100, and so on. By doing this, they ensure not only that each one reads and counts less (making them less prone to error and overload) but also that the total time spent on the assignment goes down. What you saw here was partitioning and parallel execution (these kids are surely brilliant).

  • Textbook == Data
  • Kids == Executors/workers (not in a literal sense)
  • Page == Record
  • 50 pages == Partition size
  • Total number of partitions worked on in parallel == 4
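The kids' scheme maps directly onto partitioned parallel processing. Here is a minimal plain-Python sketch of it (the "textbook" is just a list of page strings and the four "kids" are worker threads; none of this is Spark API):

```python
from concurrent.futures import ThreadPoolExecutor

# The "textbook": 200 pages, each page is a string of words.
textbook = ["this is a page where is appears and then is again"] * 200

# Partition the pages into chunks of 50 (the partition size).
partitions = [textbook[i:i + 50] for i in range(0, len(textbook), 50)]

def count_is(pages):
    # One "kid": count occurrences of the word 'is' in its own pages only.
    return sum(page.split().count("is") for page in pages)

# Four workers each take one partition in parallel, then we add the counts up.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(count_is, partitions))

print(len(partitions), total)  # 4 partitions, 600 occurrences in total
```

Each worker only ever touches its own partition, and the partial results are combined at the end, just like the kids adding up their per-fifty-page counts.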

How are partitions created?

Spark handles the complexities of splitting the data into partitions, but that does not necessarily mean the result will be optimal for every workload. Spark uses partitioning algorithms to decide which record goes into which partition; two common schemes are hash partitioning and range partitioning. These default methods might not suit every workload, and in many cases they can even produce empty partitions (partitions with no records in them). So Spark allows us to determine how the data should be partitioned by implementing a custom partitioner (a class that instructs how records are assigned to partitions). Even though we can determine how records are partitioned, we cannot control which executor a particular record is placed on. Executors might be lost for various reasons, and Spark takes care of recomputing or re-assigning that data on another executor, which is part of the fault tolerance that Spark provides.
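To make this concrete, here is a plain-Python sketch of how a hash partitioner assigns records to partitions and how a custom rule changes that assignment. It mirrors the logic of Spark's hash partitioning and of `RDD.partitionBy(numPartitions, partitionFunc)`, but it is not Spark code itself, and the keys and rules are illustrative:

```python
NUM_PARTITIONS = 4

records = [("apple", 1), ("banana", 2), ("cherry", 3), ("apple", 4)]

def hash_partition(key):
    # Default-style scheme: partition index = hash(key) mod number of partitions.
    return hash(key) % NUM_PARTITIONS

def custom_partition(key):
    # Custom rule: keys starting with 'a' go to partition 0, everything else
    # to partition 1 -- note that partitions 2 and 3 end up empty.
    return 0 if key.startswith("a") else 1

def partition(records, partition_func):
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for key, value in records:
        parts[partition_func(key)].append((key, value))
    return parts

hashed = partition(records, hash_partition)
custom = partition(records, custom_partition)

# Under either scheme, identical keys always land in the same partition.
print(custom)  # [[('apple', 1), ('apple', 4)], [('banana', 2), ('cherry', 3)], [], []]
```

The guarantee that matters for operations like grouping and joining is visible here: the partition function depends only on the key, so both 'apple' records are co-located no matter which scheme is used.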

Is there a restriction on the size of a partition?

The maximum size of a partition when reading file-based sources is controlled by the Spark configuration spark.sql.files.maxPartitionBytes (128 MB by default) and can be altered based on preference.
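For example, the setting can be supplied when building the session. This is a configuration sketch (the 64 MB value and app name are just illustrations, and this setting only affects file-based sources such as Parquet, ORC, CSV and JSON):

```python
from pyspark.sql import SparkSession

# Cap each file-read partition at 64 MB instead of the 128 MB default.
spark = (
    SparkSession.builder
    .appName("partition-size-demo")  # illustrative app name
    .config("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)
    .getOrCreate()
)
```

Smaller values produce more (and smaller) partitions when reading files, which can help parallelism; larger values reduce scheduling overhead for very large scans.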

How do we check the number of partitions for our DataFrame/RDD?

Let's say we have a table of student data which we have loaded into a DataFrame.



We know that DataFrames are built on top of RDDs, so to check the number of partitions we can access the DataFrame's underlying rdd attribute and call its getNumPartitions method.


So our data is split into seven partitions.
To check the same for an RDD, as you might have guessed, call the getNumPartitions method on the RDD itself:


Conclusion:

So far we have covered some basics of partitions and the need for partitioning. This article will be updated further in due course, so keep an eye on this space. My suggestion would be to bookmark this page for reference and check back for updates. There will also be other articles stemming from this one for a more in-depth look. Thank you for reading, and please let me know if you have any questions or suggestions in the comment box. Happy learning 😉


