
Partitions in Spark (Basics)

What are partitions?

Partitions are subsets of the rows/records that make up your data. Each partition is assigned to a core of an executor for processing.

Why do we need partitions/partitioning?

Spark is a data analytics engine designed to process big data quickly. It achieves this by splitting the data into partitions and, as mentioned above, assigning each partition to an executor core for processing. A partition cannot be mapped to more than one executor core, and all partitions are processed in parallel.
It is important to note that a partition is assigned to exactly one executor at any point in time. The inverse is not true: an executor can be associated with more than one partition at a time.
 
To illustrate this, let's say a group of four kids is given an assignment: find the number of occurrences of the word 'is' in a textbook of more than two hundred pages (a pretty lame assignment indeed).

The kids, say A, B, C and D, decide among themselves that each will read and count the word occurrences in only fifty pages, and they will add their counts up at the end: A takes pages 1 to 50, B takes 51 to 100, and so on. By doing this, they ensure not only that each one reads and counts less (making them less prone to error and overload) but also that the total time spent on the assignment goes down. What you saw here was partitioning and parallel execution (these kids are surely brilliant).

  • Textbook == Data
  • Kids == Executors/workers (not in a literal sense)
  • Page == Record
  • 50 pages == Partition size
  • Total number of partitions worked on in parallel == 4
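The kids' scheme maps directly onto partitioned parallel processing. Here is a minimal plain-Python sketch of it (the "textbook" is just a list of page strings and the four "kids" are worker threads; none of this is Spark API):

```python
from concurrent.futures import ThreadPoolExecutor

# The "textbook": 200 pages, each page is a string of words.
textbook = ["this is a page where is appears and then is again"] * 200

# Partition the pages into chunks of 50 (the partition size).
partitions = [textbook[i:i + 50] for i in range(0, len(textbook), 50)]

def count_is(pages):
    # One "kid": count occurrences of the word 'is' in its own pages only.
    return sum(page.split().count("is") for page in pages)

# Four workers each take one partition in parallel, then we add the counts up.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(count_is, partitions))

print(len(partitions), total)  # 4 partitions, 600 occurrences in total
```

Each worker only ever touches its own partition, and the partial results are combined at the end, just like the kids adding up their per-fifty-page counts.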

How are partitions created?

Spark handles the complexities of splitting the data into partitions, but that does not necessarily mean the result will be optimal for every workload. Spark uses partitioning algorithms to decide which record goes into which partition; two common schemes are hash partitioning and range partitioning. These default methods might not suit every workload, and in many cases they can even produce empty partitions (partitions with no records in them). So Spark allows us to determine how the data should be partitioned by implementing a custom partitioner (a class that instructs how records are assigned to partitions). Even though we can determine how records are partitioned, we cannot control which executor a particular record is placed on. Executors might be lost for various reasons, and Spark takes care of recomputing or re-assigning that data on another executor, which is part of the fault tolerance that Spark provides.
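To make this concrete, here is a plain-Python sketch of how a hash partitioner assigns records to partitions and how a custom rule changes that assignment. It mirrors the logic of Spark's hash partitioning and of `RDD.partitionBy(numPartitions, partitionFunc)`, but it is not Spark code itself, and the keys and rules are illustrative:

```python
NUM_PARTITIONS = 4

records = [("apple", 1), ("banana", 2), ("cherry", 3), ("apple", 4)]

def hash_partition(key):
    # Default-style scheme: partition index = hash(key) mod number of partitions.
    return hash(key) % NUM_PARTITIONS

def custom_partition(key):
    # Custom rule: keys starting with 'a' go to partition 0, everything else
    # to partition 1 -- note that partitions 2 and 3 end up empty.
    return 0 if key.startswith("a") else 1

def partition(records, partition_func):
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for key, value in records:
        parts[partition_func(key)].append((key, value))
    return parts

hashed = partition(records, hash_partition)
custom = partition(records, custom_partition)

# Under either scheme, identical keys always land in the same partition.
print(custom)  # [[('apple', 1), ('apple', 4)], [('banana', 2), ('cherry', 3)], [], []]
```

The guarantee that matters for operations like grouping and joining is visible here: the partition function depends only on the key, so both 'apple' records are co-located no matter which scheme is used.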

Is there a restriction on the size of a partition?

The maximum size of a partition when reading file-based sources is controlled by the Spark configuration spark.sql.files.maxPartitionBytes (128 MB by default) and can be altered based on preference.
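For example, the setting can be supplied when building the session. This is a configuration sketch (the 64 MB value and app name are just illustrations, and this setting only affects file-based sources such as Parquet, ORC, CSV and JSON):

```python
from pyspark.sql import SparkSession

# Cap each file-read partition at 64 MB instead of the 128 MB default.
spark = (
    SparkSession.builder
    .appName("partition-size-demo")  # illustrative app name
    .config("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)
    .getOrCreate()
)
```

Smaller values produce more (and smaller) partitions when reading files, which can help parallelism; larger values reduce scheduling overhead for very large scans.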

How do we check the number of partitions for our DataFrame/RDD?

Let's say we have a table of student data which we have loaded into a DataFrame.



We know that DataFrames are built on top of RDDs, so to check the number of partitions we can access the DataFrame's underlying rdd attribute and call its getNumPartitions method.


So our data is split into seven partitions.
To check the same for an RDD, as you might have guessed, call the getNumPartitions method on the RDD itself:


Conclusion:

So far we have covered some basics of partitions and the need for partitioning. This article will be updated further in due course, so keep an eye on this space. My suggestion would be to bookmark this page for reference and check back for updates. There will also be other articles stemming from this one for a more in-depth look. Thank you for reading, and please let me know if you have any questions or suggestions in the comment box. Happy learning 😉


