
Partitions in Spark (Basics)

What are partitions?

Partitions are subsets of the rows/records that make up your data, and each partition is assigned to an executor's core for processing.

Why do we need partitions/partitioning?

Spark is a data analytics engine designed to deal with big data and make processing fast. It achieves this by splitting data into partitions and, as mentioned above, assigning each partition to an executor's core to be processed. A partition cannot be mapped to more than one executor core, and all the partitions are processed in parallel.
It is important to note that one partition is assigned to only one executor at any point in time. The inverse is not true: an executor can be associated with more than one partition at a time.
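For a concrete picture of what this looks like in code, here is a minimal PySpark sketch (the data and partition count are made up for illustration) that splits a small dataset into four partitions; each partition would then be processed as its own task on an executor core:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-basics").getOrCreate()
    sc = spark.sparkContext

    # Ask Spark to split eight records into 4 partitions.
    rdd = sc.parallelize(range(1, 9), numSlices=4)

    # glom() groups the records of each partition into a list, so we can see the split.
    print(rdd.glom().collect())
    # [[1, 2], [3, 4], [5, 6], [7, 8]]  -> one inner list per partition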
 
To explain this clearly, let's say a group of 4 kids was given an assignment of finding the number of occurrences of the word 'is' in a textbook of more than two hundred pages (a pretty lame assignment indeed).

The kids, say A, B, C and D, decide among themselves that each will read and count the word occurrences in only fifty pages, and they will add their counts up at the end: A takes pages 1 to 50, B takes 51 to 100, and so on. By doing this, they ensure not only that each one reads and counts less (making them less prone to error and overload) but also that each kid spends less time on the assignment. What you saw here was partitioning and parallel execution (these kids are surely brilliant).

  • Textbook == Data
  • Kids == Executor/Workers(Not in a literal sense)
  • Page == Record
  • 50 pages == Partition size
  • Total number of partitions worked on in parallel == 4

How are partitions created?

Spark handles the complexities of splitting the data into partitions, but that does not necessarily mean the result will be optimal for every workload. Spark uses partitioning algorithms to decide which record goes into which partition; two of the common ones are Hash Partitioning and Range Partitioning. These default partitioning methods might not be suitable for every workload, and in many workloads they can produce empty partitions (partitions with no records in them). So Spark allows us to determine how the data should be partitioned by implementing a custom Partitioner (a class that instructs how records should be assigned to partitions). Even though we can determine how records are partitioned, we cannot control which executor a partition is placed on, because executors might be lost for various reasons and Spark takes care of making that data available on another executor. This is part of the fault tolerance that Spark provides.
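To make the idea of custom partitioning concrete, here is a minimal PySpark sketch; the pair RDD and the country_partitioner function are made up for this example. RDD.partitionBy takes the desired number of partitions and a function that maps a record's key to a partition index:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("custom-partitioning").getOrCreate()
    sc = spark.sparkContext

    # Hypothetical (key, value) records: (country_code, amount)
    pairs = sc.parallelize([("IN", 10), ("US", 20), ("IN", 5), ("UK", 7)])

    def country_partitioner(key):
        # Send all "IN" records to partition 0 and everything else to partition 1.
        return 0 if key == "IN" else 1

    partitioned = pairs.partitionBy(2, country_partitioner)
    print(partitioned.getNumPartitions())  # 2

Note that this controls which partition a record lands in, not which executor that partition runs on, as discussed above.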

Is there a restriction on the size of a partition?

The maximum size of a partition when reading file-based sources is controlled by the Spark configuration spark.sql.files.maxPartitionBytes and can be altered based on preference.
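As a small sketch, this configuration can be inspected and changed on the SparkSession; the 64MB value below is just an example, not a recommendation:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-size").getOrCreate()

    # Default is 128MB; the value applies when reading file-based sources.
    print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

    # Ask Spark to pack at most ~64MB of file data into each partition.
    spark.conf.set("spark.sql.files.maxPartitionBytes", "64MB")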

How do we check the number of partitions for our DataFrame/RDD?

Let's say we have a table of student data which we have loaded into a DataFrame like this:
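For illustration, here is one hypothetical way such a table could be loaded; the file path and schema below are made up for this sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("students").getOrCreate()

    # Hypothetical students table read from a CSV file.
    students_df = spark.read.csv("/data/students.csv", header=True, inferSchema=True)
    students_df.show(5)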



We know that DataFrames are built on top of RDDs, so to check the number of partitions we can call the getNumPartitions method on the DataFrame's rdd attribute.
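Continuing with the hypothetical students_df from above, the call looks like this:

    # The rdd attribute exposes the DataFrame's underlying RDD,
    # and getNumPartitions returns how many partitions it has.
    students_df.rdd.getNumPartitions()
    # 7 (in this article's example)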


So our data is split into seven partitions.
To check the same for an RDD, as you might have guessed, call the getNumPartitions method on the RDD itself:
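A minimal sketch, again assuming the hypothetical students_df from above:

    students_rdd = students_df.rdd      # or any RDD, e.g. one built with sc.parallelize(...)
    students_rdd.getNumPartitions()     # 7, same as the DataFrame it came from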


Conclusion:

We covered some basics of partitions and why we need them. This article will be further updated in due course, so keep an eye on this space. My suggestion would be to bookmark this page for reference and check back for updates. There will also be other articles stemming from this one for a more in-depth look. Thank you for reading, and please let me know if you have any questions or suggestions in the comment box. Happy learning 😉


