Skip to main content

Transformations in Spark


Now that we know there are two types of operations in Spark and one of them is Transformations, let's dive deeper into the types of Transformations.


Transformations are operations that definitely need optimizations when working with large datasets. When we chain multiple transformations together, since transformations are lazily evaluated, Spark can analyze what needs to be done and find optimized ways of executing them.


Transformations can be classified into two types based on their dependencies:

1. Narrow

2. Wide


Narrow Transformations:


A transformation can be classified as narrow when the resulting partition after an operation results from a single source partition. The output partition is dependent on only one input partition, hence the name narrow.

Let’s assume some transformation that is to be applied to source data which has two input partitions. If the transformation is a Narrow Transformation, then o/p partition 1 can be generated with just the i/p partition 1 and similarly for o/p partition 2. You could imagine weight loss as a Narrow Transformation in real life. At the end of the weight loss, there will be only one person but a transformed one.



Some of the examples for Narrow Transformations are: filter()/where(), contains().


Wide Transformations:


A Transformation can be classified as wide when the resulting partition after an operation results from records of multiple partitions. The output partition is now dependent on many input partitions, hence the name wide.



Fig 2

In the above diagram, o/p partition 1 cannot be generated from just applying “some transformation” to i/p partition1. Records in O/p partition 1 depend on both i/p partition 1 and i/p partition 2. Our beliefs can be considered to be the result of wide transformation involving wide dependencies. Beliefs might be supported by various references, these references are then  transformed to one solid belief. 


In the case of data, for wide transformations the data in one node needs to be shared with another in order to contribute to the generation of records of an o/p partition. The act of data being moved from one node to another is called shuffling. Data is transferred over the network of workers. The cost movement is directly dependent on the size of data being shuffled and also the frequency at which it is shuffled. Shuffle will be covered in detail in another article.

 

Some of the transformations which require shuffling is: groupBy(),join()

Conclusion:


I believe now you should have a basic understanding of schemas in Spark DataFrame API and how to create or define a schema. If you have any suggestions or questions please post it in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and checkout whenever you need some reference. 


Happy Learning! 😀




Comments

Popular posts from this blog

Rows in Spark 101

  Every row in a Dataframe is of type Row and this object comes with many methods with which we can interact with the data that it consists of. In this article I will dive deeper into the Row object Row: The objects of Row class can hold multiple values that correspond to columns. We can create Row objects separately or access Row objects from the existing dataframe. The Row class can be imported as follows: from pyspark.sql import Row  There are several methods that can help retrieve row objects from the dataframe. Some of them are listed below: df.collect() : This method returns a list of all rows that are present in the dataframe. The result is returned to the driver, so we should be careful while running this method as the driver might not have sufficient memory to hold all the Row objects.  df.take(num) : This method returns a list of a specific number of records, specified by the argument num that we are allowed to pass. The results are returned to the driver, so ...

Introduction to Structured Streaming

  Structured Streaming is one of the APIs provided by Spark to address stream processing needs. The API is based on the familiar Structured APIs, so the data are treated as tables with only one important distinction which is that it is a table to which data is continuously appended. As seen in the previous article, because of the base, which is the Structured API, we can perform SQL queries over our data. So the amount of technical capability required to construct operations on data is low. Much of the complexity that revolves around the stream processing is abstracted and we are left with configuring and expressing operations at a high level. But still there are concepts of Structured Streaming that one has to be familiar with before starting to work with it. In this article, I would like to provide a basic description of the landscape which will further be explored in the coming articles. To ones who are familiar with Structured APIs and its concepts, there are not many new conce...

Structured API Execution

  Spark Structured API execution can be split up into 3 high level steps: The syntax of the code is checked (common for any code) If the code is found to be valid, Spark converts this to a Logical Plan This Logical plan is then converted to Physical plan. In this process, any optimizations that are possible are applied. Spark executes this Physical Plan(RDD manipulations) on the cluster The first step does not just apply to Spark but to all types of programs. Python checks for syntax of the code written, and fails the job if there is any violation of syntax. The second step is what I would describe as Spark-specific. In this step, a sequence of three types of Logical plans are created, with the most optimal one at the end. Logical plans are abstract, with no clarity on low-level execution details. We could imagine it to be similar to a flow chart that gives a high-level idea about how operations are to be performed and in what order. As the high-level is a good start, a detailed pl...