Now that we know there are two types of operations in Spark and one of them is Transformations, let's dive deeper into the types of Transformations.
Transformations are the operations that benefit most from optimization when working with large datasets. Because transformations are lazily evaluated, when we chain several of them together Spark can analyze the whole chain and find an optimized way of executing it.
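Lazy evaluation is easy to picture with ordinary Python generators, which also defer work until a result is demanded. This is only an analogy, not Spark code: chained generator expressions, like chained transformations, compute nothing until a final "action" (here, `list()`) forces evaluation in a single pass.

```python
# Analogy only: Python generators are lazy, like Spark transformations.
data = range(1, 7)

# "Transformations": nothing is computed when these two lines run.
doubled = (x * 2 for x in data)         # like map()
large = (x for x in doubled if x > 6)   # like filter()/where()

# "Action": evaluation happens only now, streaming over the data once.
result = list(large)
print(result)  # [8, 10, 12]
```

Because evaluation is deferred, the two steps are fused into one pass rather than materializing an intermediate `doubled` collection, which is the same opportunity Spark exploits when it plans a chain of transformations.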
Transformations can be classified into two types based on their dependencies:
1. Narrow
2. Wide
Narrow Transformations:
A transformation is classified as narrow when each partition of the result is computed from a single source partition. The output partition depends on only one input partition, hence the name narrow.
Suppose some transformation is applied to source data that has two input partitions. If it is a narrow transformation, then o/p partition 1 can be generated from i/p partition 1 alone, and likewise for o/p partition 2. You could imagine weight loss as a narrow transformation in real life: at the end of the weight loss there is still only one person, but a transformed one.
Some examples of narrow transformations are: filter()/where(), select(), map().
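The one-to-one dependency can be sketched with plain Python lists standing in for partitions. This is an illustration of the dependency pattern, not the PySpark API: a filter()-like narrow transformation reads each partition independently, so no data ever crosses a partition boundary.

```python
# Illustration only: lists stand in for partitions; no Spark required.
input_partitions = [[1, 2, 3], [4, 5, 6]]

def keep_even(partition):
    """A filter()-like narrow transformation: reads one partition only."""
    return [x for x in partition if x % 2 == 0]

# Each output partition is computed from exactly one input partition.
output_partitions = [keep_even(p) for p in input_partitions]
print(output_partitions)  # [[2], [4, 6]]
```

Because no record needs to leave its partition, Spark can run narrow transformations on each partition in parallel and even pipeline several of them together without any network traffic.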
Wide Transformations:
A transformation is classified as wide when a partition of the result is computed from records of multiple source partitions. The output partition now depends on many input partitions, hence the name wide.
In the above diagram, o/p partition 1 cannot be generated by applying "some transformation" to i/p partition 1 alone; its records depend on both i/p partition 1 and i/p partition 2. Our beliefs can be considered the result of a wide transformation involving wide dependencies: a belief may be supported by various references, and those references are then transformed into one solid belief.
In the case of data, a wide transformation requires the data on one node to be shared with another node in order to generate the records of an o/p partition. The act of moving data from one node to another is called shuffling; the data is transferred over the network between workers. The cost of this movement depends on both the size of the data being shuffled and the frequency at which it is shuffled. Shuffling will be covered in detail in another article.
Some of the transformations that require shuffling are: groupBy(), join().
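What the shuffle accomplishes can be sketched the same way, with lists of (key, value) records standing in for partitions. Again this is an illustration of the pattern, not real Spark code: to compute a groupBy()-style sum per key, records with the same key must first be routed to the same place, so every output partition can depend on every input partition.

```python
from collections import defaultdict

# Illustration only: lists of (key, value) records stand in for partitions.
input_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("b", 4)],
]

# "Shuffle": route every record to an output partition by its key, so that
# records sharing a key end up together (here, hashed into 2 buckets).
num_output = 2
shuffled = [[] for _ in range(num_output)]
for partition in input_partitions:
    for key, value in partition:
        shuffled[hash(key) % num_output].append((key, value))

# After the shuffle, a per-key aggregation (like groupBy on a key column
# followed by a sum) is a purely local operation again.
sums = {}
for partition in shuffled:
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    sums.update(totals)
print(dict(sorted(sums.items())))  # {'a': 4, 'b': 6}
```

Notice that each `shuffled` bucket may contain records drawn from both input partitions; that all-to-all routing step is exactly the network transfer that makes wide transformations expensive.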
Conclusion:
I believe you should now have a basic understanding of narrow and wide transformations in Spark and why the distinction matters. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article on this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check back whenever you need a reference.