
What is Shuffling in Spark

Shuffling in Spark is a mechanism that re-distributes data across the different executors or workers in the cluster.

Why do we need to re-distribute the data?

A) Re-distribution is needed when the number of data partitions has to be increased or decreased, in situations like these:

- When there are too few partitions to spread the data load across the cluster.
- When there are so many partitions that task-scheduling overhead becomes the bottleneck in processing time.

This kind of re-distribution is performed on an existing distributed collection (an RDD, a DataFrame, etc.) through the "repartition" and "coalesce" APIs in Spark, as shown in the sketch below.

B) During aggregations and joins on a data collection in Spark, all the records belonging to the same aggregation or join key must reside in a single partition. When the existing partitioning scheme doesn't satisfy this condition, there is a need to re-distribute the data in...
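Here is a minimal sketch of both kinds of re-distribution. It assumes a local SparkSession, and the example DataFrame, column names ("dept", "salary"), and partition counts are illustrative only:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object ShuffleDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session for illustration.
    val spark = SparkSession.builder()
      .appName("ShuffleDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Tiny example DataFrame; names and values are made up.
    val df = Seq(("eng", 100), ("eng", 200), ("hr", 150))
      .toDF("dept", "salary")

    // A) Explicit re-distribution:
    // repartition(n) triggers a full shuffle and can increase
    // or decrease the number of partitions.
    val repartitioned = df.repartition(8)
    println(repartitioned.rdd.getNumPartitions) // 8

    // coalesce(n) only merges existing partitions (no full shuffle),
    // so it can only decrease the partition count.
    val coalesced = repartitioned.coalesce(2)
    println(coalesced.rdd.getNumPartitions)     // 2

    // B) Implicit shuffle during aggregation: groupBy moves all
    // records with the same "dept" key into one partition
    // before the sum is computed.
    df.groupBy("dept").agg(sum("salary")).show()

    spark.stop()
  }
}
```

Note the design difference the sketch illustrates: repartition always pays the cost of a full shuffle, while coalesce avoids one by collapsing partitions in place, which is why it cannot grow the partition count.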