Datasets are one of the APIs provided in Spark's Structured API. The Dataset is considered the foundational type of the structured APIs, since dataframes themselves are a type of dataset. The Dataset API is a JVM language feature, hence it works only with Scala and Java. Datasets are also known as the typed API. The name comes from the fact that datasets are strongly typed; more specifically, they hold domain-specific typed objects. They can be operated on in parallel using functional programming constructs or the operations we are familiar with from the Dataframe API.
Datasets are represented by the following notation:
Dataset[T]
Where T is a domain-specific type defined using either Scala case classes or JavaBean classes. This type can contain multiple attributes, which act as columns in the dataset. Each row in the dataset will be an object of class T. An example case class can be found below:
case class Product(pId: Int, product_name:String, category: String, price: Double)
The case class Product has four attributes, and any dataset created with this case class as its schema will have rows with four columns identical to the attributes defined in the class. Spark's internal types, such as StringType and IntegerType, map to the built-in types of Spark's supported languages, like String and Int. To support domain-specific objects efficiently, a concept called Encoders is required. An encoder maps the domain-specific type T to Spark's internal type system. The encoder directs Spark to generate code at runtime to serialize the dataset schema into Spark's internal binary structure. When we use dataframes, this binary structure is a Row. When using the Dataset API, for every row (domain-specific object) the application touches, Spark converts its internal row format to the object we specified using the case class or JavaBean class. This conversion slows down the operations that are defined, but provides more flexibility and type safety.
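To make this concrete, here is a minimal sketch of creating a dataset from the Product case class above. The SparkSession setup and the sample data are assumptions for illustration, not part of the original example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object ProductDatasetExample {
  // Case classes used as dataset schemas are defined at the top level (or
  // inside an object), so Spark can derive an Encoder[Product] for them.
  case class Product(pId: Int, product_name: String, category: String, price: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // brings the implicit encoders for case classes into scope

    val products = Seq(
      Product(1, "Keyboard", "Electronics", 49.99),
      Product(2, "Notebook", "Stationery", 3.49)
    )

    // toDS() uses the derived Encoder[Product] to serialize each object
    // into Spark's internal binary format.
    val productDS: Dataset[Product] = products.toDS()
    productDS.show()

    spark.stop()
  }
}
```

Note that the `import spark.implicits._` line is what makes the encoder for our case class available; without it, `toDS()` will not compile.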
One disadvantage of using datasets is that they require some pre-planning: we have to know the names and types of all the columns we are reading, since these are used to define the domain-specific object. This is not the case with dataframes, where we can optionally let Spark infer the schema.
Since dataframes are a type of dataset, we can convert a dataframe to a dataset and vice versa. To convert a dataframe to a dataset, we need a case class or JavaBean class that defines the schema of the resulting dataset. In Scala, we can use the following notation to convert a dataframe to a dataset of type schemaCaseClass:
val dsSchemaCaseClass = df.as[schemaCaseClass]
The above statement instructs Spark to use encoders to serialize/deserialize the Row objects from Spark's internal memory representation to JVM objects (schemaCaseClass in this case). Similarly, a dataset can be converted back to a dataframe (Dataset[Row]) using the toDF method:
val df = dsSchemaCaseClass.toDF()
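The round trip between the two APIs can be sketched as follows. The SparkSession and the Product case class here are illustrative assumptions, with the data built inline rather than read from a source:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object ConversionExample {
  case class Product(pId: Int, product_name: String, category: String, price: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Start with an untyped dataframe (Dataset[Row]).
    val df: DataFrame = Seq(
      (1, "Keyboard", "Electronics", 49.99)
    ).toDF("pId", "product_name", "category", "price")

    // Dataframe -> dataset: the column names and types must line up with
    // Product, otherwise as[Product] fails at analysis time.
    val ds: Dataset[Product] = df.as[Product]

    // Dataset -> dataframe: drops the Product type, back to Dataset[Row].
    val backToDF: DataFrame = ds.toDF()

    spark.stop()
  }
}
```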
When to use Datasets?
Now that we know there are two options for representing our data in a table-like format, dataframes and datasets, it is important to know when each is appropriate. Since this article is scoped to datasets, this section will stick to that scope. When the operations we wish to perform on the data cannot be expressed using the methods provided by the Dataframe API, datasets are a prime contender. Because we have defined domain-specific objects, we are free to express these operations as functions. These functions can implement a large amount of business logic, which in some cases is easier than expressing it in Spark SQL. Datasets allow us to specify more complex and strongly typed transformations than we could perform on dataframes alone, because we manipulate JVM types. And since we express the logic in plain Java or Scala, these transformations can be reused between single-node and Spark workloads. Another obvious use case for datasets is when we need type safety. As we have seen, type safety comes with the cost of serializing/deserializing, so when a workload requires type safety and can tolerate the performance cost, datasets are the obvious choice.
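A typed transformation of this kind can be sketched as below, reusing the hypothetical Product case class. The discount rule is made-up business logic for illustration:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object TypedTransformExample {
  case class Product(pId: Int, product_name: String, category: String, price: Double)

  // A plain Scala function: the compiler checks the field names and types,
  // unlike a string-based dataframe expression such as "price > 20".
  def isDeal(p: Product): Boolean =
    p.category == "Electronics" && p.price > 20.0

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val products = Seq(
      Product(1, "Keyboard", "Electronics", 49.99),
      Product(2, "Notebook", "Stationery", 3.49)
    )

    // The same function runs on a distributed dataset...
    val deals: Dataset[Product] = products.toDS().filter(isDeal _)

    // ...and on an ordinary local collection, so the logic can be reused
    // between Spark and single-node workloads.
    val localDeals: Seq[Product] = products.filter(isDeal)

    spark.stop()
  }
}
```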
The most popular usage of datasets is in combination with their "untyped" counterpart, dataframes. When we use both of these APIs together, we get the best of both worlds.
Conclusion
I hope this article was a good place to start. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check back whenever you need a reference.
Happy Learning! 😀