
Datasets API

Datasets are one of the APIs in Spark's Structured API family. They can be considered the foundational structured type, since a DataFrame is itself a kind of Dataset. The Dataset API relies on JVM features, hence it works only with Scala and Java. Datasets are also known as the typed API, because they are strongly typed: to be more specific, each record is a domain-specific typed object. They can be operated on in parallel using functional programming constructs as well as the operations we are familiar with from the DataFrame API.

Datasets are represented by the following notation:


Dataset[T]


Where T is a domain-specific type defined using either a Scala case class or a JavaBean class. This type can contain multiple attributes, which act as the columns of the Dataset, and each row in the Dataset is an object of class T. An example case class can be found below:


case class Product(pId: Int, product_name:String, category: String, price: Double)


The case class Product has four attributes, and any Dataset created with this case class as its schema will have rows with four columns that correspond to the attributes defined in the class. Spark's internal types, such as StringType and IntegerType, map to the built-in types of Spark's supported languages, like String and Int. To support domain-specific objects efficiently, a specific concept called Encoders is required. An encoder maps the domain-specific type T to Spark's internal type system and directs Spark to generate code at runtime that serializes objects of that type into Spark's internal binary structure. When we use DataFrames, this binary structure is exposed as a Row. When we use the Dataset API, for every row (domain-specific object) the application touches, Spark converts its internal row format into the object we specified using the case class or JavaBean class. This conversion slows down operations, but in exchange it provides more flexibility and type safety.
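To make the encoder machinery concrete, here is a minimal sketch of building a Dataset of Product objects. It assumes a SparkSession named spark is already available (as in the Spark shell); importing spark.implicits._ brings the implicit encoders for case classes into scope, and the sample products are made up for illustration.

```scala
import spark.implicits._ // implicit Encoders for case classes and common types

case class Product(pId: Int, product_name: String, category: String, price: Double)

// Each element of the Seq becomes one row; the Product encoder tells Spark
// how to serialize these JVM objects into its internal binary format.
val products = Seq(
  Product(1, "Keyboard", "Accessories", 29.99),
  Product(2, "Monitor", "Displays", 199.99)
).toDS()

// The printed schema mirrors the four attributes of the case class.
products.printSchema()
```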


One disadvantage of using Datasets is that they require some pre-planning: we have to know the names and types of all the columns we are reading, because they are used to define the domain-specific object. This is not the case with DataFrames, where we can optionally let Spark infer the schema. Because the Dataset API requires that we define our data types ahead of time, the schema must be designed up front.
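The pre-planning looks roughly like the sketch below: the case class is written before any data is read, and the incoming columns must line up with its attributes. This assumes a SparkSession named spark and a hypothetical file path "products.json"; if the file's columns do not match the case class, the conversion fails at analysis time rather than producing typed objects.

```scala
import spark.implicits._

// Defined ahead of time; the column names and types we expect to read.
case class Product(pId: Int, product_name: String, category: String, price: Double)

// The untyped read produces a DataFrame; .as[Product] turns it into a
// Dataset[Product] by matching columns against the case class attributes.
val ds = spark.read.json("products.json").as[Product]
```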


Since DataFrames are a type of Dataset, we can convert a DataFrame to a Dataset and vice versa. To convert a DataFrame to a Dataset, we need a case class or JavaBean class that defines the schema of the resulting Dataset. In Scala, we can use the following notation to convert a DataFrame to a Dataset of type schemaCaseClass:


val dsSchemaCaseClass = df.as[schemaCaseClass]


The above statement instructs Spark to use encoders to serialize/deserialize the Row objects from Spark's internal memory representation into JVM objects (schemaCaseClass in this case). In the other direction, a Dataset can be converted to a DataFrame (Dataset[Row]) using the toDF method:


val df = dsSchemaCaseClass.toDF()
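Putting the two conversions together, here is a small round-trip sketch. It assumes a SparkSession named spark, and uses the Product case class from earlier in place of the generic schemaCaseClass; the sample row is invented for illustration.

```scala
import spark.implicits._
import org.apache.spark.sql.Dataset

case class Product(pId: Int, product_name: String, category: String, price: Double)

// An untyped DataFrame whose column names match the case class attributes.
val df = Seq((1, "Keyboard", "Accessories", 29.99))
  .toDF("pId", "product_name", "category", "price")

val ds: Dataset[Product] = df.as[Product] // DataFrame -> Dataset[Product]
val back = ds.toDF()                      // Dataset[Product] -> DataFrame (Dataset[Row])
```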


When to use Datasets?


Now that we are aware there are two options for representing our data in a table-like format, DataFrames and Datasets, it is important to know when it is appropriate to employ each. Since this article is scoped to Datasets, this section will stick to that scope. When the operations we wish to perform on the data cannot be expressed using the methods provided by the DataFrame API, Datasets are the prime contender. Because we have defined domain-specific objects, we have the freedom to express operations as ordinary functions. These functions can implement a large amount of business logic, which in some cases is easier than expressing it in Spark SQL. Datasets allow us to specify more complex and strongly typed transformations than we could perform on DataFrames alone, because we manipulate JVM types directly. And since we express the logic in plain Java or Scala, these transformations can be reused between single-node and Spark workloads. The most obvious use case for Datasets is when we need type safety. As we have seen, type safety comes with the cost of serializing and deserializing each object, so when a workload requires type safety and can tolerate that performance cost, the Dataset API is the obvious choice.
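As a sketch of such a typed transformation, the hypothetical applyDiscount function below is ordinary Scala: it could be unit-tested or reused outside Spark, and the compiler checks that it is applied to Product objects. It assumes a SparkSession named spark and invented sample data.

```scala
import spark.implicits._

case class Product(pId: Int, product_name: String, category: String, price: Double)

// Plain business logic as a function; reusable on a single node or in Spark.
def applyDiscount(p: Product): Product =
  if (p.category == "Displays") p.copy(price = p.price * 0.9) else p

val products = Seq(
  Product(1, "Keyboard", "Accessories", 29.99),
  Product(2, "Monitor", "Displays", 200.0)
).toDS()

// map is type-checked at compile time: a function that does not accept a
// Product would fail to compile instead of failing at runtime.
val discounted = products.map(applyDiscount)
```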


The most popular usage of Datasets is in combination with their "untyped" counterpart, DataFrames. When we use the two APIs together, we get the best of both worlds.
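One way to mix the two, sketched below under the same assumptions as before (a SparkSession named spark, the Product case class, invented data): apply the typed API where compile-time safety helps, then drop back to the untyped DataFrame API for relational aggregations.

```scala
import spark.implicits._
import org.apache.spark.sql.functions.avg

case class Product(pId: Int, product_name: String, category: String, price: Double)

val ds = Seq(
  Product(1, "Keyboard", "Accessories", 29.99),
  Product(2, "Monitor", "Displays", 199.99)
).toDS()

val avgByCategory = ds
  .filter(p => p.price > 10.0) // typed: p is a Product, checked by the compiler
  .toDF()                      // untyped from here on
  .groupBy("category")
  .agg(avg("price"))
```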


Conclusion


I hope this article was a good place to start. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article on this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check back whenever you need a reference.


Happy Learning! 😀

