
Schemas in Spark


By now you should be aware of the major distributed collection APIs that the Structured APIs provide. If not, please visit the following page to get an idea of the major APIs that Spark provides.

We know that DataFrames and Datasets are distributed table-like collections with well-defined rows and columns. Every column in a DataFrame/Dataset must have the same number of rows, and each column has a type restriction that every row must adhere to.


Schemas:


A schema is a definition of the column names and their types for a dataframe. We can either specify the schema manually or read it from the data source (which is called schema-on-read).


Just like how each value stored in a variable of any programming language has a data type, the values that go into a column should also be of a specific Spark type. Spark has an engine called Catalyst which maintains all the type information. For every language API that Spark supports, there is a mapping from each Spark type to a type in that language. The majority of the operations performed using Spark work directly on Spark types, not on language-specific types. For example, let’s take the Students table that we operated on in the previous articles and add five grace marks to the total of every student, which can be represented as shown below.
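A minimal sketch of that operation, assuming a SparkSession and a dataframe with illustrative Name and Total columns standing in for the Students table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schemas-demo").getOrCreate()

# "Name" and "Total" are illustrative column names standing in for the
# Students table from the earlier articles.
df = spark.createDataFrame([("Asha", 430), ("Ravi", 415)], ["Name", "Total"])

# Add five grace marks to every student's total.
df = df.withColumn("Total", df.Total + 5)
df.show()
```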



Even though “df.Total + 5” looks like a normal Python expression, what happens behind the scenes is that Spark converts this expression to its internal Catalyst representation after applying the respective type mapping. The 5 in the expression is of type int in Python; the equivalent Spark type is IntegerType(). This mapping and other necessary conversions are performed to render the expression in its Catalyst representation, which is then executed.


Just like in any table, the two important entities in a dataframe are columns and rows. Spark provides corresponding Column and Row types, which allow us to access individual columns and rows and perform basic operations on them independently; note that these types have nothing to do with schemas, and we do not use Column or Row to define schemas directly. It is intuitive that every column in a dataframe is of type Column and every row is of type Row. A small taste is shown below; rows and columns will be covered in depth in another article.
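A quick sketch, reusing the illustrative dataframe from above:

```python
from pyspark.sql import Row
from pyspark.sql.functions import col

# A Row object holds the values of a single record.
r = Row(Name="Asha", Total=435)
print(r.Name, r["Total"])

# A Column is an expression over a column; it does nothing on its own
# until it is used inside a dataframe operation.
grace = col("Total") + 5
df.select(grace.alias("TotalWithGrace")).show()
```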


Spark Types:


As mentioned earlier, Spark has its own types and the following table provides a clear mapping to their corresponding Python types:

| Spark Type | Type in Python | Remarks | API to create an instance of type |
| --- | --- | --- | --- |
| ByteType | int or long | At runtime, numbers will be converted to 1-byte signed integers. Range: -128 to 127 | ByteType() |
| ShortType | int or long | At runtime, numbers will be converted to 2-byte signed integers. Range: -32768 to 32767 | ShortType() |
| IntegerType | int or long | Loose range, based on Python. Better to use LongType for larger integers. | IntegerType() |
| LongType | long | At runtime, numbers will be converted to 8-byte signed integers. Range: -9223372036854775808 to 9223372036854775807 | LongType() |
| FloatType | float | At runtime, numbers will be converted to 4-byte single-precision floating point numbers. | FloatType() |
| DoubleType | float | | DoubleType() |
| DecimalType | decimal.Decimal | | DecimalType() |
| StringType | string | | StringType() |
| BinaryType | bytearray | | BinaryType() |
| BooleanType | bool | | BooleanType() |
| TimestampType | datetime.datetime | | TimestampType() |
| DateType | datetime.date | | DateType() |
| ArrayType | list, tuple or array | containsNull determines whether an array element can be null; default is True | ArrayType(elementType, [containsNull]) |
| MapType | dict | valueContainsNull determines whether a value in the map can be null; default is True | MapType(keyType, valueType, [valueContainsNull]) |
| StructField | Depends on the dataType provided for this field (column) | Used to define a field in a dataframe. nullable determines whether the field can be null; default is True | StructField(name, dataType, [nullable]) |
| StructType | list or tuple | A collection of fields; fields is a list of StructFields | StructType(fields) |

The following is an example of creating a schema to accommodate 4 columns.
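A sketch using Spark’s type APIs; the column names (Name, Roll, Total, Passed) are illustrative assumptions:

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, BooleanType,
)

# A 4-column schema: each StructField takes a name, a Spark type and a
# nullable flag.
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Roll", IntegerType(), False),
    StructField("Total", IntegerType(), True),
    StructField("Passed", BooleanType(), True),
])
```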

With the created schema we can create a dataframe using the createDataFrame() method of the SparkSession object. Before creating the dataframe, we can also prepare the data that goes into it. Passing data is not strictly necessary (an empty dataframe can be created from just a schema), but the following example creates a dataframe with some data.
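Continuing the sketch, and reusing the spark session from the earlier example, with some made-up rows that match the schema:

```python
# Sample rows matching the illustrative schema above.
data = [
    ("Asha", 1, 435, True),
    ("Ravi", 2, 420, True),
    ("Kiran", 3, 180, False),
]

df = spark.createDataFrame(data, schema)
df.printSchema()
df.show()
```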



In the above example I created a schema with the type APIs provided by Spark, but this is not the only way to define a schema in Spark. We can also define one using a schema string.
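For instance, roughly the same schema as the sketch above can be written as a DDL-style string (column names are again assumptions):

```python
# The same illustrative columns expressed as a schema string.
schema_string = "Name STRING, Roll INT, Total INT, Passed BOOLEAN"

df2 = spark.createDataFrame(data, schema_string)
df2.printSchema()
```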


To define complex types like arrays, maps and structs, you can use a string of the following form:


'column_name ARRAY<type of elements>, column_name MAP<type of key, type of value>, column_name STRUCT<col_name:type,...>'
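A sketch with such a string; the column names here (Subjects, Marks, Address) are assumptions for illustration:

```python
# An illustrative schema string with ARRAY, MAP and STRUCT columns.
complex_schema = (
    "Subjects ARRAY<STRING>, "
    "Marks MAP<STRING,INT>, "
    "Address STRUCT<city:STRING,pincode:INT>"
)

complex_df = spark.createDataFrame(
    [(["Maths", "Physics"], {"Maths": 90, "Physics": 85}, ("Chennai", 600001))],
    complex_schema,
)
complex_df.printSchema()
```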


Conclusion:


I believe you should now have a basic understanding of schemas in the Spark DataFrame API and how to create or define a schema. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check back whenever you need a reference.


Happy Learning! 😀


