
Schemas in Spark


By now you should be aware of the major distributed collection APIs that the Structured APIs provide. If not, please visit the following page to get an idea of the major APIs that Spark provides.

We know that DataFrames and Datasets are distributed table-like collections with well-defined rows and columns. Every column in a DataFrame/Dataset must have the same number of rows, and each column has a type restriction that every row must adhere to.


Schemas:


A schema is a definition of the column names and their types for a dataframe. We can either specify the schema manually or read it from the data source (which is called schema-on-read).


Just like how each value stored in a variable of any programming language has a data type, the values that go into a column should also be of a specific Spark type. Spark has an engine called Catalyst which maintains all the type information. For every language API that Spark supports, there is a mapping from each Spark type to a type in that language. The majority of the operations performed using Spark work directly on Spark types, not on language-specific types. For example, let’s take the Students table that we operated on in the previous articles and add five grace marks to the total of every student, which can be represented as shown below.
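A minimal sketch of that operation, assuming a SparkSession and a dataframe with illustrative Name and Total columns standing in for the Students table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schemas-demo").getOrCreate()

# "Name" and "Total" are illustrative column names standing in for the
# Students table from the earlier articles.
df = spark.createDataFrame([("Asha", 430), ("Ravi", 415)], ["Name", "Total"])

# Add five grace marks to every student's total.
df = df.withColumn("Total", df.Total + 5)
df.show()
```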



Even though “df.Total + 5” looks like a normal Python expression, what happens behind the scenes is that Spark converts this expression to its internal Catalyst representation after applying the respective type mapping. The 5 in the expression is of type int in Python; the equivalent Spark type is IntegerType(). This mapping and other necessary conversions are performed to render the expression in its Catalyst representation, which is then executed.


Just like in any table, the two important entities in a dataframe are columns and rows. Spark provides corresponding Column and Row types, which allow us to access individual columns and rows and perform basic operations on them independently; note that these types have nothing to do with schemas, and we do not use Column or Row to define schemas directly. It is intuitive that every column in a dataframe is of type Column and every row is of type Row. A small taste is shown below; rows and columns will be covered in depth in another article.
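A quick sketch, reusing the illustrative dataframe from above:

```python
from pyspark.sql import Row
from pyspark.sql.functions import col

# A Row object holds the values of a single record.
r = Row(Name="Asha", Total=435)
print(r.Name, r["Total"])

# A Column is an expression over a column; it does nothing on its own
# until it is used inside a dataframe operation.
grace = col("Total") + 5
df.select(grace.alias("TotalWithGrace")).show()
```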


Spark Types:


As mentioned earlier, Spark has its own types and the following table provides a clear mapping to their corresponding Python types:

| Spark Type | Type in Python | Remarks | API to create an instance of type |
| --- | --- | --- | --- |
| ByteType | int or long | At runtime, numbers will be converted to 1-byte signed integers. Range: -128 to 127 | ByteType() |
| ShortType | int or long | At runtime, numbers will be converted to 2-byte signed integers. Range: -32768 to 32767 | ShortType() |
| IntegerType | int or long | Loose range, based on Python. Better to use LongType for larger integers. | IntegerType() |
| LongType | long | At runtime, numbers will be converted to 8-byte signed integers. Range: -9223372036854775808 to 9223372036854775807 | LongType() |
| FloatType | float | At runtime, numbers will be converted to 4-byte single-precision floating point numbers. | FloatType() |
| DoubleType | float | | DoubleType() |
| DecimalType | decimal.Decimal | | DecimalType() |
| StringType | string | | StringType() |
| BinaryType | bytearray | | BinaryType() |
| BooleanType | bool | | BooleanType() |
| TimestampType | datetime.datetime | | TimestampType() |
| DateType | datetime.date | | DateType() |
| ArrayType | list, tuple or array | containsNull determines whether an array element can be null; default is True | ArrayType(elementType, [containsNull]) |
| MapType | dict | valueContainsNull determines whether a value in the map can be null; default is True | MapType(keyType, valueType, [valueContainsNull]) |
| StructField | Depends on the dataType provided for this field (column) | Used to define a field in a dataframe. nullable determines whether the field can be null; default is True | StructField(name, dataType, [nullable]) |
| StructType | list or tuple | A collection of fields; fields is a list of StructFields | StructType(fields) |

The following is an example of creating a schema to accommodate 4 columns.
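A sketch using Spark’s type APIs; the column names (Name, Roll, Total, Passed) are illustrative assumptions:

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, BooleanType,
)

# A 4-column schema: each StructField takes a name, a Spark type and a
# nullable flag.
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Roll", IntegerType(), False),
    StructField("Total", IntegerType(), True),
    StructField("Passed", BooleanType(), True),
])
```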

With the created schema we can create a dataframe using the createDataFrame() method of the SparkSession object. Before creating the dataframe, we can also prepare the data that goes into it. Passing data is not strictly necessary (an empty dataframe can be created from just a schema), but the following example creates a dataframe with some data.
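Continuing the sketch, and reusing the spark session from the earlier example, with some made-up rows that match the schema:

```python
# Sample rows matching the illustrative schema above.
data = [
    ("Asha", 1, 435, True),
    ("Ravi", 2, 420, True),
    ("Kiran", 3, 180, False),
]

df = spark.createDataFrame(data, schema)
df.printSchema()
df.show()
```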



In the above example I created a schema with the type APIs provided by Spark, but this is not the only way to define a schema in Spark. We can also define one using a schema string.
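For instance, roughly the same schema as the sketch above can be written as a DDL-style string (column names are again assumptions):

```python
# The same illustrative columns expressed as a schema string.
schema_string = "Name STRING, Roll INT, Total INT, Passed BOOLEAN"

df2 = spark.createDataFrame(data, schema_string)
df2.printSchema()
```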


To define complex types like arrays, maps and structs, you can use a string of the following form:


'column_name ARRAY<type of elements>, column_name MAP<type of key, type of value>, column_name STRUCT<col_name:type,...>'
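A sketch with such a string; the column names here (Subjects, Marks, Address) are assumptions for illustration:

```python
# An illustrative schema string with ARRAY, MAP and STRUCT columns.
complex_schema = (
    "Subjects ARRAY<STRING>, "
    "Marks MAP<STRING,INT>, "
    "Address STRUCT<city:STRING,pincode:INT>"
)

complex_df = spark.createDataFrame(
    [(["Maths", "Physics"], {"Maths": 90, "Physics": 85}, ("Chennai", 600001))],
    complex_schema,
)
complex_df.printSchema()
```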


Conclusion:


I believe you should now have a basic understanding of schemas in the Spark DataFrame API and how to create or define a schema. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check back whenever you need a reference.


Happy Learning! 😀


