By now you should be aware of the major distributed collection APIs that Spark's Structured APIs provide. If not, please visit the following page to get an idea of the major APIs that Spark offers.
We know that DataFrames and Datasets are distributed, table-like collections with well-defined rows and columns. Every column in a DataFrame/Dataset must have the same number of rows, and every column carries a type restriction to which all of its values must adhere.
Schemas:
A schema is a definition of the column names and their types for a DataFrame. We can either specify the schema manually or read it from the data source (which is called schema on read).
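As a quick sketch of schema on read, the snippet below asks Spark to infer the column types while reading a CSV file (the file name students.csv is a made-up example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# With inferSchema enabled, Spark samples the data and works out the
# column types itself instead of requiring a manually defined schema.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("students.csv"))  # hypothetical file path

df.printSchema()
```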
Just like how every value stored in a variable of a programming language has a data type, the values that go into a column must also be of a specific Spark type. Spark has an engine called Catalyst that maintains all of this type information. For every language API that Spark supports, there is a mapping from each Spark type to a type in that language. The majority of operations performed with Spark work directly on Spark types, not on language-specific types. For example, let's take the Students table that we worked with in the previous articles and add five grace marks to the total of every student, which can be represented as shown below.
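Here is a minimal sketch, assuming the Students data is available as a DataFrame named df with Name and Total columns (the sample rows are made up purely for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("grace-marks").getOrCreate()

# A small stand-in for the Students table from the earlier articles.
df = spark.createDataFrame(
    [("Arun", 430), ("Divya", 465)],
    ["Name", "Total"],
)

# Add five grace marks to every student's total.
df_with_grace = df.withColumn("Total", df.Total + 5)
df_with_grace.show()
```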
Even though "df.Total + 5" looks like a normal Python expression, what happens behind the scenes is that Spark converts it into its internal Catalyst representation after applying the respective type mapping. The 5 in the expression is of type int in Python; its equivalent mapping in Spark is IntegerType(). This mapping, along with any other necessary conversions, is applied to translate the expression into its Catalyst representation, which is then executed.
Just like in any table, the two important entities in a DataFrame are columns and rows. Spark provides Column and Row types that let us access individual columns and rows and perform some basic operations on them independently. These types have nothing to do with schemas, and we do not use Column or Row to define schemas directly. Intuitively, every column in a DataFrame is of type Column and every row is of type Row. Rows and columns will be covered in depth in another article.
Spark Types:
As mentioned earlier, Spark has its own set of types; some of the most commonly used ones and their corresponding Python types are shown below:
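The type classes live in the pyspark.sql.types module; the snippet below lists a few common ones with the Python type each maps to (this is not the complete list):

```python
from pyspark.sql.types import (
    StringType,     # Python str
    IntegerType,    # Python int (4-byte signed integer)
    LongType,       # Python int (8-byte signed integer)
    FloatType,      # Python float (single precision)
    DoubleType,     # Python float (double precision)
    BooleanType,    # Python bool
    DateType,       # datetime.date
    TimestampType,  # datetime.datetime
    ArrayType,      # Python list or tuple
    MapType,        # Python dict
    StructType,     # a whole schema: a collection of StructFields
    StructField,    # one named, typed column inside a StructType
)
```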
The following is an example of creating a schema to accommodate four columns.
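A minimal sketch using the type APIs (the column names Roll_No, Name, Dept and Total are hypothetical choices for illustration):

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# A schema with four columns; the third argument of StructField
# states whether the column is allowed to contain nulls.
schema = StructType([
    StructField("Roll_No", IntegerType(), False),
    StructField("Name", StringType(), True),
    StructField("Dept", StringType(), True),
    StructField("Total", IntegerType(), True),
])
```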
With the created schema we can build a DataFrame using the createDataFrame() method of the SparkSession object. Before creating the DataFrame, we can prepare the data that should go into it. Preparing the data by hand is not strictly necessary, but the following example creates a DataFrame with a few sample rows.
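Continuing with the same hypothetical columns, a sketch that puts the schema and some made-up rows together might look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("schema-example").getOrCreate()

schema = StructType([
    StructField("Roll_No", IntegerType(), False),
    StructField("Name", StringType(), True),
    StructField("Dept", StringType(), True),
    StructField("Total", IntegerType(), True),
])

# Made-up sample rows; each tuple must line up with the schema above.
data = [
    (1, "Arun", "CSE", 435),
    (2, "Divya", "ECE", 465),
    (3, "Kiran", "EEE", 410),
]

df = spark.createDataFrame(data, schema)
df.printSchema()
df.show()
```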
In the above example I created the schema with the type APIs provided by Spark, but that is not the only way to define a schema. We can also define one using a schema string.
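A schema string uses DDL-style syntax, with each column name followed by its type. A sketch with the same hypothetical columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-string").getOrCreate()

# DDL-style schema string: column name followed by its type.
schema_string = "Roll_No INT, Name STRING, Dept STRING, Total INT"

data = [
    (1, "Arun", "CSE", 435),
    (2, "Divya", "ECE", 465),
]

df = spark.createDataFrame(data, schema_string)
df.printSchema()
```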
To define complex types like Array, Map and Struct, you can use the following form of the string:
'Column_name ARRAY<element type>, Column_name MAP<key type, value type>, Column_name STRUCT<col_name: type, ...>'
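A sketch with hypothetical columns holding an array of marks, a map of subject to marks, and a struct for the address:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("complex-types").getOrCreate()

schema_string = (
    "Name STRING, "
    "Marks ARRAY<INT>, "
    "Subject_Marks MAP<STRING, INT>, "
    "Address STRUCT<City: STRING, Pincode: INT>"
)

# A Python list maps to ARRAY, a dict to MAP, and a tuple to STRUCT.
data = [
    ("Arun", [85, 90, 95], {"Maths": 90, "Physics": 95}, ("Chennai", 600001)),
]

df = spark.createDataFrame(data, schema_string)
df.printSchema()
df.show(truncate=False)
```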
Conclusion:
I believe you now have a basic understanding of schemas in the Spark DataFrame API and of how to create or define one. If you have any suggestions or questions, please post them in the comment box. This article, like every article on this blog, will be updated based on comments and as I find better ways of explaining things, so kindly bookmark this page and check it out whenever you need a reference.
Happy Learning! 😀