
Column Methods Examples


In the last article, I covered the basics of columns and their available methods in detail. In this article, I will provide examples of how to work with those methods.


alias():


Column aliasing is very common in SQL of any kind. The alias method temporarily renames a column for the purposes of a query. The following example illustrates a simple usage of alias:


from pyspark.sql.functions import col, expr

df.select(df.Total.alias('Total_Marks'),
          df['Total'].alias('Total_Marks0'),
          col('Total').alias('Total_Marks1'),
          expr('Total').alias('Total_Marks2')).show()


I have illustrated different ways of accessing the columns to reiterate the point that there is more than one way to work with them. The next example shows how to add metadata to a column using the alias method:


df_new = df.select(col('Name'), col('Total').alias('Total_Marks', metadata={'max': 100}))

print(df_new.schema['Total_Marks'].metadata)

df_new.show()


O/P: {'max': 100}

asc() and desc()

The following query sorts the dataframe in ascending order of Name.


df_new.orderBy(col('Name').asc()).show()



The below example sorts the dataframe in ascending order of Total_Marks:

df_new.orderBy(col('Total_Marks').asc()).show()



The following query sorts the dataframe in descending order of Total_Marks.


df_new.orderBy(col('Total_Marks').desc()).show()



The orderBy() method accepts more than one column as an argument. So the following query sorts the dataframe in descending order of Name and then in ascending order of Total_Marks.

df_new.orderBy(col('Name').desc(),
               col('Total_Marks').asc()).show()


astype() and cast():


Both these methods are used to change the datatype of a column. The argument is expected to be a type class from pyspark.sql.types.


from pyspark.sql.types import *

df_new.select(col('Total_Marks').astype(DecimalType(32,9)),
              col('Total_Marks').cast(DecimalType(32,9))).show()



The above example illustrates two different methods to cast the Total_Marks column to DecimalType with a precision of 32 and scale of 9.
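
As a side note (my addition, not part of the original example), cast also accepts a SQL type string instead of a type class, which can be handy when the type is only known as text. A minimal equivalent sketch:

df_new.select(col('Total_Marks').cast('decimal(32,9)')).show()   # same result as DecimalType(32,9)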


between():


The between method greatly simplifies range conditions.


df_new.where(df_new.Total_Marks.between(25,60)).show()


The above expression filters records for students whose Total_Marks are between 25 and 60 (inclusive).
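
Under the hood, between is simply shorthand for a pair of comparisons. Here is a minimal sketch of the equivalent filter written with explicit operators:

# Equivalent filter using explicit comparison operators
df_new.where((df_new.Total_Marks >= 25) & (df_new.Total_Marks <= 60)).show()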



Bitwise methods:


The following are examples of bitwise methods provided in PySpark. 


df_new.select(col('Name'),col('Total_Marks').bitwiseOR(11)).show()


df_new.select(col('Name'),col('Total_Marks').bitwiseAND(11)).show()



df_new.select(col('Name'),col('Total_Marks').bitwiseXOR(11)).show()
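
To make the bit arithmetic concrete, here is a small worked illustration of my own, using a hypothetical mark of 76 (0b1001100) and the mask 11 (0b0001011):

print(76 | 11)   # bitwiseOR  -> 79 (0b1001111)
print(76 & 11)   # bitwiseAND -> 8  (0b0001000)
print(76 ^ 11)   # bitwiseXOR -> 71 (0b1000111)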


startswith() and endswith()


As I mentioned in the earlier article, both these methods are similar to their Python string equivalents. The following example illustrates the usage on the Name column, checking for names starting with ‘P’ and, separately, for names ending with ‘a’.


df_new.select(col('Name'),col('Name').startswith('P'),

              col('Name').endswith('a')).show()


contains()

The contains method checks whether a value contains the argument that we pass. We can apply it to any type of column. The first example illustrates usage with a string column.


df_new.select(col('Name'),col('Name').contains('a')).show()

When applied to a numerical column, contains checks whether the given number appears within the column's values.

df_new.select(col('Name'),col('Total_Marks'),

              col('Total_Marks').contains(5)).show()



when() and otherwise()

 

We cannot use the when method directly on a Column object to start a chain of conditions. Any such attempt will throw an error; the following query illustrates the mistake:



df_new.select(col('Name'),col('Total_Marks'),

              col('Total_Marks').when(col('Total_Marks') >= 90, 'A')

              .when(col('Total_Marks') >= 80, 'B')

              .when(col('Total_Marks') >= 70, 'C')

              .when(col('Total_Marks') >= 60, 'D')

              .when(col('Total_Marks') >= 50, 'E')

              .when(col('Total_Marks') < 50,'F')

              .alias('Grade')).show()



This might be confusing because we have indeed chained when methods, each of which returns a Column object, back to back. The error points to the start of the chain of when conditions: the first when should not come from the Column object's when method, but from the when function provided in pyspark.sql.functions.


from pyspark.sql.functions import when   # line 1

df_new.select(col('Name'),col('Total_Marks'),
              when(col('Total_Marks') >= 90, 'A')   # the when function that
              .when(col('Total_Marks') >= 80, 'B')  # was imported in line 1
              .when(col('Total_Marks') >= 70, 'C')
              .when(col('Total_Marks') >= 60, 'D')
              .when(col('Total_Marks') >= 50, 'E')
              .when(col('Total_Marks') < 50, 'F')
              .alias('Grade')).show()


This is the standard way to start a case when statement. Note that the when function returns a Column object, which is why we were able to apply the alias method.


The last condition, the fail grade, can be set as the default condition, since the same grade applies to any mark below 50. The otherwise method is used to define this default. The query above therefore changes to the following, which follows the standard structure of a case when statement:


df_new.select('Name',col('Total_Marks'),

               when(col('Total_Marks') >= 90, 'A')

              .when(col('Total_Marks') >= 80, 'B')

              .when(col('Total_Marks') >= 70, 'C')

              .when(col('Total_Marks') >= 60, 'D')

              .when(col('Total_Marks') >= 50, 'E')

              .otherwise('F').alias('Grade')).show()
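
As a side note (my own addition, not part of the original example), the same grading logic can also be written as a SQL CASE WHEN string and evaluated with the expr function:

from pyspark.sql.functions import expr

# Equivalent CASE WHEN expression written as Spark SQL
df_new.select('Name', 'Total_Marks',
              expr("CASE WHEN Total_Marks >= 90 THEN 'A' "
                   "WHEN Total_Marks >= 80 THEN 'B' "
                   "WHEN Total_Marks >= 70 THEN 'C' "
                   "WHEN Total_Marks >= 60 THEN 'D' "
                   "WHEN Total_Marks >= 50 THEN 'E' "
                   "ELSE 'F' END AS Grade")).show()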



When the default condition is omitted, null (None) is returned for records that do not satisfy any of the conditions.


df_null = df_new.select('Name',col('Total_Marks'),

               when(col('Total_Marks') >= 90, 'A')

              .when(col('Total_Marks') >= 80, 'B')

              .when(col('Total_Marks') >= 70, 'C')

              .when(col('Total_Marks') >= 60, 'D')

              .when(col('Total_Marks') >= 50, 'E')

              .alias('Grade'))

df_null.show()


isNull() and isNotNull()


The following examples work on the df_null that was generated as a result of the above query. 


df_null.select(col('Name'),

               col('Grade'),

               col('Grade').isNull(),

               col('Grade').isNotNull()).show()


The results of isNull and isNotNull are boolean values, as expected. Now let's take a look at their usage with the where() method of the dataframe.


df_null.select(col('Name'), col('Grade')) \
       .where(col('Grade').isNull()).show()


df_null.select(col('Name'), col('Grade')) \
       .where(col('Grade').isNotNull()).show()



isin():

As you might be aware, this method performs a membership check of a column's values against a predetermined set of values. The first example illustrates the result of applying isin():


df_null.select(col('Grade').isin(['A','B','C','D','E'])).show()


Null values, when checked for membership, result in null rather than False. However, when this expression is passed as an argument to the where() method, null is evaluated as False.


df_null.select(col('Name'), col('Grade')) \
       .where(col('Grade').isin(['A','B','C','D','E'])).show()
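
If you also want to retain the rows where Grade is null, one option (a sketch of my own, not from the original examples) is to combine the membership check with isNull():

# Keep rows whose Grade is in the list OR is null
df_null.select(col('Name'), col('Grade')) \
       .where(col('Grade').isin(['A','B','C','D','E']) | col('Grade').isNull()).show()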


substr():

The substr() method seems very straightforward at a high level, but it has nuances that we might not find in every database server's implementation. First, let me give you some examples of basic usage of this method:


df_null.select(col('Name').substr(1,3)).show()



Now that we have covered the basic usage, let's dive into the peculiar cases. When we try to use the substr method on an integer column, the result is the following:


df_new = df_null.select(col('Total_Marks'),

               col('Total_Marks').substr(1,1))

df_new.show()



What happens here is that the column is first cast to StringType() and then the substr method is applied with the given arguments. Hence we observe that only the first character is returned. Printing the schema of df_new confirms this:


df_new.printSchema()



The same type of behavior is observed with date columns as well. I have created a date column, x, for illustration.

Now let's apply the substr method to this column; as you might expect, the result follows the same pattern:


df_date.select(col('x'),col('x').substr(1,3)).show()


like() and rlike()

The like method is very useful for string pattern matching. The following query searches for names that have exactly three letters, with each character represented by ‘_’:


df_null.select(col('Name').like('___')).show()

df_null.where(col('Name').like('___')).show()
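
The other standard SQL LIKE wildcard, '%', matches any sequence of characters. A small additional sketch (not part of the original examples):

# Names starting with 'P', using the '%' wildcard
df_null.where(col('Name').like('P%')).show()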

rlike is used specifically to search for regular expression patterns. The example below searches for names that contain the string ‘ana’ one or more times:


df_null.select('Name',col('Name').rlike('(ana)+')).show()



df_null.where(col('Name').rlike('(ana)+')).show()
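
Regular expressions also allow anchored searches. For instance, the following sketch (my own addition) matches names that begin with a vowel:

# Names whose first character is an uppercase vowel
df_null.where(col('Name').rlike('^[AEIOU]')).show()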


Conclusion:


The examples illustrated in this article should be enough to get you started. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article on this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check back whenever you need a reference.




Happy Learning! 😀


