
Column Methods Examples


In the last article, I covered the basics of columns and their available methods in detail. In this article, I will provide examples of how to work with those methods.


alias():


Column aliasing is very common in SQL of any kind. The alias method temporarily renames a column for the purposes of a query. The following example illustrates a simple usage of alias:


from pyspark.sql.functions import col, expr

df.select(df.Total.alias('Total_Marks'),
          df['Total'].alias('Total_Marks0'),
          col('Total').alias('Total_Marks1'),
          expr('Total').alias('Total_Marks2')).show()


I have illustrated different ways of accessing the columns to reiterate the point that there is more than one way to work with them. The next example shows how to add metadata to a column using the alias method:


df_new = df.select(col('Name'), col('Total').alias('Total_Marks', metadata={'max': 100}))

print(df_new.schema['Total_Marks'].metadata)

df_new.show()


O/P: {'max': 100}

asc() and desc()

The following query sorts the dataframe in ascending order of Name.


df_new.orderBy(col('Name').asc()).show()



The below example sorts the dataframe in ascending order of Total_Marks:

df_new.orderBy(col('Total_Marks').asc()).show()



The following query sorts the dataframe in descending order of Total_Marks.


df_new.orderBy(col('Total_Marks').desc()).show()



The orderBy() method accepts more than one column as an argument. So the following query sorts the dataframe in descending order of Name and then in ascending order of Total_Marks.

df_new.orderBy(col('Name').desc(),
               col('Total_Marks').asc()).show()


astype() and cast():


Both these methods are used to change the datatype of a column. The argument is expected to be a type class from pyspark.sql.types.


from pyspark.sql.types import *

df_new.select(col('Total_Marks').astype(DecimalType(32,9)),
              col('Total_Marks').cast(DecimalType(32,9))).show()



The above example illustrates two different methods to cast the Total_Marks column to DecimalType with a precision of 32 and scale of 9.
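
As a side note (my addition, not part of the original example), cast also accepts a SQL type string instead of a type class, which can be handy when the type is only known as text. A minimal equivalent sketch:

df_new.select(col('Total_Marks').cast('decimal(32,9)')).show()   # same result as DecimalType(32,9)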


between():


The between method greatly simplifies range conditions.


df_new.where(df_new.Total_Marks.between(25,60)).show()


The above expression filters records for students whose Total_Marks are between 25 and 60 (inclusive).
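
Under the hood, between is simply shorthand for a pair of comparisons. Here is a minimal sketch of the equivalent filter written with explicit operators:

# Equivalent filter using explicit comparison operators
df_new.where((df_new.Total_Marks >= 25) & (df_new.Total_Marks <= 60)).show()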



Bitwise methods:


The following are examples of bitwise methods provided in PySpark. 


df_new.select(col('Name'),col('Total_Marks').bitwiseOR(11)).show()


df_new.select(col('Name'),col('Total_Marks').bitwiseAND(11)).show()



df_new.select(col('Name'),col('Total_Marks').bitwiseXOR(11)).show()
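
To make the bit arithmetic concrete, here is a small worked illustration of my own, using a hypothetical mark of 76 (0b1001100) and the mask 11 (0b0001011):

print(76 | 11)   # bitwiseOR  -> 79 (0b1001111)
print(76 & 11)   # bitwiseAND -> 8  (0b0001000)
print(76 ^ 11)   # bitwiseXOR -> 71 (0b1000111)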


startswith() and endswith()


As I mentioned in the earlier article, both these methods are similar to their Python string equivalents. The following example illustrates the usage on the Name column, checking for names starting with ‘P’ and, separately, for names ending with ‘a’.


df_new.select(col('Name'),col('Name').startswith('P'),

              col('Name').endswith('a')).show()


contains()

The contains method checks whether a value contains the argument that we pass. We can apply it to any type of column. The first example illustrates usage with a string column.


df_new.select(col('Name'),col('Name').contains('a')).show()

When applied to a numerical column, contains checks whether the given number appears within the column's values.

df_new.select(col('Name'),col('Total_Marks'),

              col('Total_Marks').contains(5)).show()



when() and otherwise()

 

We cannot use the when method directly on a Column object to start a chain of conditions. Any such attempt will throw an error; the following query illustrates the mistake:



df_new.select(col('Name'),col('Total_Marks'),

              col('Total_Marks').when(col('Total_Marks') >= 90, 'A')

              .when(col('Total_Marks') >= 80, 'B')

              .when(col('Total_Marks') >= 70, 'C')

              .when(col('Total_Marks') >= 60, 'D')

              .when(col('Total_Marks') >= 50, 'E')

              .when(col('Total_Marks') < 50,'F')

              .alias('Grade')).show()



This might be confusing because we have indeed chained when methods, each of which returns a Column object, back to back. The error points to the start of the chain of when conditions: the first when should not come from the Column object's when method, but from the when function provided in pyspark.sql.functions.


from pyspark.sql.functions import when   # line 1

df_new.select(col('Name'),col('Total_Marks'),
              when(col('Total_Marks') >= 90, 'A')   # the when function that
              .when(col('Total_Marks') >= 80, 'B')  # was imported in line 1
              .when(col('Total_Marks') >= 70, 'C')
              .when(col('Total_Marks') >= 60, 'D')
              .when(col('Total_Marks') >= 50, 'E')
              .when(col('Total_Marks') < 50, 'F')
              .alias('Grade')).show()


This is the standard way to start a case when statement. Note that the when function returns a Column object, which is why we were able to apply the alias method.


The last condition, the fail grade, can be set as the default condition, since the same grade applies to any mark below 50. The otherwise method is used to define this default. The query above therefore changes to the following, which follows the standard structure of a case when statement:


df_new.select('Name',col('Total_Marks'),

               when(col('Total_Marks') >= 90, 'A')

              .when(col('Total_Marks') >= 80, 'B')

              .when(col('Total_Marks') >= 70, 'C')

              .when(col('Total_Marks') >= 60, 'D')

              .when(col('Total_Marks') >= 50, 'E')

              .otherwise('F').alias('Grade')).show()
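
As a side note (my own addition, not part of the original example), the same grading logic can also be written as a SQL CASE WHEN string and evaluated with the expr function:

from pyspark.sql.functions import expr

# Equivalent CASE WHEN expression written as Spark SQL
df_new.select('Name', 'Total_Marks',
              expr("CASE WHEN Total_Marks >= 90 THEN 'A' "
                   "WHEN Total_Marks >= 80 THEN 'B' "
                   "WHEN Total_Marks >= 70 THEN 'C' "
                   "WHEN Total_Marks >= 60 THEN 'D' "
                   "WHEN Total_Marks >= 50 THEN 'E' "
                   "ELSE 'F' END AS Grade")).show()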



When the default condition is omitted, null (None) is returned for records that do not satisfy any of the conditions.


df_null = df_new.select('Name',col('Total_Marks'),

               when(col('Total_Marks') >= 90, 'A')

              .when(col('Total_Marks') >= 80, 'B')

              .when(col('Total_Marks') >= 70, 'C')

              .when(col('Total_Marks') >= 60, 'D')

              .when(col('Total_Marks') >= 50, 'E')

              .alias('Grade'))

df_null.show()


isNull() and isNotNull()


The following examples work on the df_null that was generated as a result of the above query. 


df_null.select(col('Name'),

               col('Grade'),

               col('Grade').isNull(),

               col('Grade').isNotNull()).show()


The results of isNull and isNotNull are boolean values, as expected. Now let's take a look at their usage with the where() method of the dataframe.


df_null.select(col('Name'), col('Grade')) \
       .where(col('Grade').isNull()).show()


df_null.select(col('Name'), col('Grade')) \
       .where(col('Grade').isNotNull()).show()



isin():

As you might be aware, this method performs a membership check of a column's values against a predetermined set of values. The first example illustrates the result of applying isin():


df_null.select(col('Grade').isin(['A','B','C','D','E'])).show()


Null values, when checked for membership, result in null rather than False. However, when this expression is passed as an argument to the where() method, null is evaluated as False.


df_null.select(col('Name'), col('Grade')) \
       .where(col('Grade').isin(['A','B','C','D','E'])).show()
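
If you also want to retain the rows where Grade is null, one option (a sketch of my own, not from the original examples) is to combine the membership check with isNull():

# Keep rows whose Grade is in the list OR is null
df_null.select(col('Name'), col('Grade')) \
       .where(col('Grade').isin(['A','B','C','D','E']) | col('Grade').isNull()).show()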


substr():

The substr() method seems very straightforward at a high level, but it has nuances that we might not find in every database server's implementation. First, let me give you some examples of basic usage of this method:


df_null.select(col('Name').substr(1,3)).show()



Now that we have covered the basic usage, let's dive into the peculiar cases. When we try to use the substr method on an integer column, the result is the following:


df_new = df_null.select(col('Total_Marks'),

               col('Total_Marks').substr(1,1))

df_new.show()



What happens here is that the column is first cast to StringType() and then the substr method is applied with the given arguments. Hence we observe that only the first character is returned. Printing the schema of df_new confirms this:


df_new.printSchema()



The same type of behavior is observed with date columns as well. I have created a date column, x, for illustration.

Now let's apply the substr method to this column; as you might expect, the result follows the same pattern:


df_date.select(col('x'),col('x').substr(1,3)).show()


like() and rlike()

The like method is very useful for string pattern matching. The following query searches for names that have exactly three letters, with each character represented by ‘_’:


df_null.select(col('Name').like('___')).show()

df_null.where(col('Name').like('___')).show()
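
The other standard SQL LIKE wildcard, '%', matches any sequence of characters. A small additional sketch (not part of the original examples):

# Names starting with 'P', using the '%' wildcard
df_null.where(col('Name').like('P%')).show()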

rlike is used specifically to search for regular expression patterns. The example below searches for names that contain the string ‘ana’ one or more times:


df_null.select('Name',col('Name').rlike('(ana)+')).show()



df_null.where(col('Name').rlike('(ana)+')).show()
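
Regular expressions also allow anchored searches. For instance, the following sketch (my own addition) matches names that begin with a vowel:

# Names whose first character is an uppercase vowel
df_null.where(col('Name').rlike('^[AEIOU]')).show()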


Conclusion:


The examples illustrated in this article should be enough to get you started. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article on this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check back whenever you need a reference.




Happy Learning! 😀


