Columns 101

Every column in a Dataframe is of type Column. This Column class’s object comes with many methods with which we can interact with the columns and its data. In this article I will dive deeper into the Column object

Column:

Column objects cannot be created explicitly using the class Column()like we can create row objects using Row(). Column objects are accessed from the dataframe object. There are two ways in which we can access the column objects from a data frame.

The first way is to access the attribute, which is the column name, of the dataframe object.

Let us take our proverbial Student table example. The dataframe looks like the following:

Let’s say that I want to access the Name column of the dataframe.

col1 = df.Name

print(col1)

O/p:

Column<b'Name'>

The output of the above does not display the data that the column contains. Column objects themselves are not used individually to view data. Column objects are used as part of queries to include some manipulation. Another way to access columns is to index the dataframe object with either the column name or the position of the column.

col1,col01 = df[‘Name’],df[0]

print(col1,col01)

O/p:

Column<b'Name'> Column<b'Name'>

The next two methods of working with columns are very essential when it comes to queries that involve any type of transformation. The two methods are col() and expr(). These methods are typically used in conjunction with the other methods of the dataframe. These methods return column objects.

from pyspark.sql.functions import col,expr

a,b = col('Name'),expr('Name')

print(a,b)

O/p:Column<b'Name'> Column<b'Name'>

These methods become very handy when we want to apply transformations on columns. The following example, in which we calculate average mark per subject for each student, illustrates basic transformation expressed using both methods:

df.select(col('Total')/5,expr('Total/5')).show()

Both these methods perform the same job, but in my opinion expr method is more clear and intuitive to use. expr method stands for SQL like expression, so we can call the expr method with any valid SQL like expression. If we want to see data of students with more than 50 marks as their total, we can use the following query:

df.where(col('Total')>50).show() (or) df.where(expr('Total>50’).show()

I believe now you must have an understanding about how these two methods are used to access and work with column objects.

Column Methods:

Columns come with multiple methods, these methods can be used to manipulate the columns as well as inspect them. The following tables list the available methods

Method
alias(alias, *kwargs)	Returns the column(s) aliased with the given name(s). Accepts keyword arguments that correspond to the metadata for the column
asc()	Returns an ascending order sort expression based on the column(nulls come at the top). Typically used within orderBy().
asType(dataType)	Returns a column object that is casted into the given dataType.
between(lowerBound,upperBound)	Returns a boolean expression after evaluating the between expression on the given column (col >= lowerBound and col <= upperBound).
bitwiseAND(other)	Perform bitwise AND operation of this expression with another expression(passed as argument other)
bitwiseOR(other)	Perform bitwise OR operation of this expression with another expression(passed as argument other)
bitwiseXOR(other)	Perform bitwise XOR operation of this expression with another expression(passed as argument other)
cast(dataType)	Cast or convert the column to dataType.
contains(other)	Evaluates whether other is present inside the values of a column. Returns a column of boolean type denoting the presence of other .
desc()	Returns an descending order sort expression based on the column. Typically used within orderBy().
endswith(other)	Works on string value columns and evaluates whether the value ends with other.Returns a column of boolean type denoting the match. Argument other can only be a string. Similar to endswith()
eqNullSafe(other)	Equality test that is safe for nulls. Returns a column of boolean type denoting the result of the check(other is a value).
isin(*args)	Checks whether each column value is a member of a given list of values. Returns a column of boolean type as result.
isNotNull()	Checks whether each value is not Null.Returns a column of boolean type as result.
isNull()	Checks whether each value is Null.Returns a column of boolean type as result.
like()	Similar to SQL LIKE operator. Do not pass regex strings to this method. Returns a column of boolean type as result.
rlike()	Similar to REGEXP or REGEXP_LIKE. Returns a column of boolean type as result.
when()	Similar to case when statements in SQL. Returns a column object with values that are the result of the given conditions.
otherwise()	Default value for a case when a statement passed as an argument to this method. It is not necessary to call this method when calling a when() method. By default the value will be None for the default case, if this method is not called.
startswith(other)	Works on string value columns and evaluates whether the values starts with other.Returns a column of boolean type representing the match. Argument other can only be a string. Similar to startswith()
substr(startPos,length)	Similar to SUBSTR(). Returns a string column with values which are substring of the original column

The table above lists only the commonly used methods. For a full list and description of methods provided, please refer to this page. The examples of workings of these methods will be covered in the next article.

Conclusion:

Columns covered in this article should be enough to get you started. If you have any suggestions or questions please post it in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and checkout whenever you need some reference.

Happy Learning! 😀

0x3eb

Search This Blog

Columns 101

Comments

Post a Comment

Popular posts from this blog

Rows in Spark 101

Introduction to Structured Streaming

Spark Streaming APIs