Columns 101

 Every column in a Dataframe is of type Column. A Column object comes with many methods with which we can interact with the column and its data. In this article I will dive deeper into the Column object.


Column:

Column objects cannot be created explicitly using the Column() class the way we create Row objects using Row(). Column objects are accessed from the dataframe object. There are two ways in which we can access column objects from a dataframe.


The first way is to access the column as an attribute of the dataframe object, where the attribute name is the column name.

Let us take our proverbial Student table example. The dataframe looks like the following:
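Since only the Name and Total columns are referenced later in this article, a minimal sketch that builds a comparable Student dataframe (the subject columns and the marks below are placeholder assumptions, not the actual data) would be:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('columns-101').getOrCreate()

# Placeholder Student data: a Name, marks in five subjects and a Total column.
# These values are illustrative assumptions; the actual table may differ.
data = [
    ('Alice', 45, 50, 38, 47, 42, 222),
    ('Bob',   35, 40, 28, 33, 30, 166),
    ('Cara',  48, 49, 45, 50, 47, 239),
]
columns = ['Name', 'Maths', 'Physics', 'Chemistry', 'Biology', 'English', 'Total']

df = spark.createDataFrame(data, columns)
df.show()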


Let’s say that I want to access the Name column of the dataframe. 


col1 = df.Name

print(col1)

O/p:

Column<b'Name'>


The output of the above does not display the data that the column contains. Column objects themselves are not used individually to view data; they are used as part of queries to apply some manipulation. Another way to access columns is to index the dataframe object with either the column name or the position of the column.


col1, col01 = df['Name'], df[0]

print(col1,col01)

O/p:

Column<b'Name'> Column<b'Name'>
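
Although printing a column object only shows its name, the same object can be dropped into a query to actually view the data. A minimal sketch:

# A column object holds no data by itself; use it inside a query to see values.
df.select(df.Name).show()
df.select(df['Name']).show()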


The next two functions for working with columns, col() and expr(), are essential when it comes to queries that involve any type of transformation. They are typically used in conjunction with the other methods of the dataframe, and they return column objects.


from pyspark.sql.functions import col,expr

a,b = col('Name'),expr('Name')

print(a,b)

O/p:

Column<b'Name'> Column<b'Name'>


These functions become very handy when we want to apply transformations to columns. The following example, which calculates the average mark per subject for each student, illustrates a basic transformation expressed using both of them:


df.select(col('Total')/5,expr('Total/5')).show()


Both these functions perform the same job, but in my opinion the expr function is clearer and more intuitive to use. expr stands for SQL-like expression, so we can call expr with any valid SQL-like expression. If we want to see the data of students with more than 50 marks as their total, we can use the following query:


df.where(col('Total') > 50).show() (or) df.where(expr('Total > 50')).show()
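
As a rough sketch of how these pieces combine in practice (the alias name below is my own choice), col() and expr() can be mixed freely within a single query:

# Filter with expr(), compute and sort with col(); the alias 'AvgMark' is arbitrary.
df.where(expr('Total > 50')) \
  .select(col('Name'), (col('Total') / 5).alias('AvgMark')) \
  .orderBy(col('AvgMark').desc()) \
  .show()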



I believe you now have an understanding of how these two functions are used to access and work with column objects.

Column Methods:


Columns come with multiple methods that can be used to manipulate as well as inspect them. The following list describes the available methods:



alias(*alias, **kwargs) : Returns the column(s) aliased with the given name(s). Accepts keyword arguments that correspond to the metadata of the column.

asc() : Returns an ascending-order sort expression based on the column (nulls come first). Typically used within orderBy().

astype(dataType) : Returns a column object cast to the given dataType. An alias for cast().

between(lowerBound, upperBound) : Returns a boolean expression after evaluating the between condition on the column (col >= lowerBound and col <= upperBound).

bitwiseAND(other) : Performs a bitwise AND of this expression with another expression (passed as the argument other).

bitwiseOR(other) : Performs a bitwise OR of this expression with another expression (passed as the argument other).

bitwiseXOR(other) : Performs a bitwise XOR of this expression with another expression (passed as the argument other).

cast(dataType) : Casts or converts the column to dataType.

contains(other) : Evaluates whether other is present inside the values of the column. Returns a column of boolean type denoting the presence of other.

desc() : Returns a descending-order sort expression based on the column. Typically used within orderBy().

endswith(other) : Works on string columns and evaluates whether each value ends with other. Returns a column of boolean type denoting the match. Argument other should be a string, not a regex. Similar to Python's str.endswith().

eqNullSafe(other) : Equality test that is safe for nulls. Returns a column of boolean type denoting the result of the check (other is a value).

isin(*args) : Checks whether each column value is a member of the given list of values. Returns a column of boolean type as the result.

isNotNull() : Checks whether each value is not null. Returns a column of boolean type as the result.

isNull() : Checks whether each value is null. Returns a column of boolean type as the result.

like(other) : Similar to the SQL LIKE operator. Do not pass regex strings to this method. Returns a column of boolean type as the result.

rlike(other) : Similar to REGEXP or REGEXP_LIKE; accepts a regex pattern. Returns a column of boolean type as the result.

when(condition, value) : Similar to CASE WHEN statements in SQL. Returns a column object with values that are the result of the given conditions.

otherwise(value) : Supplies the default value for the when() expression it is chained to. It is not necessary to call this method when calling when(); if it is not called, the value for the default case is None.

startswith(other) : Works on string columns and evaluates whether each value starts with other. Returns a column of boolean type denoting the match. Argument other should be a string, not a regex. Similar to Python's str.startswith().

substr(startPos, length) : Similar to SQL SUBSTR(). Returns a string column whose values are substrings of the original column.


The list above includes only the commonly used methods. For a full list and description of the methods provided, please refer to this page. Examples of how these methods work will be covered in the next article.
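
As a quick preview before that, here is a minimal sketch of a few of these methods in action (the literal values and alias names below are placeholders):

from pyspark.sql.functions import col, when

# alias() and cast(): convert Total to double and rename the result.
df.select(col('Total').cast('double').alias('TotalAsDouble')).show()

# between() and isin(): boolean expressions used as filters.
df.where(col('Total').between(100, 200)).show()
df.where(col('Name').isin('Alice', 'Bob')).show()

# A CASE WHEN style derived column. Note that when() here is the function from
# pyspark.sql.functions; the Column.when() method chains further conditions onto it.
df.select(col('Name'),
          when(col('Total') > 200, 'Pass').otherwise('Fail').alias('Result')).show()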


Conclusion:


The column concepts covered in this article should be enough to get you started. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check back whenever you need a reference.


Happy Learning! 😀
