Every column in a DataFrame is of type Column. Objects of this class come with many methods for interacting with a column and its data. In this article I will dive deeper into the Column object.
Column:
Column objects are not created by calling the Column() class directly, the way we can create row objects using Row(). Instead, Column objects are accessed from the dataframe object. There are two ways in which we can access the Column objects of a dataframe.
The first way is to access the column name as an attribute of the dataframe object.
Let us take our proverbial Student table example. The dataframe looks like the following:
Let’s say that I want to access the Name column of the dataframe.
col1 = df.Name
print(col1)
O/p:
Column<b'Name'>
The output above does not display the data that the column contains. A Column object is not used on its own to view data. Instead, Column objects are used inside queries to describe some manipulation of the data.
Another way to access columns is to index the dataframe object with either the column name or the position of the column.
col1, col01 = df['Name'], df[0]
print(col1,col01)
O/p:
Column<b'Name'> Column<b'Name'>
The next two ways of working with columns are essential for queries that involve any kind of transformation. They are the functions col() and expr(). Both return Column objects and are typically used in conjunction with the other methods of the dataframe.
from pyspark.sql.functions import col,expr
a,b = col('Name'),expr('Name')
print(a,b)
O/p:
Column<b'Name'> Column<b'Name'>
These functions become very handy when we want to apply transformations to columns. The following example, which calculates the average mark per subject for each student by dividing Total by 5, expresses the same basic transformation using both functions:
df.select(col('Total')/5,expr('Total/5')).show()
Both do the same job, but in my opinion expr is clearer and more intuitive to use. expr stands for expression: it takes a SQL-like expression as a string, so we can call it with any valid SQL-like expression. If we want to see the data of students whose total is more than 50 marks, we can use either of the following queries:
df.where(col('Total') > 50).show()
(or)
df.where(expr('Total > 50')).show()
By now you should have an understanding of how these two functions are used to access and work with Column objects.
Column Methods:
Columns come with multiple methods that can be used to manipulate the columns as well as inspect them. The following tables list the available methods.
The table above lists only the commonly used methods. For a full list and description of the methods provided, please refer to this page. Examples of how these methods work will be covered in the next article.
Conclusion:
The material on columns covered in this article should be enough to get you started. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check back whenever you need a reference.