Skip to main content

Columns 101

 Every column in a Dataframe is of type Column. This Column class’s object comes with many methods with which we can interact with the columns and its data. In this article I will dive deeper into the Column object


Column:

Column objects cannot be created explicitly using the class Column()like we can create row objects using Row(). Column objects are accessed from the dataframe object. There are two ways in which we can access the column objects from a data frame. 


The first way is to access the attribute, which is the column name, of the dataframe object.

Let us take our proverbial Student table example. The dataframe looks like the following:


Let’s say that I want to access the Name column of the dataframe. 


col1 = df.Name

print(col1)

O/p:

Column<b'Name'>


The output of the above does not display the data that the column contains. Column objects themselves are not used individually to view data. Column objects are used as part of queries to include some manipulation. Another way to access columns is to index the dataframe object with either the column name or the position of the column. 


col1,col01 = df[‘Name’],df[0]

print(col1,col01)

O/p:

Column<b'Name'> Column<b'Name'>


The next two methods of working with columns are very essential when it comes to queries that involve any type of transformation. The two methods are col() and expr(). These methods are typically used in conjunction with the other methods of the dataframe. These methods return column objects.


from pyspark.sql.functions import col,expr

a,b = col('Name'),expr('Name')

print(a,b)

O/p:Column<b'Name'> Column<b'Name'>


These methods become very handy when we want to apply transformations on columns. The following example, in which we calculate average mark per subject for each student, illustrates basic transformation expressed using both methods:


df.select(col('Total')/5,expr('Total/5')).show()


Both these methods perform the same job, but in my opinion expr method is more clear and intuitive to use. expr method stands for SQL like expression, so we can call the expr method with any valid SQL like expression. If we want to see data of students with more than 50 marks as their total, we can use the following query:


df.where(col('Total')>50).show() (or) df.where(expr('Total>50’).show()



I believe now you must have an understanding about how these two methods are used to access and work with column objects.

Column Methods:


Columns come with multiple methods, these methods can be used to manipulate the columns as well as inspect them. The following tables list the available methods



Method


alias(*alias, **kwargs)

Returns the column(s) aliased with the given name(s). Accepts keyword arguments that correspond to the metadata for the column

asc()

Returns an ascending order sort expression based on the column(nulls come at the top). Typically used within orderBy().

asType(dataType)

Returns a column object that is casted into the given dataType.

between(lowerBound,upperBound)

Returns a boolean expression after evaluating the between expression on the given column (col >= lowerBound and col <= upperBound).

bitwiseAND(other)


Perform bitwise AND operation of this expression with another expression(passed as argument other)

bitwiseOR(other)

Perform bitwise OR operation of this expression with another expression(passed as argument other)

bitwiseXOR(other)

Perform bitwise XOR operation of this expression with another expression(passed as argument other)

cast(dataType)


Cast or convert the column to dataType.

contains(other)

Evaluates whether other is present inside the values of a column.  Returns a column of boolean type denoting the presence of other .

desc()

Returns an descending order sort expression based on the column. Typically used within orderBy().

endswith(other)

Works on string value columns and evaluates whether the value ends with other.Returns a column of boolean type denoting the match. Argument other can only be a string. Similar to endswith()

eqNullSafe(other)

Equality test that is safe for nulls. Returns a column of boolean type denoting the result of the check(other is a value).

isin(*args)

Checks whether each column value is a member of a given list of values. Returns a column of boolean type as result.

isNotNull()

Checks whether each value is not Null.Returns a column of boolean type as result.


isNull()

Checks whether each value is Null.Returns a column of boolean type as result.

like()

Similar to SQL LIKE operator. Do not pass regex strings to this method. Returns a column of boolean type as result.


rlike()

Similar to REGEXP or REGEXP_LIKE. Returns a column of boolean type as result.


when()

Similar to case when statements in SQL. Returns a column object with values that are the result of the given conditions.

otherwise()

Default value for a case when a statement passed as an argument to this method. It is not necessary to call this method when calling a when() method. By default the value will be None for the default case, if this method is not called. 

startswith(other)


Works on string value columns and evaluates whether the values starts with other.Returns a column of boolean type representing the match. Argument other can only be a string. Similar to startswith()

substr(startPos,length)

Similar to SUBSTR(). Returns a string column with values which are substring of the original column


The table above lists only the commonly used methods. For a full list and description of methods provided, please refer to this page. The examples of workings of these methods will be covered in the next article.


Conclusion:


Columns covered in this article should be enough to get you started. If you have any suggestions or questions please post it in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and checkout whenever you need some reference. 


Happy Learning! 😀

Comments

Popular posts from this blog

Rows in Spark 101

  Every row in a Dataframe is of type Row and this object comes with many methods with which we can interact with the data that it consists of. In this article I will dive deeper into the Row object Row: The objects of Row class can hold multiple values that correspond to columns. We can create Row objects separately or access Row objects from the existing dataframe. The Row class can be imported as follows: from pyspark.sql import Row  There are several methods that can help retrieve row objects from the dataframe. Some of them are listed below: df.collect() : This method returns a list of all rows that are present in the dataframe. The result is returned to the driver, so we should be careful while running this method as the driver might not have sufficient memory to hold all the Row objects.  df.take(num) : This method returns a list of a specific number of records, specified by the argument num that we are allowed to pass. The results are returned to the driver, so ...

Introduction to Structured Streaming

  Structured Streaming is one of the APIs provided by Spark to address stream processing needs. The API is based on the familiar Structured APIs, so the data are treated as tables with only one important distinction which is that it is a table to which data is continuously appended. As seen in the previous article, because of the base, which is the Structured API, we can perform SQL queries over our data. So the amount of technical capability required to construct operations on data is low. Much of the complexity that revolves around the stream processing is abstracted and we are left with configuring and expressing operations at a high level. But still there are concepts of Structured Streaming that one has to be familiar with before starting to work with it. In this article, I would like to provide a basic description of the landscape which will further be explored in the coming articles. To ones who are familiar with Structured APIs and its concepts, there are not many new conce...

Structured API Execution

  Spark Structured API execution can be split up into 3 high level steps: The syntax of the code is checked (common for any code) If the code is found to be valid, Spark converts this to a Logical Plan This Logical plan is then converted to Physical plan. In this process, any optimizations that are possible are applied. Spark executes this Physical Plan(RDD manipulations) on the cluster The first step does not just apply to Spark but to all types of programs. Python checks for syntax of the code written, and fails the job if there is any violation of syntax. The second step is what I would describe as Spark-specific. In this step, a sequence of three types of Logical plans are created, with the most optimal one at the end. Logical plans are abstract, with no clarity on low-level execution details. We could imagine it to be similar to a flow chart that gives a high-level idea about how operations are to be performed and in what order. As the high-level is a good start, a detailed pl...