In this article, I will be covering some of the basic useful attributes and utility methods that come with the DataFrame object. As the name suggests, this article is the third part in the DataFrame 101 series. If you have not checked out the previous two articles, do check them out in the following order: Dataframes 101 Part I, Dataframes 101 Part II.
i.) columns:
columns is a property of the DataFrame class; it is a list of all the column names present in the dataframe.
print(df.columns)
O/p: ['Name', 'Section', 'Gender', 'Total', 'English', 'Maths', 'Physics', 'Chemistry', 'Biology']
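A common use of columns is programmatic column selection, for example keeping every column except one. A quick sketch using the dataframe above:
# build a column list from df.columns, dropping 'Total', and select it
subject_cols = [c for c in df.columns if c != 'Total']
df.select(subject_cols).show()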
ii.) dtypes:
dtypes is another property of DataFrame, which is a list of two-element tuples. The first element of each tuple is the name of the column and the second is the type of the column.
print(df.dtypes)
O/P:
[('Name', 'string'), ('Section', 'string'), ('Gender', 'string'), ('Total', 'bigint'), ('English', 'bigint'), ('Maths', 'bigint'), ('Physics', 'bigint'), ('Chemistry', 'bigint'), ('Biology', 'bigint')]
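Since dtypes pairs every column with its type string, it comes in handy when you want to pick out columns of a particular type. A quick sketch based on the output above:
# collect the names of all bigint (numeric) columns
numeric_cols = [name for name, dtype in df.dtypes if dtype == 'bigint']
print(numeric_cols)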
iii.) rdd:
rdd is a property of the dataframe which contains an RDD of Row objects. RDDs come with many methods and properties of their own that allow us to work with the resulting RDD. One of the RDD methods we commonly use when working with dataframes is getNumPartitions(). It returns the number of partitions our current dataframe is split into, since a DataFrame is, under the hood, an RDD of Row objects. The method does not take any arguments. In the following example, the number of partitions of the dataframe is printed using this method:
print(df.rdd.getNumPartitions())
O/p: 8
In this case, there are a total of eight partitions for my dataframe.
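To see that the underlying RDD really does hold Row objects, you can pull a couple of elements out of it (take() is an RDD action that returns a list):
# each element of df.rdd is a pyspark.sql.Row
for row in df.rdd.take(2):
    print(type(row), row)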
iv.) repartition():
The repartition() method allows us to change the number of partitions our dataframe is split into. There are three partitioning strategies in Spark; one of them is implemented using a different method, while the other two can be implemented using repartition(). Based on the arguments we pass to this method, Spark decides which of the two strategies it supports to apply. The method returns a new, repartitioned dataframe. Only basic usage of this method is covered in this article; for a deeper dive check out this article.
In the following example, the dataframe is repartitioned from eight partitions into four. The getNumPartitions() method introduced above is used to check the partition count before and after.
print('Before: ',df.rdd.getNumPartitions())
df_4 = df.repartition(4)
print('After: ',df_4.rdd.getNumPartitions())
O/p:
Before: 8
After: 4
The partitioning strategy implemented here is RoundRobinPartitioning. As I mentioned above, an in-depth look at partitioning and the methods that allow us to control it will be covered in another article.
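For completeness, the other strategy repartition() supports is hash partitioning, which kicks in when you pass one or more column names along with (or instead of) the partition count. A quick sketch using the Section column of our dataframe:
# hash partitioning: rows with the same Section value land in the same partition
df_by_section = df.repartition(4, 'Section')
print(df_by_section.rdd.getNumPartitions())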
v.) schema
schema is a property of the DataFrame class. It contains the Spark Schema for the current dataframe as an object of the StructType class.
print('Type of the object returned is ',type(df.schema))
print(f'The schema of the dataframe is: \n{df.schema}')
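One practical use of the schema property is reusing it elsewhere, for example to create an empty dataframe with exactly the same structure. A quick sketch:
# an empty dataframe that shares df's schema
df_empty = spark.createDataFrame([], df.schema)
df_empty.show()
# the StructType object also exposes helpers such as fieldNames()
print(df.schema.fieldNames())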
vi.) printSchema()
This method prints out the schema information in a tree structure and returns None. The tree structure is beneficial particularly when we are dealing with dataframes that have columns of type Struct. Let's take the following dataframe which consists of a column of type Struct:
data = [['a',['b','b']]]
schema = 'Student_Name STRING, Parents STRUCT<Father:STRING, Mother:STRING>'
df_struct = spark.createDataFrame(data,schema)
df_struct.printSchema()
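For the dataframe above, the printed tree should look roughly like the following, with the Father and Mother fields nested under Parents:
O/p:
root
 |-- Student_Name: string (nullable = true)
 |-- Parents: struct (nullable = true)
 |    |-- Father: string (nullable = true)
 |    |-- Mother: string (nullable = true)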
vii.) createOrReplaceTempView()
If you want to reference an existing dataframe in any SparkSQL that you run using spark.sql(), this method is one of the ways to create a view that can be referenced in the SQL. Views in SparkSQL behave just like views in standard SQL: they act like tables and can participate in queries. Within Spark, a temporary view is also sometimes called a temporary table. Views created using this method only live as long as the SparkSession in which they were created is alive. They are session scoped, meaning they cannot be referenced from other SparkSessions running on the same cluster or machine; in other words, a view created using this method cannot be shared between SparkSessions. The method takes one argument, which is used as the name of the view, and that name can then be used in SparkSQL run through spark.sql(). It is important to note that this method is not the only way to create views or temporary tables for use in SparkSQL. It is commonly used because of its flexibility: it overwrites an existing view with the same name without throwing an error.
df.show()
df.createOrReplaceTempView('students')
df_view = spark.sql('select * from students')
df_view.show()
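To see the 'replace' part of the name in action, you can register a different dataframe under the same view name; no error is raised and subsequent queries simply pick up the new definition. A quick sketch:
# re-registering the same view name silently replaces the old view
df.select('Name', 'Gender').createOrReplaceTempView('students')
spark.sql('select * from students').show()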
Naturally, such views are lazily evaluated as well, just like dataframes. Views do not take up space unless we cache them, just as we can with any dataframe. I will not be covering caching of views or dataframes in this article series, because caching is a relatively advanced Spark topic, at least for the beginners these articles are intended for.
viii.) toDF()
This method allows us to change all the column names in one shot. It returns a new dataframe with the new column names and no change to the column types. The method expects the number of arguments to match the number of columns in the dataframe; if that condition is not met, an exception is thrown. We cannot pass in a collection of names; the names should be passed as individual arguments. For the following example I will be using df_view (the result of spark.sql()):
df_new = df_view.toDF('NAME','SECTION','GENDER','TOTAL','ENGLISH','MATHS','PHYSICS','CHEMISTRY','BIOLOGY')
df_new.show()
We cannot selectively rename columns using this method, but it can still be used when the requirement is to rename most of the columns rather than all of them: for the columns whose names you do not want to change, simply pass their current name as the argument instead of a new one, as in the sketch below.
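For example, to rename only Name and Gender while leaving the rest untouched, the current names are repeated for the other columns:
# only Name and Gender are effectively renamed; the rest keep their current names
df_view.toDF('Full_Name','Section','GEN','Total','English','Maths','Physics','Chemistry','Biology').show()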
ix.) withColumnRenamed()
The last method we saw can be helpful when we want to rename all or most of the columns, but it takes unnecessary effort when we only want to rename one or two. withColumnRenamed() is the method for renaming a single column. It returns a new dataframe with the same number of columns as the one it was called on. The method takes two arguments: the first is the name of the existing column and the second is the new name we wish to give it.
df_view.withColumnRenamed('Name','Full_Name')
.withColumnRenamed('Gender','GEN').show()
In the above case, I have chained two rename calls just to illustrate a very common usage of this method. If Spark is not able to find the original column name provided as the first argument, no error is thrown:
df_view.withColumnRenamed('Name','Full_Name') \
    .withColumnRenamed('asd','GEN').show()
x.) withColumn()
In cases where we want to create a new column or overwrite an existing one, withColumn() is of great help. This method takes two arguments. The first is the name of the column: if the name already exists in the dataframe, that column is overwritten; if it does not, a new column is created. The second argument should be of type Column; commonly it will be a column expression built using the expr() function. The columns referenced inside expr() should belong to the dataframe on which the column is being created or overwritten.
from pyspark.sql.functions import expr

# creating a column
df_view.withColumn('lowercase_gender',expr('lower(Gender)')).show()
# overwriting a column
df_view.withColumn('Gender',expr('lower(Gender)')).show()
We have to be careful when using this method, because it is expensive: every call introduces an internal projection, so calling it many times (for example in a loop) can produce very large query plans and hurt performance. If you want to add many columns, use the select() method in conjunction with expr() instead.
df_view.select('*',expr('lower(Gender) as lowercase_gender')).show()
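If you are on Spark 3.3 or newer, there is also a withColumns() method that takes a dict mapping column names to Column expressions and adds or replaces them in a single projection; a quick sketch, assuming that version is available:
# Spark 3.3+: add or replace several columns in one internal projection
df_view.withColumns({
    'lowercase_gender': expr('lower(Gender)'),
    'uppercase_name': expr('upper(Name)'),
}).show()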
xi.) toPandas()
This method can be used to convert a PySpark dataframe to a Pandas dataframe. It takes no arguments. Use it with caution when the dataframe we are dealing with is huge, because all of the data gets loaded into the driver's memory after conversion to the Pandas DataFrame. Naturally, this method only works if Pandas is installed on the machine you are running it on:
df_pandas = df_view.toPandas()
print(df_pandas)
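If you only need a sample of the data on the driver, it is safer to cut the dataframe down before converting. A quick sketch:
# bring only a bounded number of rows to the driver
df_small = df_view.limit(100).toPandas()
print(df_small.head())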
xii.) Row returning methods
These methods return Row objects when called; whether they return a single Row object or a list of Row objects depends on the method. I have covered the usage of these methods in another article, so I will include the links to the explanations of the following methods:
Conclusion:
With this article, the Dataframes 101 series comes to an end. Much of the basics of DataFrames has been covered across this three-part series. To master dataframes, practice is key, so I would suggest doing some digging in a Jupyter notebook to cement your understanding. If you have any suggestions or questions, please post them in the comment box. This article, like every article on this blog, will be updated based on comments and as I find better ways of explaining things, so kindly bookmark this page and check back whenever you need a reference.
Happy Learning! 😀