
Dataframe Views in Spark

As mentioned in the article Major APIs in Spark, SQL tables and views are one of the types of distributed collections in Spark. Views in particular are virtual tables that have no physical data stored with them. A view specifies a set of transformations (remember that a read is a transformation as well) on an existing datasource or table. Views are commonly created when there is a need to express transformation logic in SQL. It is important to note that views do not persist data; only metadata is registered against the view name, and that metadata is used to generate the data. Whenever the view is referenced in a SQL statement, the view is evaluated and the returned data is used in the SQL logic. There are two different types of views that can be created:


  1. Temporary View

  2. Global Temporary View


The dataset that is used for the examples in the article can be generated with the following source code:


data_students = [('Raj', 1, 'B', 95, 15, 20, 20, 20, 20),
                 ('Hema', 2, 'G', 67, 9, 18, 3, 19, 18),
                 ('Joju', 2, 'B', 44, 7, 8, 13, 6, 10),
                 ('Priya', 1, 'G', 45, 7, 5, 10, 19, 4),
                 ('Sanjana', 2, 'G', 61, 18, 17, 5, 3, 18),
                 ('Ashok', 1, 'B', 70, 7, 18, 14, 11, 20),
                 ('Chitra', 1, 'G', 66, 20, 17, 17, 7, 5),
                 ('Prem', 2, 'B', 29, 6, 3, 10, 3, 7),
                 ('Pavan', 2, 'B', 53, 3, 18, 5, 16, 11)]

columns_students = ['Name', 'Section_ID', 'Gender',
                    'Total', 'English', 'Maths',
                    'Physics', 'Chemistry', 'Biology']

df_students = spark.createDataFrame(data_students, columns_students)


Temporary View


PySpark provides two methods to create a normal temporary view. A temporary view lives as long as the SparkSession in which it was created. These views are session scoped, meaning they cannot be referenced from other SparkSessions running on the same cluster or machine; in other words, a view created this way cannot be shared between SparkSessions. Both methods take a single argument, which is used as the name for the view, and that name can then be used in Spark SQL run via spark.sql(). The dataframe methods that can be used to create temporary views are:

  1. createTempView()

  2. createOrReplaceTempView()


i.) createTempView()

This method should be used only when we want to create a new view from an existing dataframe. It accepts a single argument, the name of the view, and its return value is None. This method will not overwrite existing views: if a view with the same name already exists, it throws an exception.


print(df_students.createTempView('Students'))

spark.sql('select * from Students').show()


In the following example, the same createTempView() statement from the previous example is run, and the following exception is thrown:


df_students.createTempView('Students')

O/p:

AnalysisException: Temporary view 'Students' already exists;


ii.) createOrReplaceTempView()

This method does everything the above method does, with one difference: it can overwrite existing views. It takes a single input argument, the name of the view, and its return type is None.


print(df_students.createOrReplaceTempView('Students'))

O/p: None


By now it should be clear how these two methods work. It is important to remember that one Spark application can have multiple SparkSessions instantiated inside it. The major difference between a temp view and a global temp view is that a temp view created in one SparkSession cannot be accessed from another SparkSession in the same application.


x = spark.newSession()

x.sql('select * from Students').show()

O/P:

AnalysisException: Table or view not found: Students; line 1 pos 14;

'Project [*]

+- 'UnresolvedRelation [Students]


In the above example, a new session was created using the newSession() method of the SparkSession object. The method takes no arguments. Once the new session was created, an attempt to access the view Students was made, and it failed. This is because of the property mentioned above: temp views are session scoped.


Global Temporary View


Spark provides two similar methods (as with normal temporary views) to create global temporary views. Global temporary views are Spark-application scoped: views created by one SparkSession within a Spark application can be accessed from another SparkSession created in the same application. Global temp views also differ in the way they are accessed. For instance, if there is a global temp view named Students, it cannot be accessed as follows:


spark.sql('select * from Students').show() # using the session object in
                                           # which the view was created

x.sql('select * from Students').show()     # using a different session object
                                           # within the same app


Both of these method calls throw the same exception: 

AnalysisException: Table or view not found: Students; line 1 pos 14;

'Project [*]

+- 'UnresolvedRelation [Students]


This is a straightforward view-not-found exception. The view can be accessed by using the prefix "global_temp.", which is the database inside which global temp views are registered. I am aware that the concept of a database has not been covered in any of the prior articles, so I will cover it extensively in the future; for now it is enough to know that a database in Spark is a logical grouping of related tables and views. Applying this prefix to the method calls above makes the dataframes display correctly.


The two methods available to us to create Global Temp views are:

  1. createGlobalTempView()

  2. createOrReplaceGlobalTempView()


Both of these methods take the same argument, the name of the view we wish to create, and both return None. The first method does not overwrite existing views and throws the same error seen with createTempView(). The second method can both create and overwrite global views.


df_students.createGlobalTempView('Students')

spark.sql('select * from global_temp.Students').show()


x = spark.newSession() # creating a new session

x.sql('select * from global_temp.Students').show()



Now that a global temp view with the name Students has been created, if the same createGlobalTempView() call as in the above example is made again, the following error is produced:


AnalysisException: Temporary view 'students' already exists;


The following example shows that createOrReplaceGlobalTempView() can be called to overwrite an existing view:


df_students.createOrReplaceGlobalTempView('Students')

spark.sql('select * from global_temp.Students').show()

x.sql('select * from global_temp.Students').show()



Now that the views are created, it is important to know how to remove views that are unused or unnecessary. Removing views is done using two methods of the Catalog. The Catalog is the interface through which Spark allows us to create or modify Spark databases, tables, and functions. The Catalog object is an attribute of the SparkSession object, so it can be accessed like any other attribute of a class.


print(spark.catalog)

O/p:

<pyspark.sql.catalog.Catalog object at 0x000001B5CC691360>


The two methods that can be used to drop the views from the Catalog are:


  1. dropTempView()

  2. dropGlobalTempView()


dropTempView() is used to drop normal temporary views, and dropGlobalTempView() is used to drop global temporary views. Both methods return True if the view was dropped and False otherwise. Neither method throws an exception when the view being dropped does not exist.


df_students.createOrReplaceTempView('Students')

print(spark.catalog.dropTempView('Students')) # True: the view existed and was dropped

print(spark.catalog.dropTempView('Students')) # False: no such view, but no exception either


As the example shows, calling dropTempView() twice with the same argument throws no exception.



Global Temporary Views can be dropped using any session within the same Spark Application:


x.catalog.dropGlobalTempView('Students')


Conclusion:


Creating views opens up the possibility of representing our business logic in SQL. The procedure is simple, but knowing the differences between the two types of views is crucial. Try experimenting yourself to confirm the properties of the views. If you have any suggestions or questions, please post them in the comment box. This article, like every article on this blog, will be updated based on comments and as I find better ways of explaining things, so kindly bookmark this page and check it out whenever you need a reference.


Happy Learning! 😀
