
Dataframe Views in Spark

As mentioned in the article Major APIs in Spark, SQL tables and views are one of the types of distributed collections in Spark. Views in particular are virtual tables that have no physical data stored with them. A view specifies a set of transformations (remember that a read is a transformation as well) on an existing datasource or table. Views are commonly created when there is a need to express transformation logic in SQL. It is important to note that views do not persist data; only metadata is registered against the view name, and that metadata is used to generate the data. Whenever the view is referenced in a SQL statement, the view is evaluated and the returned data is used in the SQL logic. There are two different types of views that can be created:


  1. Temporary View

  2. Global Temporary View


The dataset that is used for the examples in the article can be generated with the following source code:


data_students = [('Raj', 1, 'B', 95, 15, 20, 20, 20, 20),
                 ('Hema', 2, 'G', 67, 9, 18, 3, 19, 18),
                 ('Joju', 2, 'B', 44, 7, 8, 13, 6, 10),
                 ('Priya', 1, 'G', 45, 7, 5, 10, 19, 4),
                 ('Sanjana', 2, 'G', 61, 18, 17, 5, 3, 18),
                 ('Ashok', 1, 'B', 70, 7, 18, 14, 11, 20),
                 ('Chitra', 1, 'G', 66, 20, 17, 17, 7, 5),
                 ('Prem', 2, 'B', 29, 6, 3, 10, 3, 7),
                 ('Pavan', 2, 'B', 53, 3, 18, 5, 16, 11)]

columns_students = ['Name', 'Section_ID', 'Gender',
                    'Total', 'English', 'Maths',
                    'Physics', 'Chemistry', 'Biology']

df_students = spark.createDataFrame(data_students, columns_students)


Temporary View


PySpark provides two methods to create a normal temporary view. A temporary view lives as long as the SparkSession in which it was created. These views are session scoped, meaning they cannot be referenced from other SparkSessions running on the same cluster or machine; in other words, a view created this way cannot be shared between SparkSessions. Both methods take a single argument, which is used as the name for the view, and that name can then be used in Spark SQL run via spark.sql(). The dataframe methods that can be used to create temporary views are:

  1. createTempView()

  2. createOrReplaceTempView()


i.) createTempView()

This method should be used only when we want to create a new view from an existing dataframe. It accepts a single argument, the name of the view, and its return value is None. This method will not overwrite existing views: if a view with the same name already exists, it throws an exception.


print(df_students.createTempView('Students'))

spark.sql('select * from Students').show()


In the following example, the same createTempView() statement from the previous example is run, and the following exception is thrown:


df_students.createTempView('Students')

O/p:

AnalysisException: Temporary view 'Students' already exists;


ii.) createOrReplaceTempView()

This method does everything the above method does, with one difference: it can overwrite existing views. It takes a single input argument, the name of the view, and its return type is None.


print(df_students.createOrReplaceTempView('Students'))

O/p: None


By now it should be clear how these two methods work. It is important to remember that one Spark application can have multiple SparkSessions instantiated inside it. The major difference between a temp view and a global temp view is that a temp view created in one SparkSession cannot be accessed from another SparkSession in the same application.


x = spark.newSession()

x.sql('select * from Students').show()

O/P:

AnalysisException: Table or view not found: Students; line 1 pos 14;

'Project [*]

+- 'UnresolvedRelation [Students]


In the above example, a new session was created using the newSession() method of the SparkSession object. The method takes no arguments. Once the new session was created, an attempt to access the view Students was made, and it failed. This is because of the property mentioned above: temp views are session scoped.


Global Temporary View


Spark provides two similar methods (as with normal temporary views) to create global temporary views. Global temporary views are Spark-application scoped: views created by one SparkSession within a Spark application can be accessed from another SparkSession created in the same application. Global temp views also differ in the way they are accessed. For instance, if there is a global temp view named Students, it cannot be accessed as follows:


spark.sql('select * from Students').show() # using the session object in
                                           # which the view was created

x.sql('select * from Students').show()     # using a different session object
                                           # within the same app


Both of these method calls throw the same exception: 

AnalysisException: Table or view not found: Students; line 1 pos 14;

'Project [*]

+- 'UnresolvedRelation [Students]


This is a straightforward view-not-found exception. The view can be accessed by using the prefix "global_temp.", which is the database inside which global temp views are registered. I am aware that the concept of a database has not been covered in any of the prior articles, so I will cover it extensively in the future; for now it is enough to know that a database in Spark is a logical grouping of related tables and views. Applying this prefix to the method calls above makes the dataframes display correctly.


The two methods available to us to create Global Temp views are:

  1. createGlobalTempView()

  2. createOrReplaceGlobalTempView()


Both of these methods take the same argument, the name of the view we wish to create, and both return None. The first method does not overwrite existing views and throws the same error seen with createTempView(). The second method can both create and overwrite global views.


df_students.createGlobalTempView('Students')

spark.sql('select * from global_temp.Students').show()


x = spark.newSession() # creating a new session

x.sql('select * from global_temp.Students').show()



Now that a global temp view with the name Students has been created, if the same createGlobalTempView() call as in the above example is made again, the following error is produced:


AnalysisException: Temporary view 'students' already exists;


The following example shows that createOrReplaceGlobalTempView() can be called to overwrite an existing view:


df_students.createOrReplaceGlobalTempView('Students')

spark.sql('select * from global_temp.Students').show()

x.sql('select * from global_temp.Students').show()



Now that the views are created, it is important to know how to remove views that are unused or unnecessary. Removing views is done using two methods of the Catalog. The Catalog is the interface through which Spark allows us to create or modify Spark databases, tables, and functions. The Catalog object is an attribute of the SparkSession object, so it can be accessed like any other attribute of a class.


print(spark.catalog)

O/p:

<pyspark.sql.catalog.Catalog object at 0x000001B5CC691360>


The two methods that can be used to drop the views from the Catalog are:


  1. dropTempView()

  2. dropGlobalTempView()


dropTempView() is used to drop normal temporary views, and dropGlobalTempView() is used to drop global temporary views. Both methods return True if the view was dropped and False otherwise. Neither method throws an exception when the view being dropped does not exist.


df_students.createOrReplaceTempView('Students')

print(spark.catalog.dropTempView('Students')) # True: the view existed and was dropped

print(spark.catalog.dropTempView('Students')) # False: no such view, but no exception either


As the example shows, calling dropTempView() twice with the same argument throws no exception.



Global Temporary Views can be dropped using any session within the same Spark Application:


x.catalog.dropGlobalTempView('Students')


Conclusion:


Creating views opens up the possibility of representing our business logic in SQL. The procedure is simple, but knowing the differences between the two types of views is crucial. Try experimenting yourself to confirm the properties of the views. If you have any suggestions or questions, please post them in the comment box. This article, like every article on this blog, will be updated based on comments and as I find better ways of explaining things, so kindly bookmark this page and check it out whenever you need a reference.


Happy Learning! 😀
