
External Tables 101

This article assumes a basic knowledge of tables in Spark, which was covered in a previous article. If you have not read that article yet, I recommend doing so before proceeding any further. If you are already comfortable with Spark tables, please read on.


External tables (also called unmanaged tables) are tables whose data is managed by us, while Spark only manages the metadata. We can create external tables using any of the data sources supported by Spark, e.g. CSV, Parquet, Avro, etc. An external table acts like a pointer to the actual data source. For instance, we can create an external table using the JDBC data source, which lets the data live in a database such as MySQL; any change made to the external table in Spark will then affect the data in MySQL. The basic syntax to create an external table is as follows:


CREATE TABLE [ IF NOT EXISTS ] table_identifier(col1 type1, col2 type2 …) USING data_source

LOCATION path 


(OR) 


CREATE TABLE [ IF NOT EXISTS ] table_identifier(col1 type1, col2 type2 …) USING data_source OPTIONS (path  '...')



Irrespective of which of the above syntaxes we use, what distinguishes unmanaged table creation from managed table creation is the inclusion of a data source and an external location. The following example creates two tables using the two different syntaxes. Both result in similar tables, and their creation can be verified with the listTables() method. As you can see, the tableType attribute for both tables is set to EXTERNAL. The IF NOT EXISTS clause ensures that no exception is thrown if a table with the same name already exists; without the clause, such a statement would fail with an exception.


spark.sql("""CREATE TABLE Students(ID INT, Name STRING) using CSV 

LOCATION 'C:/Users/path/Students/'""")


spark.sql("""CREATE TABLE Students_ext(ID INT, Name STRING) using CSV 

OPTIONS(path 'C:/Users/path/Students_ext/')""")


print(spark.catalog.listTables())


O/P: [Table(name='students', database='default', description=None, tableType='EXTERNAL', isTemporary=False),
 Table(name='students_ext', database='default', description=None, tableType='EXTERNAL', isTemporary=False)]
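
As a quick illustration of the IF NOT EXISTS clause, re-running the creation statement for an existing table succeeds silently with the clause and fails without it. A minimal sketch reusing the Students table from above (the exact exception type and message vary across Spark versions):

# Re-running the statement with IF NOT EXISTS is a harmless no-op
spark.sql("""CREATE TABLE IF NOT EXISTS Students(ID INT, Name STRING) USING CSV
LOCATION 'C:/Users/path/Students/'""")

# Without the clause, Spark raises an exception because Students already exists
try:
    spark.sql("""CREATE TABLE Students(ID INT, Name STRING) USING CSV
    LOCATION 'C:/Users/path/Students/'""")
except Exception as e:
    print(type(e).__name__, ':', e)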


These statements can also point to directories that do not exist. In that case, Spark creates the required directories and registers an empty unmanaged table, which is exactly what happened above: both tables point to empty directories created by Spark. Any data inserted into these tables will be written as CSV files, the data source we specified. Inserting data into an external table is no different from inserting data into a managed table, so an insert statement like the following results in the creation of a CSV file to store the data.


spark.sql('''INSERT INTO Students_ext VALUES(1,'Helen')''')
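
To confirm that the insert landed in the CSV location, the table can simply be read back (a minimal check reusing the table above):

# The inserted row is served from the CSV file(s) under the table's location
spark.sql("SELECT * FROM Students_ext").show()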



Providing a data source is not mandatory when creating external tables, but it matters. When we do not provide one, Spark falls back to the Hive SerDe configuration, and the file format is decided based on that configuration. In the first example below there is no data source, so the file format is determined by the Hive SerDe configuration. Hive SerDes are much slower than Spark's native serialization, which can cause performance issues for applications reading or writing these files, so it is recommended to provide a data_source in the CREATE TABLE statement. It is also worth noting that what separates managed from unmanaged table creation is the LOCATION clause or the path option followed by a valid path; we are allowed to specify a data_source even when creating managed tables.


# without providing data source

spark.sql("""CREATE TABLE  

Students_ext1(ID INT, Name STRING) LOCATION 'C:/path/Students_ext1/'""")

spark.sql('''INSERT INTO Students_ext1 VALUES(1,'Helen')''')



# without providing LOCATION or path

spark.sql("""CREATE TABLE Students_ext2(ID INT, Name STRING) USING PARQUET""")

print(spark.catalog.listTables())
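
To inspect which format each of these tables ended up with, the DESCRIBE TABLE EXTENDED statement exposes the provider and SerDe details stored in the catalog (a quick inspection sketch; the exact rows shown differ between Spark versions):

# Students_ext1 should report Hive SerDe details, Students_ext2 the parquet provider
spark.sql("DESCRIBE TABLE EXTENDED Students_ext1").show(truncate=False)
spark.sql("DESCRIBE TABLE EXTENDED Students_ext2").show(truncate=False)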



We can also create an unmanaged table using the DataFrame API. The general pattern looks like this:


df.write.mode('overwrite/append') \
    .format('data_source') \
    .option('path', 'path_to_external_table') \
    .saveAsTable('table_name')


The saveAsTable() method writes the contents of the dataframe to a table, whether or not it already exists; if the table does not exist, it is created first and the data is then written to it. We can choose to either overwrite or append using the mode() method. For external tables, it is mandatory to provide the path option, which is the location where the table can be found or created; without it, the table that is created will be a managed table. The following example illustrates the importance of the path option.


df_teachers.write.mode('overwrite').format('parquet') \
    .option('path', 'C:/path/to/Teachers_ext/') \
    .saveAsTable('Teachers')

df_teachers.write.mode('overwrite').format('parquet') \
    .saveAsTable('Teachers_internal')

print(spark.catalog.listTables())



The first table, Teachers, is created as an external table because the path option was provided. In the second case, a managed table is created because no path was specified, and its data is stored in the location set by the config spark.sql.warehouse.dir.
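
The warehouse location used for the managed table can be checked straight from the session configuration (a small sketch):

# Managed tables such as Teachers_internal live under this directory
print(spark.conf.get('spark.sql.warehouse.dir'))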


The default format for files written using the DataFrameWriter is Parquet. When we use the DataFrameWriter to write to tables without specifying a format explicitly, Parquet files are written into the respective location. The following example does not mention the format, and hence Parquet files are written into the path provided.

df_teachers.write.mode('overwrite').option('path', 'C:/path/to/Teachers_default/') \
    .saveAsTable('Teachers_default')
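
This default comes from the spark.sql.sources.default configuration, which can be inspected (and, if needed, overridden) on the session. A small sketch:

# Defaults to 'parquet' unless it has been overridden
print(spark.conf.get('spark.sql.sources.default'))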



Inserting data into unmanaged tables works identically to the corresponding operation on managed tables. The TRUNCATE operation, however, is not allowed for external tables.
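
A quick way to see this restriction is to attempt a truncate and catch the error (a hedged sketch reusing the external Teachers table created above; the exact exception message varies by Spark version):

try:
    spark.sql('''TRUNCATE TABLE Teachers''')
except Exception as e:
    # Spark rejects TRUNCATE on external tables
    print(type(e).__name__, ':', e)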


Dropping tables is slightly different from what we observed with managed tables. The syntax is the same for unmanaged tables, but the statement only removes the registered metadata, not the actual data. If we want the data itself to be deleted, we cannot do it via Spark; it has to be removed directly from the underlying storage.


spark.sql('''DROP TABLE teachers_default''')

spark.catalog.listTables()



As we can see, the entry for the teachers_default external table has been removed from the catalog, while the location where the data was stored is unaffected.
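
One way to verify that only the metadata was removed is to list the table's directory after the drop, using Python's standard library (a sketch reusing the placeholder path from the example above):

import os
# The parquet files written earlier are still present even though the table is gone
print(os.listdir('C:/path/to/Teachers_default/'))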



Conclusion


External tables are just an extension of the managed tables concept, and their usage is relatively limited. I hope this article was a good place to start. If you have any suggestions or questions, please post them in the comment box. This article, like every article in this blog, will be updated based on comments and as I find better ways of explaining things, so kindly bookmark this page and check back whenever you need a reference.



Happy Learning! 😀

