Since Spark is built to process data, it supports both reading and writing across many different data sources. A data source denotes where the data being used comes from. Some example data sources are:
HBase
Hive
MapR-DB
File system
The Data Sources API is an interface to a variety of data sources and file formats. Of these, there are six core data sources and file formats that Spark supports out of the box:
CSV
JSON
Parquet
ORC
JDBC/ODBC connections
Plain-text files
These are just the core data sources; thanks to the open source community, Spark is by no means limited to interacting with only the above. As you might have guessed, Spark provides two interfaces for these data sources and file formats: one for reading (loading) and another for writing (saving). The two interfaces are:
DataFrameReader
DataFrameWriter
DataFrameReader:
DataFrameReader is the interface for reading data from a data source. It provides methods to configure the details of the source and file format. DataFrameReader has a recommended usage pattern:
DataFrameReader.format(args)
    .option("key", "value")
    .options(**options_dict)
    .schema(args)
    .load()
As you might have guessed, the format method should be called with the data source/file format you want to read. The option and options methods pass configuration for the read; the available options differ for each data source/file format, and these methods are optional. The schema method supplies schema information and is likewise unnecessary in certain cases. The final load method can be called with the path(s) to read, if they were not already specified as part of the options.
One important option that applies to every data source and file format is mode, which determines the read mode. Read modes come in particularly handy when dealing with sources whose quality might not be the best: malformed records are a common occurrence in semi-structured data. The read mode determines how Spark deals with malformed records when they are encountered. There are three read modes: permissive (the default), which sets the fields of a malformed record to null; dropMalformed, which drops malformed records entirely; and failFast, which fails immediately upon encountering a malformed record.
Detailed articles for several data sources will be published later.
DataFrameReader cannot be instantiated directly in Spark. It is created behind the scenes when we create a SparkSession, and it can be accessed as the read attribute of the SparkSession object:
DataFrameWriter:
DataFrameWriter is the interface for writing to data sources/file formats. It provides methods to configure the details of the target data source and file format. DataFrameWriter has two recommended usage patterns:
DataFrameWriter.format(args)
    .options(args)
    .bucketBy(args)
    .partitionBy(args)
    .sortBy(args)
    .save(path)

DataFrameWriter.format(args)
    .options(args)
    .bucketBy(args)
    .partitionBy(args)
    .sortBy(args)
    .saveAsTable(tableName)
The first pattern applies to file-based data sources: the dataframe is written out in one of the supported file formats. The options method specifies any options needed for the write. The partitionBy method partitions the data on the given columns, producing one directory per distinct value. The bucketBy method buckets the data on the given columns, and sortBy sorts the data within each bucket; as the description suggests, sortBy cannot be used without bucketBy. The second pattern is used when writing to Spark tables, and all of the methods discussed for the first pattern apply to it as well. In fact, bucketBy and sortBy are currently only supported together with saveAsTable, so in practice they belong to the second pattern.
Spark accepts a configuration, called the write mode, that determines what happens when data already exists at the target location. There are four write modes: append adds the new data alongside whatever already exists, overwrite replaces the existing data, errorIfExists (the default) fails if data is already present, and ignore silently skips the write when data is found.
option and options methods:
The option method accepts two positional arguments: the first is the configuration/option name, and the second is the value for that configuration. The options method, on the other hand, accepts keyword arguments, with each key being an option name and each value being its setting.
Conclusion:
This article should have given you an idea of Spark's Data Sources API. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article on this blog, will be updated based on comments and as I find better ways of explaining things, so kindly bookmark this page and check it whenever you need a reference.
Happy Learning! 😀