Delimited files in Spark

Delimited files are text-based file formats in which each line represents a single record. One of the best-known delimited file formats is CSV (Comma Separated Values). Spark can read from and write to delimited files. Irrespective of the delimiter, the file suffix will be “.csv”. These file formats carry no metadata describing the structure of the data, which makes them tricky to work with unless we provide all the details explicitly. Hence Spark provides various options to configure when dealing with delimited files. These configurations can be set with the help of the option/options methods. If you are not familiar with the Data Sources API, I would suggest you refer to one of the previous articles before continuing with this one.
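
As a minimal sketch of the two styles, the helper functions below set the same configuration once with chained option calls and once with a single options call. The function names are hypothetical (they are not part of Spark), spark is assumed to be an existing SparkSession, and path is assumed to point to a pipe-delimited file with a header line.

```python
def read_with_option(spark, path):
    """Set each configuration with its own option() call."""
    return (
        spark.read
        .option("sep", "|")        # assumed delimiter for this sketch
        .option("header", "true")  # first line holds the column names
        .csv(path)
    )


def read_with_options(spark, path):
    """Set the same configurations in a single options() call."""
    return spark.read.options(sep="|", header="true").csv(path)
```

Both helpers should produce the same DataFrame; which style to use is mostly a matter of taste, although options is handy when the settings already live in a dictionary.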

Let us now look at the options/configurations that are specific to delimited files. Some of them apply only to reading, while others apply to both reading and writing:

Option/Configuration	Applicable for	Possible values	Default	Description
sep	read&write	Any single character	,	Delimiter that is used in the file
header	read&write	“true” or “false”	“false”	Flag that says whether the first line in the file(s) contains column names
escape	read	Any single character	\	Character used to escape other characters in the file
inferSchema	read	“true” or “false”	“false”	Determines whether Spark will infer the schema of the data
ignoreTrailingWhiteSpace	read	“true” or “false”	“false”	Determines whether trailing spaces are ignored in values
ignoreLeadingWhiteSpace	read	“true” or “false”	“false”	Determines whether leading spaces are ignored in values
nullValue	read&write	Any string	“”(empty string)	String that represents a null value
nanValue	read&write	Any string	NaN	String that represents a missing value/NaN
positiveInf	read&write	Any string	Inf	String that represents a positive infinite value
negativeInf	read&write	Any string	-Inf	String that represents a negative infinite value
compression	read&write	none, uncompressed, bzip2, deflate, gzip, lz4, or snappy	none	Compression codec used to read/write a file
codec	read&write	none, uncompressed, bzip2, deflate, gzip, lz4, or snappy	none	Same as compression
dateFormat	read&write	Date format string that conforms to Java's SimpleDateFormat	yyyy-MM-dd	Format used for representing date type columns
timestampFormat	read&write	Date format string that conforms to Java's SimpleDateFormat	yyyy-MM-dd'T'HH:mm:ss.SSSZZ	Format used for representing timestamp type columns
maxColumns	read	Any integer	20480	Maximum number of columns in a file
maxCharsPerColumn	read	Any integer	1000000	Maximum number of characters in a value
maxMalformedLogPerPartition	read	Any integer	10	Maximum number of malformed records per partition for which Spark writes log entries
multiLine	read	“true” or “false”	“false”	Specifies whether the files are multiline delimited files, where a single record might span more than one line
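
To make the table a little more concrete, here is a hedged sketch that keeps a few of the options in a plain dictionary and passes them to the reader and the writer. The option names and the read versus read&write split are taken from the table above. The helper names, the “|” delimiter, the “NA” null marker and the paths are assumptions made up for illustration, and spark is assumed to be an existing SparkSession.

```python
# A few options from the table, chosen for an assumed pipe-delimited input.
READ_OPTIONS = {
    "sep": "|",             # assumed delimiter
    "header": "true",       # first line holds the column names
    "inferSchema": "true",  # let Spark guess the column types
    "nullValue": "NA",      # assumed marker for nulls in the input
}

# Options that the table lists as applicable to reading only.
READ_ONLY = {
    "escape", "inferSchema", "ignoreTrailingWhiteSpace",
    "ignoreLeadingWhiteSpace", "maxColumns", "maxCharsPerColumn",
    "maxMalformedLogPerPartition", "multiLine",
}


def write_options(options):
    """Keep only the options that the table marks as read&write."""
    return {k: v for k, v in options.items() if k not in READ_ONLY}


def copy_delimited(spark, src, dst):
    """Read src with READ_OPTIONS and write it to dst with the shared subset."""
    df = spark.read.options(**READ_OPTIONS).csv(src)
    df.write.options(**write_options(READ_OPTIONS)).mode("overwrite").csv(dst)
    return df
```

Keeping the options in a dictionary makes it easy to reuse the same delimiter, header and null settings on both sides, so the written files can be read back with the same configuration.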

The table above lists only some of the options that Spark provides for delimited files; for the full list, you can visit this page. In the next article, I will cover examples of reading and writing delimited files using PySpark.

Conclusion:

I believe you should now have an idea of the basic options that Spark provides for reading and writing delimited files. In the next article, I will demonstrate reading and writing using these options. If you have any suggestions or questions, please post them in the comment box. This article, very much like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check back whenever you need a reference.

Happy Learning! 😀

0x3eb
