
Delimited files in Spark

Delimited files are text-based file formats in which each line represents a single record. One of the best-known delimited formats is CSV (Comma-Separated Values). Spark can both read and write delimited files. Irrespective of the delimiter, Spark writes its output files with the “.csv” suffix. These formats carry no metadata describing the structure of the data, which makes delimited files tricky to work with unless we supply those details explicitly. For this reason Spark provides a range of options for delimited files, which can be set with the option/options methods. If you are not familiar with the Data Sources API, I suggest you read one of the previous articles before continuing with this one.
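To make the option/options methods concrete, here is a minimal sketch. It assumes you already have a SparkSession (called `spark` here) and a file path of your own; the helper names `csv_options`, `read_delimited` and `read_delimited_chained` are hypothetical and exist only for illustration.

```python
def csv_options(sep=",", header=False, infer_schema=False):
    """Build a dict of CSV option names and values (hypothetical helper).

    The keys are Spark CSV option names; boolean flags are passed to Spark
    as the strings "true" / "false".
    """
    return {
        "sep": sep,
        "header": str(header).lower(),
        "inferSchema": str(infer_schema).lower(),
    }


def read_delimited(spark, path, schema=None, **kwargs):
    """Read a delimited file using options(), which sets several options at once.

    `spark` is assumed to be an existing SparkSession. `schema` is optional and
    may be a DDL string such as "id INT, name STRING", which supplies the
    structure that the file itself does not carry.
    """
    reader = spark.read.options(**csv_options(**kwargs))
    if schema is not None:
        reader = reader.schema(schema)
    return reader.csv(path)


def read_delimited_chained(spark, path):
    """Same idea using option(), which sets one key at a time."""
    return (
        spark.read.format("csv")
        .option("sep", "|")
        .option("header", "true")
        .load(path)
    )
```

With a live session, a call such as `read_delimited(spark, "data/people.txt", sep="|", header=True)` (the path is made up) would return a DataFrame. Both styles end up configuring the same reader, so the choice between option and options is mostly a matter of taste.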


Let us now look at the options specific to delimited files. Some apply to both reading and writing, others to only one of the two:



| Option/Configuration | Applicable for | Possible values | Default | Description |
|---|---|---|---|---|
| sep | read & write | Any single string-type character | , | Delimiter that is used in the file |
| header | read & write | “true” or “false” | “false” | Flag that says whether the first line in the file(s) contains column names |
| escape | read | Any single string-type character | \ | Character used to escape quotes inside an already quoted value |
| inferSchema | read | “true” or “false” | “false” | Determines whether Spark will infer the schema of the data |
| ignoreTrailingWhiteSpace | read | “true” or “false” | “false” | Determines whether trailing spaces are ignored in values |
| ignoreLeadingWhiteSpace | read | “true” or “false” | “false” | Determines whether leading spaces are ignored in values |
| nullValue | read & write | Any string | “” (empty string) | String that represents a null value |
| nanValue | read & write | Any string | NaN | String that represents a missing value/NaN |
| positiveInf | read & write | Any string | Inf | String that represents a positive infinite value |
| negativeInf | read & write | Any string | -Inf | String that represents a negative infinite value |
| compression | read & write | none, bzip2, uncompressed, deflate, gzip, lz4, or snappy | none | Compression codec used to read/write a file |
| codec | read & write | none, bzip2, uncompressed, deflate, gzip, lz4, or snappy | none | Same as compression |
| dateFormat | read & write | Date format string that conforms to SimpleDateFormat | yyyy-MM-dd | Format used for date type columns |
| timestampFormat | read & write | Date format string that conforms to SimpleDateFormat | yyyy-MM-dd'T'HH:mm:ss.SSSZZ | Format used for timestamp type columns |
| maxColumns | read | Any integer | 20480 | Maximum number of columns in a file |
| maxCharsPerColumn | read | Any integer | 1000000 | Maximum number of characters in a value |
| maxMalformedLogPerPartition | read | Any integer | 10 | Maximum number of malformed records per partition for which Spark makes log entries |
| multiLine | read | “true” or “false” | “false” | Specifies whether the files are multiline delimited files, where a row might span more than one line |
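As a rough illustration of how the options listed above combine, the sketch below keeps a handful of the read-side defaults in a plain dict and overrides only what differs for a given file. It assumes an existing SparkSession (`spark`) and DataFrame (`df`); `READ_DEFAULTS`, `merged_options`, `read_tab_file` and `write_pipe_gzip` are hypothetical names, and the option values (tab separator, "NA" for nulls, dd/MM/yyyy dates) are invented for the example.

```python
# A subset of the read-side defaults from the options listed above.
READ_DEFAULTS = {
    "sep": ",",
    "header": "false",
    "inferSchema": "false",
    "nullValue": "",
    "dateFormat": "yyyy-MM-dd",
    "multiLine": "false",
}


def merged_options(**overrides):
    """Return a new dict: the defaults, with any overrides applied on top."""
    return {**READ_DEFAULTS, **overrides}


def read_tab_file(spark, path):
    """Read a tab-delimited file with a header row (assumed file layout)."""
    opts = merged_options(
        sep="\t",
        header="true",
        nullValue="NA",           # treat the literal text NA as null
        dateFormat="dd/MM/yyyy",  # dates in the file are assumed to look like 31/12/2020
    )
    return spark.read.options(**opts).csv(path)


def write_pipe_gzip(df, path):
    """Write a DataFrame as pipe-delimited, gzip-compressed files with a header."""
    (
        df.write
        .option("sep", "|")
        .option("header", "true")
        .option("compression", "gzip")
        .mode("overwrite")  # replace any existing output at `path`
        .csv(path)
    )
```

Spelling out the defaults is not required, since Spark applies them on its own; the dict is only there to make visible which settings a particular file deviates from.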


This table covers only a subset of the options that Spark provides for delimited files; for the rest, you can visit this page. In the next article, I will cover examples of reading and writing delimited files using PySpark.


Conclusion:


I believe you should now have an idea of the basic options that Spark provides for reading and writing delimited files. In the next article, I will demonstrate reading and writing using these options. If you have any suggestions or questions, please post them in the comment box. This article, like every article in this blog, will be updated based on comments and as I find better ways of explaining things. So kindly bookmark this page and check back whenever you need a reference.


Happy Learning! 😀

