File System Storage

Allows reading data from and writing data to the default Spark file system, in both batch and streaming modes.

Code repository: https://github.com/AmadeusITGroup/dataio-framework/tree/main/src/main/scala/com/amadeus/dataio/pipes/storage

Common

The following fields are available for all storage components:

| Name | Mandatory | Description | Example | Default |
|------|-----------|-------------|---------|---------|
| Path | Yes | The directory where the data is stored. Note that you may rely on path templatization. | Path = "hdfs://path/to/data" | |
| Format | No | The format to use to read or write the data. | Format = "csv" | The value of spark.sql.sources.default in the Spark configuration |
| Schema | No | The schema of the input data. See the schema definitions page for more information. | Schema = "myproject.models.MySchema" | |
| DateFilter | No | Pre-filters the input to focus on a specific date range. | | |
| Repartition | No | Matches the Spark Dataset repartition function, by number, by columns, or both. One of the Number or Columns arguments is mandatory. | Repartition { Number = 10, Columns = "upd_date" } | |
| Coalesce | No | Matches the Spark Dataset coalesce function. | Coalesce = 10 | |
| Options | No | Spark options, as key = value pairs. The list of available options can be found in the official Spark API documentation for DataFrameReader and DataFrameWriter. | Options { header = true } | |

The DateFilter field is never mandatory, but be aware that omitting it could result in processing years of data.
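As an illustration, several of the common fields above can be combined in a single component configuration. This is a minimal sketch using only the fields documented in the table; the values are placeholders, and the surrounding block that owns these fields is described in the Batch and Streaming sections below:

```
Path = "hdfs://path/to/data"
Format = "csv"
Schema = "myproject.models.MySchema"
Repartition { Number = 10, Columns = "upd_date" }
Options { header = true }
```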


Batch

Input

Type: com.amadeus.dataio.pipes.storage.batch.StorageInput

Output

Type: com.amadeus.dataio.pipes.storage.batch.StorageOutput
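As a sketch of how a batch input and output pair might look in a configuration file, assuming HOCON-style Input and Output blocks (the block names and nesting are assumptions for this example; only the Type values and the common fields come from this page):

```
Input {
  Type = "com.amadeus.dataio.pipes.storage.batch.StorageInput"
  Path = "hdfs://path/to/input"
  Format = "csv"
  Options { header = true }
}

Output {
  Type = "com.amadeus.dataio.pipes.storage.batch.StorageOutput"
  Path = "hdfs://path/to/output"
  Format = "parquet"
  Coalesce = 10
}
```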


Streaming

Input

Type: com.amadeus.dataio.pipes.storage.streaming.StorageInput

Output

Type: com.amadeus.dataio.pipes.storage.streaming.StorageOutput
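A streaming pair could be sketched the same way, again assuming HOCON-style Input and Output blocks with placeholder values. Note that Spark Structured Streaming file sources generally require an explicit schema unless schema inference is enabled, so Schema is typically set on the streaming input:

```
Input {
  Type = "com.amadeus.dataio.pipes.storage.streaming.StorageInput"
  Path = "hdfs://path/to/input"
  Format = "json"
  Schema = "myproject.models.MySchema"
}

Output {
  Type = "com.amadeus.dataio.pipes.storage.streaming.StorageOutput"
  Path = "hdfs://path/to/output"
  Format = "parquet"
}
```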