# File System Storage

Allows reading and writing data through Spark's default file system interactions, in both batch and streaming.
Code repository: https://github.com/AmadeusITGroup/dataio-framework/tree/main/src/main/scala/com/amadeus/dataio/pipes/storage
## Common
The following fields are available for all storage components:
Name | Mandatory | Description | Example | Default |
---|---|---|---|---|
Path | Yes | The directory where the data is stored. Note that you may rely on path templatization. | Path = "hdfs://path/to/data" | |
Format | No | The format used to read or write the data. | Format = "csv" | The Spark configuration default: spark.sql.sources.default |
Schema | No | The schema of the input data. See the schema definitions page for more information. | Schema = "myproject.models.MySchema" | |
DateFilter | No | Pre-filters the input to focus on a specific date range. | | |
Repartition | No | Matches the Spark Dataset repartition function, by number of partitions, by columns, or both. At least one of Number or Columns is mandatory. | Repartition { Number = 10, Columns = "upd_date" } | |
Coalesce | No | Matches the Spark Dataset coalesce function. | Coalesce = 10 | |
Options | No | Spark options, as key = value pairs. See the official Spark API documentation for DataFrameReader and DataFrameWriter for the available options. | Options { header = true } | |
The DateFilter field is never mandatory, but be aware that omitting it could result in processing years of data.
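Putting the common fields together, a storage component configuration might look like the following sketch. Only the field names come from the table above; the node name `MyInput` and the surrounding structure are illustrative assumptions, and how the node is wired into a pipeline depends on your overall configuration:

```hocon
// Hypothetical node name; fields are the common storage fields documented above.
MyInput {
  Type = "com.amadeus.dataio.pipes.storage.batch.StorageInput"
  Path = "hdfs://path/to/data"
  Format = "csv"
  Schema = "myproject.models.MySchema"
  // Repartition by both a partition count and a column.
  Repartition { Number = 10, Columns = "upd_date" }
  // Spark reader options, passed through as key = value pairs.
  Options { header = true }
}
```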
## Batch

### Input

Type: `com.amadeus.dataio.pipes.storage.batch.StorageInput`
### Output

Type: `com.amadeus.dataio.pipes.storage.batch.StorageOutput`
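A batch pipeline typically pairs one of each. The sketch below is an assumed shape (node names `Input` and `Output` are placeholders, not prescribed by this page); the Type values are the batch classes documented above and the fields are the common storage fields:

```hocon
// Hypothetical batch read/write pair using the batch storage components.
Input {
  Type = "com.amadeus.dataio.pipes.storage.batch.StorageInput"
  Path = "hdfs://path/to/input"
  Format = "csv"
  Options { header = true }
}

Output {
  Type = "com.amadeus.dataio.pipes.storage.batch.StorageOutput"
  Path = "hdfs://path/to/output"
  Format = "parquet"
  // Reduce the number of output files without a full shuffle.
  Coalesce = 10
}
```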
## Streaming

### Input

Type: `com.amadeus.dataio.pipes.storage.streaming.StorageInput`
### Output

Type: `com.amadeus.dataio.pipes.storage.streaming.StorageOutput`
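The streaming components follow the same configuration shape, with only the Type changing. The sketch below is an assumption: node names are placeholders, and the `checkpointLocation` option is a standard Spark Structured Streaming option passed through the Options field rather than a field defined by this framework:

```hocon
// Hypothetical streaming pair; only Type differs from the batch configuration.
Input {
  Type = "com.amadeus.dataio.pipes.storage.streaming.StorageInput"
  Path = "hdfs://path/to/input"
  Format = "json"
}

Output {
  Type = "com.amadeus.dataio.pipes.storage.streaming.StorageOutput"
  Path = "hdfs://path/to/output"
  Format = "parquet"
  // Spark streaming option (assumed pass-through via Options).
  Options { checkpointLocation = "hdfs://path/to/checkpoints" }
}
```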