# Spark
Allows the use of the default Spark I/O features in batch and streaming.
Code repository: https://github.com/AmadeusITGroup/dataio-framework/tree/main/src/main/scala/com/amadeus/dataio/pipes/spark
## Common

The following fields are available for all Spark components:
Name | Mandatory | Description | Example | Default |
---|---|---|---|---|
path | No | The directory where the data is stored. Note that you may rely on path templatization. | path = "hdfs://path/to/data" | |
table | No | The table name for reading data, particularly useful when using Unity Catalog. | table = "myproject.mytable" | |
format | No | The format to use to read the data. | format = "csv" | Spark's default data source, as set by spark.sql.sources.default |
schema | No | The schema of the input data. See the schema definitions page for more information. | schema = "myproject.models.MySchema" | |
date_filter | No | Pre-filters the input to focus on a specific date range. See the date filters page for more information. | | |
repartition | No | Matches the Spark Dataset repartition function, by number of partitions, expressions, or both. One of `num` or `exprs` is mandatory. | repartition { num = 10, exprs = "upd_date" } | |
coalesce | No | Matches the Spark Dataset coalesce function. | coalesce = 10 | |
options | No | Spark options, as key = value pairs. See the official Spark API documentation for the DataFrameReader for the list of available options. | options { header = true } | |
The `date_filter` field is never mandatory, but be aware that omitting it could result in processing years of data.
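To make the table above concrete, here is a sketch of a component configuration combining several common fields. The HOCON syntax follows the Example column above, but the enclosing block name and the placement of the `type` field (detailed in the Batch and Streaming sections below) are assumptions that may differ in your pipeline configuration:

```hocon
// Hypothetical component block; the exact enclosing structure
// depends on how your pipeline configuration is laid out.
my-input {
  type   = "com.amadeus.dataio.pipes.spark.batch.StorageInput" // see Batch > Input below
  path   = "hdfs://path/to/data"
  format = "csv"
  schema = "myproject.models.MySchema"
  options {
    header = true
  }
  repartition {
    num   = 10
    exprs = "upd_date"
  }
}
```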
## Batch

### Input

Type: `com.amadeus.dataio.pipes.spark.batch.StorageInput`
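As a sketch, a minimal batch input could be declared as follows; the enclosing `input` block name is an assumption:

```hocon
input {
  type   = "com.amadeus.dataio.pipes.spark.batch.StorageInput"
  path   = "hdfs://path/to/input"
  format = "parquet"
}
```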
### Output

Type: `com.amadeus.dataio.pipes.spark.batch.StorageOutput`
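Similarly, a minimal batch output sketch, under the same assumption about the enclosing block name:

```hocon
output {
  type     = "com.amadeus.dataio.pipes.spark.batch.StorageOutput"
  path     = "hdfs://path/to/output"
  format   = "parquet"
  coalesce = 10 // optional: reduce the number of output files
}
```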
## Streaming

### Input

Type: `com.amadeus.dataio.pipes.spark.streaming.StorageInput`
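A streaming input sketch, again assuming the enclosing block name; `maxFilesPerTrigger` is a standard Spark file-source option passed through `options`:

```hocon
input {
  type   = "com.amadeus.dataio.pipes.spark.streaming.StorageInput"
  path   = "hdfs://path/to/stream"
  format = "json"
  options {
    maxFilesPerTrigger = 100 // standard Spark file-source rate limit
  }
}
```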
### Output

Type: `com.amadeus.dataio.pipes.spark.streaming.StorageOutput`
Name | Mandatory | Description | Example | Default |
---|---|---|---|---|
trigger | No | Sets the trigger for the streaming query, mapping to Spark's trigger() function. Can be AvailableNow or Continuous. If omitted, a ProcessingTime trigger is used. | trigger = "AvailableNow" | |
duration | No | The trigger interval, used by the ProcessingTime trigger. | duration = "60 seconds" | |
timeout | Yes | The amount of time before returning from the streaming query, in hours. | timeout = 24 | |
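Putting the streaming-specific fields together, here is a sketch of a streaming output; the enclosing block name is an assumption, and since no `trigger` is set, `duration` drives the ProcessingTime trigger:

```hocon
output {
  type     = "com.amadeus.dataio.pipes.spark.streaming.StorageOutput"
  path     = "hdfs://path/to/output"
  format   = "parquet"
  duration = "60 seconds" // ProcessingTime interval, used because no trigger is set
  timeout  = 24           // hours before returning from the streaming query
}
```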