Spark

Allows the use of the default Spark I/O features in batch and streaming.

Code repository: https://github.com/AmadeusITGroup/dataio-framework/tree/main/src/main/scala/com/amadeus/dataio/pipes/spark

Common

The following fields are available for all Spark components:

| Name | Mandatory | Description | Example | Default |
|------|-----------|-------------|---------|---------|
| path | No | The directory where the data is stored. Note that you may rely on path templatization. | path = "hdfs://path/to/data" | |
| table | No | The table name to read the data from, particularly useful when using Unity Catalog. | table = "myproject.mytable" | |
| format | No | The format to use to read the data. | format = "csv" | The default value set in the Spark configuration: spark.sql.sources.default |
| schema | No | The schema of the input data. See the schema definitions page for more information. | schema = "myproject.models.MySchema" | |
| date_filter | No | Pre-filters the input to focus on a specific date range. See the date filters page for more information. | | |
| repartition | No | Matches the Spark Dataset repartition function, either by number, expressions or both. One argument, either `exprs` or `num`, is mandatory. | repartition { num = 10, exprs = "upd_date" } | |
| coalesce | No | Matches the Spark Dataset coalesce function. | coalesce = 10 | |
| options | No | Spark options, as key = value pairs. The list of available options can be found in the official Spark API documentation for the DataFrameReader. | options { header = true } | |

The date_filter field is never mandatory, but be aware that omitting it could result in processing years of data.
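
For illustration, a single batch input combining several of these fields could be configured as below. This is a minimal sketch: the `input` wrapper node and the way the component type is declared are assumptions inferred from the Batch section below, and the path and schema values are placeholders.

```hocon
input {
  # Fully qualified component type (assumed; see the Batch section below).
  type = "com.amadeus.dataio.pipes.spark.batch.StorageInput"

  # Common Spark fields described in the table above (placeholder values).
  path   = "hdfs://path/to/data"
  format = "csv"
  schema = "myproject.models.MySchema"

  # Repartition by both number of partitions and expression.
  repartition {
    num   = 10
    exprs = "upd_date"
  }

  # Spark reader options, as key = value pairs.
  options {
    header = true
  }
}
```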


Batch

Input

Type: com.amadeus.dataio.pipes.spark.batch.StorageInput

Output

Type: com.amadeus.dataio.pipes.spark.batch.StorageOutput
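
As a hedged sketch, a batch pipeline reading from one location and writing to another might declare both components as follows; the `input` and `output` node names and the paths are illustrative assumptions, not part of the framework's documented contract.

```hocon
# Batch read using the common fields (placeholder paths).
input {
  type   = "com.amadeus.dataio.pipes.spark.batch.StorageInput"
  path   = "hdfs://path/to/input"
  format = "parquet"
}

# Batch write to another location.
output {
  type   = "com.amadeus.dataio.pipes.spark.batch.StorageOutput"
  path   = "hdfs://path/to/output"
  format = "parquet"
}
```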


Streaming

Input

Type: com.amadeus.dataio.pipes.spark.streaming.StorageInput

Output

Type: com.amadeus.dataio.pipes.spark.streaming.StorageOutput

In addition to the common fields above, the following fields are available:

| Name | Mandatory | Description | Example | Default |
|------|-----------|-------------|---------|---------|
| trigger | No | Sets the trigger of the streaming query, controlling Spark's trigger() function. Can be AvailableNow, Continuous or left empty. If no trigger is defined, a ProcessingTime trigger is set. | trigger = "AvailableNow" | |
| duration | No | The trigger interval used by the ProcessingTime trigger. | duration = "60 seconds" | |
| timeout | Yes | The amount of time, in hours, before returning from the streaming query. | timeout = 24 | |
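
As a final sketch, a streaming input/output pair using the fields above might be configured as follows; again, the wrapper node names and paths are illustrative assumptions. Since no trigger is set on the output, a ProcessingTime trigger firing at the configured duration is used, and the query returns after the 24-hour timeout.

```hocon
# Streaming read (placeholder path).
input {
  type   = "com.amadeus.dataio.pipes.spark.streaming.StorageInput"
  path   = "hdfs://path/to/input"
  format = "parquet"
}

# Streaming write with trigger and timeout settings.
output {
  type     = "com.amadeus.dataio.pipes.spark.streaming.StorageOutput"
  path     = "hdfs://path/to/output"
  format   = "parquet"

  # No trigger is defined, so a ProcessingTime trigger with this interval is used.
  duration = "60 seconds"

  # Return from the streaming query after 24 hours.
  timeout  = 24
}
```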