# Spark
Allows the use of the default Spark I/O features in batch and streaming.
Code repository: https://github.com/AmadeusITGroup/dataio-framework/tree/main/src/main/scala/com/amadeus/dataio/pipes/spark
## Common

The following fields are available for all Spark components:
Name | Mandatory | Description | Example | Default |
---|---|---|---|---|
path | No | The directory where the data is stored. Note that you may rely on path templatization. | path = "hdfs://path/to/data" | |
table | No | The table name for reading data, particularly useful when using Unity Catalog. | table = "myproject.mytable" | |
format | No | The format to use to read the data. | format = "csv" | Spark's default data source, as set by spark.sql.sources.default |
schema | No | The schema of the input data. See the schema definitions page for more information. | schema = "myproject.models.MySchema" | |
date_filter | No | Pre-filters the input to focus on a specific date range. See the date filters page for more information. | | |
repartition | No | Matches the Spark Dataset repartition function, by number of partitions, expressions, or both. One of `num` or `exprs` is mandatory. | repartition { num = 10, exprs = "upd_date" } | |
coalesce | No | Matches the Spark Dataset coalesce function. | coalesce = 10 | |
options | No | Spark options, as key = value pairs. See the official Spark API documentation for the DataFrameReader for the list of available options. | options { header = true } | |
The `date_filter` field is never mandatory, but be aware that omitting it could result in processing years of data.
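To make the table above concrete, here is a sketch of a component configuration combining several common fields. The HOCON syntax follows the Example column above, but the enclosing block name and the placement of the `type` field (detailed in the Batch and Streaming sections below) are assumptions that may differ in your pipeline configuration:

```hocon
// Hypothetical component block; the exact enclosing structure
// depends on how your pipeline configuration is laid out.
my-input {
  type   = "com.amadeus.dataio.pipes.spark.batch.StorageInput" // see Batch > Input below
  path   = "hdfs://path/to/data"
  format = "csv"
  schema = "myproject.models.MySchema"
  options {
    header = true
  }
  repartition {
    num   = 10
    exprs = "upd_date"
  }
}
```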
## Batch

### Input

Type: `com.amadeus.dataio.pipes.spark.batch.StorageInput`
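As a sketch, a minimal batch input could be declared as follows; the enclosing `input` block name is an assumption:

```hocon
input {
  type   = "com.amadeus.dataio.pipes.spark.batch.StorageInput"
  path   = "hdfs://path/to/input"
  format = "parquet"
}
```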
### Output

Type: `com.amadeus.dataio.pipes.spark.batch.StorageOutput`
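Similarly, a minimal batch output sketch, under the same assumption about the enclosing block name:

```hocon
output {
  type     = "com.amadeus.dataio.pipes.spark.batch.StorageOutput"
  path     = "hdfs://path/to/output"
  format   = "parquet"
  coalesce = 10 // optional: reduce the number of output files
}
```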
## Streaming

### Input

Type: `com.amadeus.dataio.pipes.spark.streaming.StorageInput`
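A streaming input sketch, again assuming the enclosing block name; `maxFilesPerTrigger` is a standard Spark file-source option passed through `options`:

```hocon
input {
  type   = "com.amadeus.dataio.pipes.spark.streaming.StorageInput"
  path   = "hdfs://path/to/stream"
  format = "json"
  options {
    maxFilesPerTrigger = 100 // standard Spark file-source rate limit
  }
}
```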
### Output

Type: `com.amadeus.dataio.pipes.spark.streaming.StorageOutput`
Name | Mandatory | Description | Example | Default |
---|---|---|---|---|
trigger | No | Sets the trigger for the streaming query, mapping to Spark's trigger() function. Can be AvailableNow or Continuous. If omitted, a ProcessingTime trigger is used. | trigger = "AvailableNow" | |
duration | No | The trigger interval, used by the ProcessingTime trigger. | duration = "60 seconds" | |
timeout | Yes | The amount of time before returning from the streaming query, in hours. | timeout = 24 | |
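Putting the streaming-specific fields together, here is a sketch of a streaming output; the enclosing block name is an assumption, and since no `trigger` is set, `duration` drives the ProcessingTime trigger:

```hocon
output {
  type     = "com.amadeus.dataio.pipes.spark.streaming.StorageOutput"
  path     = "hdfs://path/to/output"
  format   = "parquet"
  duration = "60 seconds" // ProcessingTime interval, used because no trigger is set
  timeout  = 24           // hours before returning from the streaming query
}
```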