Spark

Allows the use of the default Spark I/O features in batch and streaming.

Code repository: https://github.com/AmadeusITGroup/PyDataIO/tree/main/src/PyDataIO/io/pipes/spark

Useful links:


Common

The following fields are available for all Spark components:

| Name | Mandatory | Description | Example | Default |
|---|---|---|---|---|
| path | No | The directory where the data is stored. | `path: "hdfs://path/to/data"` | |
| format | No | The format to use to read the data. | `format: "csv"` | The value of the Spark configuration property `spark.sql.sources.default` |
| schema | No | The schema of the input data. See the schema definitions page for more information. | `schema: "my-schema"` | |
| options | No | Spark options, as key = value pairs. The list of available options is in the official Spark API documentation for the `DataFrameReader`. | `options:`<br>`  header: true`<br>`  failOnDataLoss: false` | |
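
As a combined illustration, the common fields might be declared together on a single Spark component as in the sketch below. The `type` value comes from the Batch Input section further down; the top-level `input:` key and the overall nesting are assumptions about the job configuration layout, not something documented on this page.

```yaml
# Minimal sketch of the common fields on a Spark batch input.
# The `input:` wrapper key is hypothetical; path, format, schema and options
# are the fields documented in the table above.
input:
  type: pipes.spark.batch.SparkInput
  path: "hdfs://path/to/data"       # directory where the data is stored
  format: "csv"                     # falls back to spark.sql.sources.default when omitted
  schema: "my-schema"               # see the schema definitions page
  options:                          # forwarded to the Spark DataFrameReader
    header: true
    failOnDataLoss: false
```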

Batch

Input

Type: pipes.spark.batch.SparkInput

Output

Type: pipes.spark.batch.SparkOutput
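
A batch pipe typically pairs the two types above, one reading and one writing through the standard Spark reader/writer. The sketch below is illustrative only: the `input:`/`output:` wrapper keys and the paths are assumptions, the `type` values and field names come from this page, and `compression` is a standard Spark Parquet writer option used here as an example.

```yaml
# Hedged sketch of a batch pipe: SparkInput reads, SparkOutput writes.
# Wrapper keys and paths are hypothetical placeholders.
input:
  type: pipes.spark.batch.SparkInput
  path: "hdfs://path/to/data"
  format: "parquet"
output:
  type: pipes.spark.batch.SparkOutput
  path: "hdfs://path/to/output"     # hypothetical destination directory
  format: "parquet"
  options:
    compression: "snappy"           # standard Spark Parquet writer option, shown as an example
```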


Streaming

Input

Type: pipes.spark.streaming.SparkInput

Output

Type: pipes.spark.streaming.SparkOutput

| Name | Mandatory | Description | Example | Default |
|---|---|---|---|---|
| trigger | Yes | Sets the trigger for the streaming query. Can be `ProcessingTimeTrigger`, `OneTimeTrigger`, or `AvailableNowTrigger`. Controls the `trigger()` Spark function (see `pydataio.io.utils.triggers`). | `trigger: "AvailableNow"` | |
| duration | No | Only used with the `ProcessingTime` trigger. | `duration: "60 seconds"` | |
| timeout | Yes | The amount of time to wait before returning from the streaming query. | `timeout: "2 minutes"` | |
| sinkProcessor | No | Fully qualified name of a sink processor to be used for the [for each batch sink](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamWriter.foreachBatch.html). | `sinkProcessor: "getting.started.MySinkProcessor"` | |
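
Putting the streaming fields together, a streaming pipe could be configured as sketched below. The field names and example values are taken from the tables on this page; the `input:`/`output:` wrapper keys and the output path are assumptions about the surrounding configuration layout.

```yaml
# Hedged sketch of a streaming pipe using the fields documented above.
# Wrapper keys and paths are hypothetical placeholders.
input:
  type: pipes.spark.streaming.SparkInput
  path: "hdfs://path/to/data"
  format: "parquet"
output:
  type: pipes.spark.streaming.SparkOutput
  path: "hdfs://path/to/output"                      # hypothetical destination directory
  format: "parquet"
  trigger: "AvailableNow"                            # see pydataio.io.utils.triggers
  # duration: "60 seconds"                           # only needed with the ProcessingTime trigger
  timeout: "2 minutes"                               # how long to wait before returning from the query
  sinkProcessor: "getting.started.MySinkProcessor"   # optional foreachBatch hook
```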