Spark
Allows the use of the default Spark I/O features in both batch and streaming mode.
Code repository: https://github.com/AmadeusITGroup/PyDataIO/tree/main/src/PyDataIO/io/pipes/spark
Common
The following fields are available for all Spark components:
| Name | Mandatory | Description | Example | Default |
|---|---|---|---|---|
| path | No | The directory where the data is stored. | path: "hdfs://path/to/data" | |
| format | No | The format to use to read the data. | format: "csv" | The value set as default in the Spark configuration: spark.sql.sources.default |
| schema | No | The schema of the input data. See the schema definitions page for more information. | schema: "my-schema" | |
| options | No | Spark options, as key-value pairs. The list of available options is in the official Spark API documentation for the DataFrameReader. | options:<br>header: true<br>failOnDataLoss: false | |
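For illustration, here is a minimal sketch of these common fields as they might appear in a component declaration. Only the field names and example values come from the table above; the exact nesting inside a job configuration is an assumption and may differ in your setup.

```yaml
# Hypothetical component declaration; the enclosing job structure is assumed.
path: "hdfs://path/to/data"   # directory where the data is stored
format: "csv"                 # falls back to spark.sql.sources.default when omitted
schema: "my-schema"           # see the schema definitions page
options:                      # forwarded as-is to the Spark DataFrameReader
  header: true
  failOnDataLoss: false
```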
Batch
Input
Type: pipes.spark.batch.SparkInput
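A hedged example of a batch input combining this type with the common fields described above (the surrounding `input:` key is an assumption about the job layout; `inferSchema` is a standard Spark CSV reader option shown for illustration):

```yaml
input:
  type: pipes.spark.batch.SparkInput
  path: "hdfs://path/to/input"
  format: "csv"
  options:
    header: true       # treat the first line of each file as a header
    inferSchema: true  # let Spark infer column types from the data
```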
Output
Type: pipes.spark.batch.SparkOutput
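Similarly, a sketch of a batch output; the `output:` key is assumed, and only the common fields documented above are used (`compression` is a standard Spark Parquet writer option shown for illustration):

```yaml
output:
  type: pipes.spark.batch.SparkOutput
  path: "hdfs://path/to/output"
  format: "parquet"
  options:
    compression: "snappy"  # Parquet writer compression codec
```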
Streaming
Input
Type: pipes.spark.streaming.SparkInput
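A hedged sketch of a streaming input built from the common fields (the `input:` key is again an assumption; `maxFilesPerTrigger` is a standard Spark file-stream option shown for illustration):

```yaml
input:
  type: pipes.spark.streaming.SparkInput
  path: "hdfs://path/to/stream"
  format: "parquet"
  options:
    maxFilesPerTrigger: 10  # cap the number of new files read per micro-batch
```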
Output
Type: pipes.spark.streaming.SparkOutput
| Name | Mandatory | Description | Example | Default |
|---|---|---|---|---|
| trigger | Yes | Sets the trigger of the streaming query. One of ProcessingTimeTrigger, OneTimeTrigger, or AvailableNowTrigger; controls Spark's trigger() function (see pydataio.io.utils.triggers). | trigger: "AvailableNow" | |
| duration | No | Trigger interval; used with the ProcessingTime trigger. | duration: "60 seconds" | |
| timeout | Yes | The amount of time to wait before returning from the streaming query. | timeout: "2 minutes" | |
| sinkProcessor | No | Fully qualified name of a sink processor to be used for the [foreachBatch sink](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamWriter.foreachBatch.html). | sinkProcessor: "getting.started.MySinkProcessor" | |
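Putting it together, a hedged sketch of a streaming output declaration (the `output:` key is an assumption about the job layout; `checkpointLocation` is a standard Spark structured-streaming option shown for illustration, and the sinkProcessor line is optional):

```yaml
output:
  type: pipes.spark.streaming.SparkOutput
  path: "hdfs://path/to/output"
  format: "parquet"
  trigger: "ProcessingTime"   # fire a micro-batch at a fixed interval
  duration: "60 seconds"      # trigger interval, used by the ProcessingTime trigger
  timeout: "2 minutes"        # return from the streaming query after this long
  options:
    checkpointLocation: "hdfs://path/to/checkpoints"
  # Optionally delegate writing to a custom foreachBatch sink processor:
  # sinkProcessor: "getting.started.MySinkProcessor"
```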