Spark
Allows the use of the default Spark I/O features in both batch and streaming mode.
Code repository: https://github.com/AmadeusITGroup/PyDataIO/tree/main/src/PyDataIO/io/pipes/spark
Common
The following fields are available for all Spark components:
| Name | Mandatory | Description | Example | Default |
|---|---|---|---|---|
| path | No | The directory where the data is stored. | path: "hdfs://path/to/data" | |
| format | No | The format to use to read the data. | format: "csv" | The value set as default in the Spark configuration: spark.sql.sources.default |
| schema | No | The schema of the input data. See the schema definitions page for more information. | schema: "my-schema" | |
| options | No | Spark options, as key-value pairs. The list of available options is in the official Spark API documentation for the DataFrameReader. | options:<br>header: true<br>failOnDataLoss: false | |
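For illustration, here is a minimal sketch of these common fields as they might appear in a component declaration. Only the field names and example values come from the table above; the exact nesting inside a job configuration is an assumption and may differ in your setup.

```yaml
# Hypothetical component declaration; the enclosing job structure is assumed.
path: "hdfs://path/to/data"   # directory where the data is stored
format: "csv"                 # falls back to spark.sql.sources.default when omitted
schema: "my-schema"           # see the schema definitions page
options:                      # forwarded as-is to the Spark DataFrameReader
  header: true
  failOnDataLoss: false
```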
Batch
Input
Type: pipes.spark.batch.SparkInput
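A hedged example of a batch input combining this type with the common fields described above (the surrounding `input:` key is an assumption about the job layout; `inferSchema` is a standard Spark CSV reader option shown for illustration):

```yaml
input:
  type: pipes.spark.batch.SparkInput
  path: "hdfs://path/to/input"
  format: "csv"
  options:
    header: true       # treat the first line of each file as a header
    inferSchema: true  # let Spark infer column types from the data
```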
Output
Type: pipes.spark.batch.SparkOutput
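Similarly, a sketch of a batch output; the `output:` key is assumed, and only the common fields documented above are used (`compression` is a standard Spark Parquet writer option shown for illustration):

```yaml
output:
  type: pipes.spark.batch.SparkOutput
  path: "hdfs://path/to/output"
  format: "parquet"
  options:
    compression: "snappy"  # Parquet writer compression codec
```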
Streaming
Input
Type: pipes.spark.streaming.SparkInput
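A hedged sketch of a streaming input built from the common fields (the `input:` key is again an assumption; `maxFilesPerTrigger` is a standard Spark file-stream option shown for illustration):

```yaml
input:
  type: pipes.spark.streaming.SparkInput
  path: "hdfs://path/to/stream"
  format: "parquet"
  options:
    maxFilesPerTrigger: 10  # cap the number of new files read per micro-batch
```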
Output
Type: pipes.spark.streaming.SparkOutput
| Name | Mandatory | Description | Example | Default |
|---|---|---|---|---|
| trigger | Yes | Sets the trigger of the streaming query. One of ProcessingTimeTrigger, OneTimeTrigger, or AvailableNowTrigger; controls Spark's trigger() function (see pydataio.io.utils.triggers). | trigger: "AvailableNow" | |
| duration | No | Trigger interval; used with the ProcessingTime trigger. | duration: "60 seconds" | |
| timeout | Yes | The amount of time to wait before returning from the streaming query. | timeout: "2 minutes" | |
| sinkProcessor | No | Fully qualified name of a sink processor to be used for the [foreachBatch sink](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamWriter.foreachBatch.html). | sinkProcessor: "getting.started.MySinkProcessor" | |
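Putting it together, a hedged sketch of a streaming output declaration (the `output:` key is an assumption about the job layout; `checkpointLocation` is a standard Spark structured-streaming option shown for illustration, and the sinkProcessor line is optional):

```yaml
output:
  type: pipes.spark.streaming.SparkOutput
  path: "hdfs://path/to/output"
  format: "parquet"
  trigger: "ProcessingTime"   # fire a micro-batch at a fixed interval
  duration: "60 seconds"      # trigger interval, used by the ProcessingTime trigger
  timeout: "2 minutes"        # return from the streaming query after this long
  options:
    checkpointLocation: "hdfs://path/to/checkpoints"
  # Optionally delegate writing to a custom foreachBatch sink processor:
  # sinkProcessor: "getting.started.MySinkProcessor"
```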