Main Concepts
This page provides an overview of the key aspects that form the foundation of the Data I/O framework. Understanding these concepts is essential for effectively using the framework to design and implement your data processing pipelines.
Building Blocks
Pipeline
The pipeline is the central component of the Data I/O framework. It is responsible for instantiating the components defined in the configuration and running the transformer. You can easily define and configure your pipeline using the provided configuration file, allowing you to focus on the essential task of data transformation.
For more information about how to configure your applications, visit the configuration page.
Pipes
Pipes are the fundamental components of the PyData I/O framework, representing the sources and destinations of data within the pipeline. Input pipes handle reading data from various sources, while output pipes handle writing transformed data to different destinations. By configuring inputs and outputs in the provided configuration file, you define the specific data sources and destinations for your ETL pipelines. PyData I/O provides flexible and extensible options for handling various file formats, databases, and streaming data sources.
In your code, pipes are closely related to Spark’s reading and writing capabilities. When configuring your application, you’ll specify the source or destination of your data, and PyData I/O will use Spark’s underlying APIs to perform the actual reading and writing.
No matter where data comes from or goes, each input pipe returns a DataFrame and each output pipe receives a DataFrame to write.
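To make this concrete, the sketch below shows the kind of Spark calls an input or output pipe conceptually wraps. The function names, file formats, and paths are purely illustrative assumptions, not part of the PyData I/O API; in a real application the framework derives the equivalent calls from your configuration.
from pyspark.sql import DataFrame, SparkSession

# Illustrative only: an input pipe conceptually wraps a Spark read...
def read_csv_input(spark: SparkSession, path: str) -> DataFrame:
    return spark.read.option("header", "true").csv(path)

# ...and an output pipe conceptually wraps a Spark write.
def write_parquet_output(df: DataFrame, path: str) -> None:
    df.write.mode("overwrite").parquet(path)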
JobConfig
The JobConfig plays a crucial role in the pipeline by providing access to the handlers responsible for interacting with pipes. Through the JobConfig, you can access the input and output pipes, enabling seamless interaction with the corresponding components defined in your configuration file.
The JobConfig is injected into the featurize method of your transformer.
class MyDataTransformer(Transformer):
    def featurize(self, jobConfig: JobConfig, spark: SparkSession, additionalArgs: dict = None):
        # Read, transform and write
        ...
Transformers
Transformers play a vital role in the PyData I/O framework by encapsulating the data transformation logic within the pipeline. A transformer is a custom class that you define to implement the specific data processing steps required for your ETL pipelines. It represents a single stage or operation in the pipeline and is responsible for manipulating the data according to your business requirements.
When the pipeline is executed, it calls the transformer's featurize method, which contains the custom transformation logic. Inside it, you can access and manipulate the data through the provided JobConfig, which exposes the input and output pipes.
Inside your transformer, you can easily modularize and organize your data transformation steps, making the pipeline more testable and maintainable. Each function in the transformer can be designed to handle a specific data processing task, such as cleaning, aggregating, joining, or any other required transformations.
class MyDataTransformer(Transformer):
    def featurize(self, jobConfig: JobConfig, spark: SparkSession, additionalArgs: dict = None):
        # Access input data
        data = jobConfig.load(inputName="my-input", spark=spark)

        # Perform data transformation
        transformed_data = self.transformData(data)

        # Write transformed data to output
        jobConfig.writer.save(data=transformed_data)

    def transformData(self, inputData: DataFrame) -> DataFrame:
        # Your custom data transformation logic here
        # Example: Perform data cleansing, filtering, or aggregations
        # Return the transformed DataFrame
        ...
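Because transformData is an ordinary method on the transformer, it can be exercised in isolation with a local SparkSession, which is what makes the pipeline easy to test. The snippet below is a minimal sketch: the module path in the import, the input schema, and the assertion are illustrative assumptions to adapt to your own transformation logic.
from pyspark.sql import SparkSession

from my_package.transformers import MyDataTransformer  # hypothetical module path

def test_transform_data():
    # A small local SparkSession is enough for unit tests
    spark = SparkSession.builder.master("local[1]").appName("transformer-test").getOrCreate()

    # Build a tiny input DataFrame; the schema here is only an example
    input_df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

    result = MyDataTransformer().transformData(input_df)

    # Replace with assertions that reflect your actual transformation logic
    assert result.count() == 2

    spark.stop()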
For a detailed guide on how to write transformers for your applications, visit the transformer page.
Lifecycle of a PyData I/O application
flowchart LR
conf(application.conf)
pipeline(Data I/O Pipeline)
pipeline-->|1 reads and parses|conf
accessor(JobConfig)
pipeline-->|2 creates|accessor
pipes(Pipes)
accessor-->|2 creates|pipes
accessor-. exposes .->pipes
transformer(Transformer)
pipeline-->|3 runs|transformer
transformer-. uses .->accessor
When the application starts, the pipeline reads and parses the configuration file. From this, it triggers a series of instantiations that result in the creation of the JobConfig.
Finally, the pipeline instantiates and runs the transformer, passing it the JobConfig it created.
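In code, the lifecycle boils down to three steps, sketched below. The helper names parse_configuration and create_job_config are stand-ins invented for this illustration and are not part of the PyData I/O API; only the featurize call mirrors the examples above.
from pyspark.sql import SparkSession

# Stand-in helpers for illustration only; they are not PyData I/O classes.
def parse_configuration(path: str) -> dict:
    # In the real framework this is the parsed application.conf;
    # here it is just a placeholder dictionary.
    return {"config_file": path}

def create_job_config(configuration: dict) -> "JobConfig":
    # In the real framework this step instantiates the pipes declared in the
    # configuration and wraps them in a JobConfig.
    ...

def run_pipeline(config_path: str, transformer: "Transformer", spark: SparkSession) -> None:
    configuration = parse_configuration(config_path)         # 1. read and parse the configuration
    jobConfig = create_job_config(configuration)             # 2. create the JobConfig and its pipes
    transformer.featurize(jobConfig=jobConfig, spark=spark)  # 3. run the transformer with the JobConfig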