Contributor Guide

Technical overview

Once registered, PerfGazer listens to multiple events emitted by Spark.

Some event objects at query/job/stage level are stored in memory for later processing. These events are wrapped by subtypes of Event; they are mostly start events, with some exceptions. They are kept in a CappedConcurrentHashMap, whose maximum size bounds memory usage. The wrapped Spark events relate to classes such as:

  • org.apache.spark...StageInfo
  • org.apache.spark...SparkListenerJobEnd
  • ...
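The size-capped map mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual CappedConcurrentHashMap; in particular, the eviction order here is arbitrary, and all names are made up for the example:

```scala
import java.util.concurrent.ConcurrentHashMap

// Sketch of a concurrent map that evicts an arbitrary entry once a maximum
// size is exceeded, so memory usage stays bounded (illustrative only).
class CappedMap[K, V](maxSize: Int) {
  private val underlying = new ConcurrentHashMap[K, V]()

  def put(key: K, value: V): Unit = {
    underlying.put(key, value)
    // Evict entries while over capacity; ConcurrentHashMap has no insertion
    // order, so which entry goes is arbitrary in this sketch.
    while (underlying.size() > maxSize) {
      val it = underlying.keySet().iterator()
      if (it.hasNext) underlying.remove(it.next())
    }
  }

  def get(key: K): Option[V] = Option(underlying.get(key))
  def size: Int = underlying.size()
}
```

A real implementation would typically pick a deterministic eviction policy (for instance, dropping the oldest entries) so the most recent events survive.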

When a SQL query, a job, a stage, or a task finishes, it triggers a callback mechanism.
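The callback pattern can be sketched in a dependency-free form as a listener trait plus an event bus. The names below are illustrative, not Spark's or PerfGazer's actual API:

```scala
// Illustrative sketch of the end-of-execution callback mechanism: listeners
// register with a bus, which dispatches each finished-event to the matching
// callback. All types here are invented for the example.
sealed trait ExecEvent
final case class QueryEnd(id: Long) extends ExecEvent
final case class JobEnd(id: Int)    extends ExecEvent
final case class StageEnd(id: Int)  extends ExecEvent

trait ExecListener {
  def onQueryEnd(e: QueryEnd): Unit = ()
  def onJobEnd(e: JobEnd): Unit = ()
  def onStageEnd(e: StageEnd): Unit = ()
}

class ListenerBus {
  private var listeners = List.empty[ExecListener]
  def register(l: ExecListener): Unit = listeners ::= l
  def post(e: ExecEvent): Unit = listeners.foreach { l =>
    e match {
      case q: QueryEnd => l.onQueryEnd(q)
      case j: JobEnd   => l.onJobEnd(j)
      case s: StageEnd => l.onStageEnd(s)
    }
  }
}
```

In the real integration, Spark's listener bus plays the role of ListenerBus and PerfGazer plays the role of the registered listener.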

When results are requested from PerfGazer, all collected Events are inspected and transformed into Reports at the end of the query/job/stage execution, enriched with extra information that only becomes available at that point, according to the type of Event.

A Report is a type that represents the report unit shared with the end-user.

A Filter operates on Reports, giving the end-user control to focus on specific aspects of their Spark ETL (file pruning, for instance).
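Put together, the Report/Filter idea can be sketched like this. The fields and the "file-pruning" kind are made up for the example and are not the actual PerfGazer types:

```scala
// Illustrative only: a Report as the user-facing unit, and a Filter as a
// predicate over Reports that narrows the output to one aspect of a run.
final case class Report(kind: String, description: String)

val filePruningOnly: Report => Boolean = _.kind == "file-pruning"

val reports = Seq(
  Report("file-pruning", "partition filter skipped some files"),
  Report("shuffle", "stage 3 shuffled a large amount of data")
)

// The end-user applies the filter to focus on file pruning reports only.
val focused = reports.filter(filePruningOnly)
```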

Build

The project uses sbt.

sbt test                           # run tests
sbt coverageOn test coverageReport # run tests with coverage checks on

Dev environment

We use IntelliJ IDEA; you can update the ScalaTest Configuration Template to avoid manual settings.

Go to Run -> Edit Configurations -> Edit configuration templates -> ScalaTest 

For code formatting setup:

Settings -> Editor -> Code Style -> Scala -> Formatter: ScalaFMT

Run

You can run a local spark-shell with the listener as follows:

# (optional) remove previous local publishes, for example:
find ~/.ivy2 -type f -name '*perfgazer*' | xargs rm
# publish a local snapshot version
sbt publishLocal
# run spark shell with the listener (change the version accordingly) using the snippet provided above
spark-shell --packages io.github.amadeusitgroup:perfgazer_spark_3.5.2_2.12:0.0.2-SNAPSHOT ...
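If you need to attach the listener explicitly, Spark's standard spark.extraListeners setting is the usual mechanism. The class name below is a placeholder; use the one from the registration snippet:

```shell
# spark.extraListeners is a standard Spark setting; replace the placeholder
# with the listener class from the registration snippet
spark-shell \
  --packages io.github.amadeusitgroup:perfgazer_spark_3.5.2_2.12:0.0.2-SNAPSHOT \
  --conf spark.extraListeners=<listener.class.from.the.snippet>
```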

Documentation

The project uses MkDocs with the Material theme.

Local preview

pip install mkdocs mkdocs-material
mkdocs serve

Then open http://127.0.0.1:8000 in your browser.

Deployment

Documentation is versioned using mike and deployed to GitHub Pages automatically:

  • Push to main (changes in docs/ or mkdocs.yml) → deploys the dev version
  • Publishing a GitHub Release → deploys a versioned copy (e.g. v0.1.0) and updates the latest alias
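For reference, the Release path corresponds to a mike invocation along these lines; the version and alias below are examples, and the exact flags live in the CI workflow:

```shell
# deploy the docs as version v0.1.0, point the "latest" alias at it, and
# push the generated commit to the GitHub Pages branch
mike deploy --push --update-aliases v0.1.0 latest
```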

The doc site is available at amadeusitgroup.github.io/spark-perf-gazer.

Contributing

To contribute to this project, see CONTRIBUTING.md.

Releasing

To release a new version of this project, see RELEASING.md.