Contributor Guide
Technical overview
Once registered, PerfGazer listens to multiple events emitted by Spark.
Event objects at the query, job, and stage level are stored in memory for later processing.
These events are wrapped in subtypes of Event; most are start events, with a few exceptions.
They are kept in a CappedConcurrentHashMap, which enforces a maximum size so that memory usage stays bounded.
The wrapped Spark events correspond to classes such as:

- org.apache.spark...StageInfo
- org.apache.spark...SparkListenerJobEnd
- ...
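The size-capped map described above can be sketched as follows. This is a minimal, hypothetical illustration: the class name `CappedMap`, its drop-on-full policy, and the demo values are assumptions, not the project's actual `CappedConcurrentHashMap` implementation (which may evict entries instead of dropping new ones).

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch of a size-capped concurrent map: once the cap is
// reached, new insertions are dropped so memory stays bounded.
class CappedMap[K, V](maxSize: Int) {
  private val underlying = new ConcurrentHashMap[K, V]()

  // Returns true if the entry was stored, false if the cap was hit.
  def put(key: K, value: V): Boolean = {
    if (underlying.size() >= maxSize && !underlying.containsKey(key)) false
    else { underlying.put(key, value); true }
  }

  def get(key: K): Option[V] = Option(underlying.get(key))
  def size: Int = underlying.size()
}

object CappedMapDemo extends App {
  val m = new CappedMap[Int, String](maxSize = 2)
  println(m.put(1, "stage-1")) // true
  println(m.put(2, "stage-2")) // true
  println(m.put(3, "stage-3")) // false: cap reached, entry dropped
  println(m.size)              // 2
}
```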
When a SQL query, job, stage, or task finishes, a callback is triggered.
When inputs are requested from PerfGazer, all collected Events are inspected and transformed into Reports at the end
of the query/job/stage execution, enriched with extra information that only becomes available at that point, according
to the type of Event.
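The start-event/end-callback flow described above can be sketched as below. Only the names Event and Report come from this guide; every field, method, and the enrichment shown (computing a duration from start and end timestamps) are hypothetical stand-ins for the real implementation.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical sketch: a start event is stored, then enriched into a
// Report by the end callback. Field and method names are assumptions.
sealed trait Event { def id: Long }
final case class StageStartEvent(id: Long, name: String, startMs: Long) extends Event

final case class Report(id: Long, name: String, durationMs: Long)

object Reporting {
  private val pending = TrieMap.empty[Long, Event] // stand-in for the capped map

  def onStart(e: Event): Unit = pending.put(e.id, e)

  // End callback: enrich the stored start event with information only
  // available at completion time (here, the end timestamp).
  def onEnd(id: Long, endMs: Long): Option[Report] =
    pending.remove(id).collect {
      case StageStartEvent(i, name, startMs) => Report(i, name, endMs - startMs)
    }
}

object ReportingDemo extends App {
  Reporting.onStart(StageStartEvent(1L, "scan", startMs = 100L))
  println(Reporting.onEnd(1L, endMs = 350L)) // Some(Report(1,scan,250))
}
```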
A Report is the type that represents the unit of reporting shared with the end user.
A Filter operates on Reports, giving the end user control to focus on specific aspects of
their Spark ETL (file pruning, for instance).
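A Filter over Reports might look like the sketch below, using the file-pruning example from the text. To keep the snippet self-contained, Report is redefined here with hypothetical fields; the Filter interface shown is an assumption, not the project's actual API.

```scala
// Hypothetical Report with made-up fields, for illustration only.
final case class Report(kind: String, detail: String)

// Assumed Filter shape: a predicate over Reports.
trait Filter { def accept(r: Report): Boolean }

// Example: keep only file-pruning related reports.
object FilePruningFilter extends Filter {
  def accept(r: Report): Boolean = r.kind == "file-pruning"
}

object FilterDemo extends App {
  val reports = Seq(Report("file-pruning", "skipped 12 files"), Report("shuffle", "2 GiB"))
  val focused = reports.filter(FilePruningFilter.accept)
  println(focused.map(_.detail)) // List(skipped 12 files)
}
```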
Build
The project uses sbt.
```shell
sbt test                           # run tests
sbt coverageOn test coverageReport # run tests with coverage checks on
```
Dev environment
We use IntelliJ IDEA. You can update the ScalaTest configuration template to avoid repeating manual settings:
Go to Run -> Edit Configurations -> Edit configuration templates -> ScalaTest
For code formatting setup:
Settings -> Editor -> Code Style -> Scala -> Formatter: ScalaFMT
Run
You can run a local spark-shell with the listener as follows:
```shell
# (optional) clean previous local publishes, for example
find ~/.ivy2 -type f -name '*perfgazer*' | xargs rm
# publish a local snapshot version
sbt publishLocal
# run spark-shell with the listener (adjust the version accordingly) using the snippet provided above
spark-shell --packages io.github.amadeusitgroup:perfgazer_spark_3.5.2_2.12:0.0.2-SNAPSHOT ...
```
Documentation
The project uses MkDocs with the Material theme.
Local preview
```shell
pip install mkdocs mkdocs-material
mkdocs serve
```
Then open http://127.0.0.1:8000 in your browser.
Deployment
Documentation is versioned using mike and deployed to GitHub Pages automatically:
- Push to `main` (changes in `docs/` or `mkdocs.yml`) → deploys the `dev` version
- Publishing a GitHub Release → deploys a versioned copy (e.g. `v0.1.0`) and updates the `latest` alias
The doc site is available at amadeusitgroup.github.io/spark-perf-gazer.
Contributing
To contribute to this project, see CONTRIBUTING.md.
Releasing
To release a new version of this project, see RELEASING.md.