Contributor Guide
Technical overview
Once registered, PerfGazer listens to multiple events coming from Spark.
Event objects at query/job/stage level are stored in memory for later processing.
These events are wrapped by subtypes of Event; they are mostly start events, with some exceptions.
They are kept in a CappedConcurrentHashMap with a maximum size, so that memory usage stays bounded.
The wrapped Spark events relate to classes like:

- `org.apache.spark...StageInfo`
- `org.apache.spark...SparkListenerJobEnd`
- ...
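As a minimal sketch of the bounded-buffer idea, here is a hypothetical simplification of a size-capped concurrent map (the real CappedConcurrentHashMap in the code base may behave differently, e.g. in how it handles overflow):

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch of a CappedConcurrentHashMap: a concurrent map that
// refuses new entries once a maximum size is reached, so that memory used
// for buffered Spark events stays bounded. Note: the size check and the
// insert are not atomic here; that is fine for a sketch.
class CappedMap[K, V](maxSize: Int) {
  private val underlying = new ConcurrentHashMap[K, V]()

  // Returns true if the entry was stored, false if the cap was reached.
  def put(key: K, value: V): Boolean =
    if (underlying.size() >= maxSize && !underlying.containsKey(key)) false
    else { underlying.put(key, value); true }

  def get(key: K): Option[V] = Option(underlying.get(key))
  def size: Int = underlying.size()
}
```

Updating an existing key is still allowed once the cap is hit; only genuinely new keys are dropped.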
When a SQL query, a job, a stage, or a task finishes, it triggers a callback mechanism.
When reports are requested from PerfGazer, all collected Events are inspected and transformed into Reports at the end
of the query/job/stage execution, enriched with extra information that only becomes available at that point, according
to the type of Event.
A Report is a type that represents the report unit shared with the end-user.
Report case classes are annotated with @TableDoc and @ColumnDoc to serve as the single source of truth
for the data model documentation (see Data model documentation below).
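The collect-then-enrich flow described above can be sketched as follows. All type and method names here are hypothetical stand-ins, not the real Event/Report hierarchy from the code base:

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical sketch of the listener flow: a start event is buffered when a
// stage begins, and turned into a report when the matching end callback
// fires, enriched with data (here, the duration) only known at completion.
final case class StageStartEvent(stageId: Int, name: String)
final case class StageReport(stageId: Int, name: String, durationMs: Long)

class SketchListener {
  private val pending = TrieMap.empty[Int, StageStartEvent]

  // Buffer the start event in memory for later processing.
  def onStageStart(e: StageStartEvent): Unit = {
    pending.put(e.stageId, e)
  }

  // Callback fired at stage completion; combines the buffered start event
  // with completion-time information into a report.
  def onStageEnd(stageId: Int, durationMs: Long): Option[StageReport] =
    pending.remove(stageId).map(s => StageReport(s.stageId, s.name, durationMs))
}
```

Removing the buffered start event on completion is what keeps the in-memory map from growing with finished stages.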
Build
The project uses sbt.
sbt test # run tests
sbt coverageOn test coverageReport # run tests with coverage checks on
Dev environment
We use IntelliJ IDEA; you can update the ScalaTest configuration template to avoid manual settings.
Go to Run -> Edit Configurations -> Edit configuration templates -> ScalaTest
For code formatting setup:
Settings -> Editor -> Code Style -> Scala -> Formatter: ScalaFMT
Run
You can run a local spark-shell with the listener as follows:
# publish a local snapshot version
export VERSION=0.0.0-$RANDOM-$RANDOM
sbt "set ThisBuild / version := \"$VERSION\"" publishLocal
# run spark-shell with the listener published above
spark-shell \
--packages io.github.amadeusitgroup:perfgazer_spark_3-5-2_2.12:$VERSION \
--conf spark.extraListeners=com.amadeus.perfgazer.PerfGazer \
--conf spark.perfgazer.sink.class=com.amadeus.perfgazer.JsonSink \
--conf spark.perfgazer.sink.json.destination=/tmp/perfgazer/applicationId={{spark.app.id}}/ \
--conf "spark.driver.bindAddress=127.0.0.1" --conf "spark.driver.host=127.0.0.1"
Then you can run something like this in the shell to see logs from the listener:
sc.setLogLevel("INFO") // to change the log level
spark.sql("select 1").show()
:quit
Documentation
The project uses MkDocs with the Material theme.
Data model documentation
The data model (SQL view schemas) is documented via custom annotations on the report case classes in core/.../reports/.
A build-time generator (doc-generator/) reads these annotations and produces:
- `docs/user_guide/data_model.md`: human-friendly Markdown tables with SQL types
- `docs/schema/perfgazer-schema.json`: agent-friendly structured JSON
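Conceptually, the generator turns annotation metadata into Markdown tables. A toy sketch of that rendering step (hypothetical types and column names; the real doc-generator reads the metadata from the annotated report classes):

```scala
// Toy sketch of the doc-generation step: given column metadata conceptually
// extracted from @ColumnDoc annotations, emit a Markdown table for one
// report class. Not the real doc-generator code.
final case class ColumnMeta(
    name: String,
    sqlType: String,
    description: String,
    unit: Option[String])

def renderTable(table: String, cols: Seq[ColumnMeta]): String = {
  val header = Seq(
    s"### $table",
    "",
    "| Column | SQL type | Description |",
    "|---|---|---|")
  val rows = cols.map { c =>
    // Fold the optional unit into the description, e.g. "... (ms)".
    val desc = c.unit.fold(c.description)(u => s"${c.description} ($u)")
    s"| ${c.name} | ${c.sqlType} | $desc |"
  }
  (header ++ rows).mkString("\n")
}
```

The JSON output follows the same idea, serializing the same metadata into a machine-readable schema instead of Markdown.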
When adding or modifying fields in a report case class, annotate them with @ColumnDoc:
@ColumnDoc(description = "Wall-clock duration of the task", unit = "ms")
taskDuration: Long,
When adding a new report case class, annotate the class with @TableDoc:
@TableDoc(name = "task", description = "Task-level execution metrics. One row per completed Spark task.")
case class TaskReport(
...
Both generated files are gitignored — they are produced by CI and by the local preview script.
Local preview
First, generate the data model schemas from the annotated case classes:
sbt docGenerator/run
Then serve the site locally:
pip install mkdocs mkdocs-material
mkdocs serve
Open http://127.0.0.1:8000 in your browser.
Alternatively, ./scripts/docs-serve-local.sh runs both steps in sequence.
Full build
To reproduce the full CI documentation build (including llms.txt for AI agents):
./scripts/docs-build.sh
This runs schema generation, mkdocs build, and generate-llms-txt.sh. Output goes to site/.
Deployment
Documentation is versioned using mike and deployed to GitHub Pages automatically under the following conditions:
- When pushing to `main` with changes in `docs/`, `mkdocs.yml`, report classes, or `doc-generator/`, the `dev` version is deployed.
- When publishing a GitHub Release, a versioned copy (e.g. `v0.1.0`) is deployed and the `latest` alias is updated.
The doc site is available at amadeusitgroup.github.io/spark-perf-gazer.
Removing a published version
Deleting a GitHub Release does not remove the version from the docs site — versioned docs live as static files on the gh-pages branch, independent of GitHub Releases.
To remove a version from the docs site:
mike delete <version> # e.g. mike delete v0.1.0
git push origin gh-pages
If you also want to clean up the GitHub Release and its tag, do that separately via the GitHub UI or CLI.
Scripts reference
| Script | Purpose |
|---|---|
| `scripts/docs-serve-local.sh` | Full local docs build + live preview server (`mkdocs serve`) |
| `scripts/docs-build.sh` | Full docs build (schemas + MkDocs + llms.txt), same as CI |
| `scripts/generate-llms-txt.sh` | Generate `llms.txt` and `llms-full.txt` in `docs/` for agent consumption |
Contributing
To contribute to this project, see CONTRIBUTING.md.
Releasing
To release a new version of this project, see RELEASING.md.