The following examples have the following `spec` fields in common (a minimal skeleton combining them is sketched after this list):

- `version`: the current version is "1.0"
- `sparkImage`: the Docker image that will be used by the job, driver and executor pods. This can be provided by the user.
- `mode`: only `cluster` is currently supported
- `mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
- `args`: these are the arguments passed directly to the application. In the examples below it is e.g. the input path for part of the public New York taxi dataset.
- `sparkConf`: these list Spark configuration settings that are passed directly to `spark-submit` and which are best defined explicitly by the user. Since the `SparkApplication` "knows" that there is an external dependency (the S3 bucket where the data and/or the application is located) and how that dependency should be treated (i.e. what type of credential checks are required, if any), it is better to have these things declared together.
- `volumes`: refers to any volumes needed by the `SparkApplication`, in this case an underlying `PersistentVolumeClaim`.
- `driver`: driver-specific settings, including any volume mounts.
- `executor`: executor-specific settings, including any volume mounts.
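For orientation, a minimal skeleton that combines these common fields might look roughly as follows. The `apiVersion`, the image tag, the bucket paths and the exact nesting of the driver/executor settings are placeholders or assumptions and may differ between operator versions; the linked example files below are authoritative.

[source,yaml]
----
---
apiVersion: spark.stackable.tech/v1alpha1   # assumed API group/version
kind: SparkApplication
metadata:
  name: example-sparkapp
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/spark-k8s:3.3.0-stackable0.1.0  # placeholder image
  mode: cluster
  mainApplicationFile: s3a://my-bucket/ny-tlc-report.jar        # placeholder artifact
  args:
    - "s3a://my-bucket/yellow_tripdata_2021-07.csv"             # placeholder input path
  sparkConf:
    spark.driver.extraClassPath: "/dependencies/jars/*"         # placeholder class path
    spark.executor.extraClassPath: "/dependencies/jars/*"
  volumes:
    - name: job-deps
      persistentVolumeClaim:
        claimName: pvc-ksv                                      # the PVC must already exist
  driver:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies
  executor:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies
----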
Job-specific settings are annotated below.
link:example$example-sparkapp-external-dependencies.yaml[role=include]
- Job python artifact (external)
- Job argument (external)
- List of python job requirements: these will be installed in the pods via `pip`
- Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in S3); see the sketch after this list
- the name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
- the path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors
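The requirements and S3 dependency settings from this example take roughly the following shape. The `deps.requirements` field name and the package entry are assumptions; `spark.hadoop.fs.s3a.aws.credentials.provider` is a standard Spark/Hadoop S3A setting, shown here with an illustrative provider class.

[source,yaml]
----
spec:
  # ...
  deps:
    requirements:                   # assumed field: python packages installed via pip in the pods
      - tabulate==0.8.9             # illustrative entry
  sparkConf:
    # S3A credentials provider chosen by the user; anonymous access shown as an example
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
----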
link:example$example-sparkapp-image.yaml[role=include]
- Job image: this contains the job artifact that will be retrieved from the volume mount backed by the PVC (see the sketch after this list)
- Job python artifact (local)
- Job argument (external)
- List of python job requirements: these will be installed in the pods via `pip`
- Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)
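When the artifact ships inside the job image, the relevant fields look roughly like this. The image name and the `local://` path are hypothetical placeholders; `local://` is the standard Spark scheme for files that are already present inside the container.

[source,yaml]
----
spec:
  # ...
  sparkImage: docker.example.com/my-org/ny-tlc-report:1.0.0    # hypothetical image containing the job artifact
  mainApplicationFile: local:///app/ny-tlc-report.py           # local python artifact baked into the image
  args:
    - "s3a://my-bucket/yellow_tripdata_2021-07.csv"            # external input data
----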
link:example$example-sparkapp-pvc.yaml[role=include]
- Job artifact located on S3.
- Job main class
- Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials); see the sketch after this list
- the name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
- the path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors
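A rough sketch of how an S3-hosted JVM artifact with anonymous access fits together. The bucket path and class name are hypothetical and the `mainClass` field name is an assumption; the anonymous S3A credentials provider is a standard Hadoop class.

[source,yaml]
----
spec:
  # ...
  mainApplicationFile: s3a://my-bucket/ny-tlc-report-1.0.jar   # artifact fetched from S3
  mainClass: org.example.NyTlcReport                           # assumed field name, hypothetical class
  sparkConf:
    # public bucket, so no credentials are required
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
----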
link:example$example-sparkapp-s3-private.yaml[role=include]
- Job python artifact (located in an S3 store)
- Artifact class
- S3 section, specifying the existing secret and S3 end-point (in this case, MinIO); see the sketch after this list
- Credentials referencing a secretClass (not shown in this example)
- Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources…
- …in this case, in an S3 store, accessed with the credentials defined in the secret
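A very rough sketch of the credentialled S3 wiring. The field names in the S3 section (`s3connection`, `inline`, `credentials`, `secretClass`) are assumptions and may differ between operator versions, so treat the linked example file as authoritative; the credentials-provider key is a standard Spark/Hadoop S3A setting.

[source,yaml]
----
spec:
  # ...
  s3connection:                           # assumed field name for the S3 section
    inline:
      host: test-minio                    # illustrative MinIO endpoint
      port: 9000
      credentials:
        secretClass: s3-credentials-class # references a SecretClass whose definition is not shown here
  sparkConf:
    # read the provisioned access key/secret key instead of using anonymous access
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
----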
link:example$example-configmap.yaml[role=include]
link:example$example-sparkapp-configmap.yaml[role=include]
- Name of the configuration map
- Argument required by the job
- Job Scala artifact that requires an input argument
- The volume backed by the configuration map
- The expected job argument, accessed via the mounted configuration map file
- The name of the volume backed by the configuration map that will be mounted to the driver/executor
- The mount location of the volume (this will contain a file `/arguments/job-args.txt`); the ConfigMap/volume wiring is sketched after this list
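Putting these pieces together, the ConfigMap and the way it is mounted into the driver and executor pods might look roughly as follows. The names, the `--input` flag and the exact nesting of the `volumeMounts` are assumptions or placeholders; the mount path and file name (`/arguments/job-args.txt`) follow the description above.

[source,yaml]
----
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-job-arguments              # hypothetical name, referenced from the SparkApplication volume
data:
  # illustrative content: the single argument the job expects
  job-args.txt: |
    s3a://my-bucket/yellow_tripdata_2021-07.csv
---
# excerpt of the corresponding SparkApplication spec
spec:
  args:
    - "--input /arguments/job-args.txt"   # hypothetical flag: the job reads its real argument from the mounted file
  volumes:
    - name: cm-job-arguments
      configMap:
        name: cm-job-arguments
  driver:
    volumeMounts:
      - name: cm-job-arguments
        mountPath: /arguments             # the mounted ConfigMap appears as /arguments/job-args.txt
  executor:
    volumeMounts:
      - name: cm-job-arguments
        mountPath: /arguments
----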