The following examples have the following `spec` fields in common (a minimal skeleton combining them is sketched after this list):

- `version`: the current version is "1.0"
- `sparkImage`: the Docker image that will be used by the job, driver and executor pods. This can be provided by the user.
- `mode`: only `cluster` is currently supported
- `mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
- `args`: these are the arguments passed directly to the application. In the examples below it is e.g. the input path for part of the public New York taxi dataset.
- `sparkConf`: these list Spark configuration settings that are passed directly to `spark-submit` and which are best defined explicitly by the user. Since the `SparkApplication` "knows" that there is an external dependency (the S3 bucket where the data and/or the application is located) and how that dependency should be treated (i.e. what type of credential checks are required, if any), it is better to have these things declared together.
- `volumes`: refers to any volumes needed by the `SparkApplication`, in this case an underlying `PersistentVolumeClaim`.
- `driver`: driver-specific settings, including any volume mounts.
- `executor`: executor-specific settings, including any volume mounts.
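For orientation, a minimal skeleton that combines these common fields might look roughly as follows. The `apiVersion`, the image tag, the bucket paths and the exact nesting of the driver/executor settings are placeholders or assumptions and may differ between operator versions; the linked example files below are authoritative.

[source,yaml]
----
---
apiVersion: spark.stackable.tech/v1alpha1   # assumed API group/version
kind: SparkApplication
metadata:
  name: example-sparkapp
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/spark-k8s:3.3.0-stackable0.1.0  # placeholder image
  mode: cluster
  mainApplicationFile: s3a://my-bucket/ny-tlc-report.jar        # placeholder artifact
  args:
    - "s3a://my-bucket/yellow_tripdata_2021-07.csv"             # placeholder input path
  sparkConf:
    spark.driver.extraClassPath: "/dependencies/jars/*"         # placeholder class path
    spark.executor.extraClassPath: "/dependencies/jars/*"
  volumes:
    - name: job-deps
      persistentVolumeClaim:
        claimName: pvc-ksv                                      # the PVC must already exist
  driver:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies
  executor:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies
----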
Job-specific settings are annotated below.
link:example$example-sparkapp-external-dependencies.yaml[role=include]
- Job python artifact (external)
- Job argument (external)
- List of python job requirements: these will be installed in the pods via `pip`
- Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in S3); see the sketch after this list
- the name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
- the path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors
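The requirements and S3 dependency settings from this example take roughly the following shape. The `deps.requirements` field name and the package entry are assumptions; `spark.hadoop.fs.s3a.aws.credentials.provider` is a standard Spark/Hadoop S3A setting, shown here with an illustrative provider class.

[source,yaml]
----
spec:
  # ...
  deps:
    requirements:                   # assumed field: python packages installed via pip in the pods
      - tabulate==0.8.9             # illustrative entry
  sparkConf:
    # S3A credentials provider chosen by the user; anonymous access shown as an example
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
----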
link:example$example-sparkapp-image.yaml[role=include]
- Job image: this contains the job artifact that will be retrieved from the volume mount backed by the PVC (see the sketch after this list)
- Job python artifact (local)
- Job argument (external)
- List of python job requirements: these will be installed in the pods via `pip`
- Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)
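When the artifact ships inside the job image, the relevant fields look roughly like this. The image name and the `local://` path are hypothetical placeholders; `local://` is the standard Spark scheme for files that are already present inside the container.

[source,yaml]
----
spec:
  # ...
  sparkImage: docker.example.com/my-org/ny-tlc-report:1.0.0    # hypothetical image containing the job artifact
  mainApplicationFile: local:///app/ny-tlc-report.py           # local python artifact baked into the image
  args:
    - "s3a://my-bucket/yellow_tripdata_2021-07.csv"            # external input data
----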
link:example$example-sparkapp-pvc.yaml[role=include]
- Job artifact located on S3.
- Job main class
- Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials); see the sketch after this list
- the name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
- the path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors
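A rough sketch of how an S3-hosted JVM artifact with anonymous access fits together. The bucket path and class name are hypothetical and the `mainClass` field name is an assumption; the anonymous S3A credentials provider is a standard Hadoop class.

[source,yaml]
----
spec:
  # ...
  mainApplicationFile: s3a://my-bucket/ny-tlc-report-1.0.jar   # artifact fetched from S3
  mainClass: org.example.NyTlcReport                           # assumed field name, hypothetical class
  sparkConf:
    # public bucket, so no credentials are required
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
----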
link:example$example-sparkapp-s3-private.yaml[role=include]
- Job python artifact (located in an S3 store)
- Artifact class
- S3 section, specifying the existing secret and S3 end-point (in this case, MinIO); see the sketch after this list
- Credentials referencing a secretClass (not shown in this example)
- Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources…
- …in this case, in an S3 store, accessed with the credentials defined in the secret
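A very rough sketch of the credentialled S3 wiring. The field names in the S3 section (`s3connection`, `inline`, `credentials`, `secretClass`) are assumptions and may differ between operator versions, so treat the linked example file as authoritative; the credentials-provider key is a standard Spark/Hadoop S3A setting.

[source,yaml]
----
spec:
  # ...
  s3connection:                           # assumed field name for the S3 section
    inline:
      host: test-minio                    # illustrative MinIO endpoint
      port: 9000
      credentials:
        secretClass: s3-credentials-class # references a SecretClass whose definition is not shown here
  sparkConf:
    # read the provisioned access key/secret key instead of using anonymous access
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
----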
link:example$example-configmap.yaml[role=include]
link:example$example-sparkapp-configmap.yaml[role=include]
- Name of the configuration map
- Argument required by the job
- Job Scala artifact that requires an input argument
- The volume backed by the configuration map
- The expected job argument, accessed via the mounted configuration map file
- The name of the volume backed by the configuration map that will be mounted to the driver/executor
- The mount location of the volume (this will contain a file `/arguments/job-args.txt`); the ConfigMap/volume wiring is sketched after this list
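Putting these pieces together, the ConfigMap and the way it is mounted into the driver and executor pods might look roughly as follows. The names, the `--input` flag and the exact nesting of the `volumeMounts` are assumptions or placeholders; the mount path and file name (`/arguments/job-args.txt`) follow the description above.

[source,yaml]
----
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-job-arguments              # hypothetical name, referenced from the SparkApplication volume
data:
  # illustrative content: the single argument the job expects
  job-args.txt: |
    s3a://my-bucket/yellow_tripdata_2021-07.csv
---
# excerpt of the corresponding SparkApplication spec
spec:
  args:
    - "--input /arguments/job-args.txt"   # hypothetical flag: the job reads its real argument from the mounted file
  volumes:
    - name: cm-job-arguments
      configMap:
        name: cm-job-arguments
  driver:
    volumeMounts:
      - name: cm-job-arguments
        mountPath: /arguments             # the mounted ConfigMap appears as /arguments/job-args.txt
  executor:
    volumeMounts:
      - name: cm-job-arguments
        mountPath: /arguments
----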