
Examples

The examples below have the following spec fields in common (a minimal sketch of how they fit together follows this list):

  • version: the current version is "1.0"

  • sparkImage: the Docker image used by the job, driver and executor pods. It can be provided by the user.

  • mode: only cluster is currently supported

  • mainApplicationFile: the artifact (Java, Scala or Python) that forms the basis of the Spark job.

  • args: the arguments passed directly to the application. In the examples below this is e.g. the input path for part of the public New York taxi dataset.

  • sparkConf: Spark configuration settings that are passed directly to spark-submit and are best defined explicitly by the user. Since the SparkApplication "knows" that there is an external dependency (the S3 bucket where the data and/or the application is located) and how that dependency should be treated (i.e. what type of credential checks are required, if any), it is better to declare these settings together.

  • volumes: refers to any volumes needed by the SparkApplication, in this case an underlying PersistentVolumeClaim.

  • driver: driver-specific settings, including any volume mounts.

  • executor: executor-specific settings, including any volume mounts.
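As a rough illustration of how these common fields fit together, the following skeleton sketches a SparkApplication with placeholder values; the apiVersion, image tag, paths and names are assumptions for illustration only, and the example files referenced below are authoritative.

[source,yaml]
----
---
apiVersion: spark.stackable.tech/v1alpha1   # assumed API group/version; check the operator's CRD
kind: SparkApplication
metadata:
  name: example-sparkapp                    # placeholder name
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/spark-k8s:3.3.0-stackable23.4   # placeholder image and tag
  mode: cluster
  mainApplicationFile: s3a://my-bucket/jobs/my-job.py   # placeholder artifact location
  args:
    - "--input 's3a://my-bucket/data/input.csv'"        # placeholder job argument
  sparkConf:                                            # settings passed directly to spark-submit
    spark.hadoop.fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
  volumes:
    - name: job-deps                                    # volume needed by the SparkApplication
      persistentVolumeClaim:
        claimName: my-pvc                               # the PVC must already exist
  driver:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies
  executor:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies
----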

Job-specific settings are annotated below.

PySpark: externally located artifact and dataset

link:example$example-sparkapp-external-dependencies.yaml[role=include]
  1. Job Python artifact (external)

  2. Job argument (external)

  3. List of Python job requirements: these will be installed in the pods via pip

  4. Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in S3)

  5. The name of the volume mount backed by a PersistentVolumeClaim that must be pre-existing

  6. The path on the volume mount: this is referenced in the sparkConf section where the extra class path is defined for the driver and executors
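Since the example file is only referenced above, here is a hedged sketch of the job-specific parts described by these callouts; the deps/requirements field names, the package and all paths are assumptions, not values taken from the file.

[source,yaml]
----
spec:
  mainApplicationFile: s3a://my-bucket/jobs/ny_tlc_report.py   # (1) external Python artifact, placeholder path
  args:
    - "--input 's3a://nyc-tlc/trip data/taxi-data.csv'"        # (2) external dataset, placeholder path
  deps:
    requirements:                  # (3) assumed field names; packages are installed in the pods via pip
      - tabulate==0.8.9
  sparkConf:                       # (4) credentials provider and extra class path for S3 access
    spark.hadoop.fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
    spark.driver.extraClassPath: /dependencies/jars/*
    spark.executor.extraClassPath: /dependencies/jars/*
----

The volume (5) and its mount path (6) follow the same pattern as in the common sketch above, with the mount path matching the extraClassPath entries.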

PySpark: externally located dataset, artifact available via PVC/volume mount

link:example$example-sparkapp-image.yaml[role=include]
  1. Job image: this contains the job artifact that will be retrieved from the volume mount backed by the PVC

  2. Job Python artifact (local)

  3. Job argument (external)

  4. List of Python job requirements: these will be installed in the pods via pip

  5. Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)
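As before, a hedged sketch of the job-specific fields; the image name, the local artifact path and the deps/requirements field names are assumptions.

[source,yaml]
----
spec:
  image: docker.example.com/my-org/ny-tlc-report:0.1.0                  # (1) job image, placeholder name
  mainApplicationFile: local:///stackable/spark/jobs/ny_tlc_report.py   # (2) Python artifact available locally in the pod, placeholder path
  args:
    - "--input 's3a://nyc-tlc/trip data/taxi-data.csv'"                 # (3) external dataset, placeholder path
  deps:
    requirements:                  # (4) assumed field names; packages are installed in the pods via pip
      - tabulate==0.8.9
----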

JVM (Scala): externally located artifact and dataset

link:example$example-sparkapp-pvc.yaml[role=include]
  1. Job artifact located on S3.

  2. Job main class

  3. Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials)

  4. The name of the volume mount backed by a PersistentVolumeClaim that must be pre-existing

  5. The path on the volume mount: this is referenced in the sparkConf section where the extra class path is defined for the driver and executors
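A hedged sketch of the job-specific fields for this example; the mainClass field name, the class name and all paths are placeholders or assumptions.

[source,yaml]
----
spec:
  mainApplicationFile: s3a://my-bucket/jobs/my-spark-job.jar   # (1) placeholder jar located on S3
  mainClass: org.example.MySparkJob                            # (2) assumed field name, placeholder main class
  sparkConf:                                                   # (3) anonymous S3 access plus extra class path
    spark.hadoop.fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
    spark.driver.extraClassPath: /dependencies/jars/*
    spark.executor.extraClassPath: /dependencies/jars/*
----

The volume (4) and its mount path (5) again follow the pattern from the common sketch above.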

JVM (Scala): externally located artifact accessed with credentials

link:example$example-sparkapp-s3-private.yaml[role=include]
  1. Job artifact (located in an S3 store)

  2. Artifact class

  3. S3 section, specifying the existing secret and S3 endpoint (in this case, MinIO)

  4. Credentials referencing a secretClass (not shown in this example)

  5. Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources…

  6. …in this case, in an S3 store, accessed with the credentials defined in the secret
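A hedged sketch of the S3-specific parts; the s3connection/inline/credentials field names are assumptions based on the callouts, and the host, port, secretClass and class names are placeholders.

[source,yaml]
----
spec:
  mainApplicationFile: s3a://my-bucket/jobs/my-spark-job.jar   # (1) placeholder jar in an S3 store
  mainClass: org.example.MySparkJob                            # (2) assumed field name, placeholder class
  s3connection:                                                # (3) assumed field names: S3 endpoint (e.g. MinIO)
    inline:
      host: test-minio
      port: 9000
      credentials:
        secretClass: s3-credentials-class                      # (4) references a secretClass (not shown here)
  sparkConf:                                                   # (5)(6) credentials provider using the secret's credentials
    spark.hadoop.fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
----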

JVM (Scala): externally located artifact accessed with job arguments provided via configuration map

link:example$example-configmap.yaml[role=include]
link:example$example-sparkapp-configmap.yaml[role=include]
  1. Name of the configuration map

  2. Argument required by the job

  3. Job scala artifact that requires an input argument

  4. The volume backed by the configuration map

  5. The expected job argument, accessed via the mounted configuration map file

  6. The name of the volume backed by the configuration map that will be mounted to the driver/executor

  7. The mount location of the volume (this will contain a file /arguments/job-args.txt)
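A hedged sketch of how the two manifests fit together; the configuration map name, key, argument value and mount path follow the callouts where given, and everything else is a placeholder or assumption.

[source,yaml]
----
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-job-arguments             # (1) name of the configuration map, placeholder
data:
  # (2) the argument required by the job, placeholder value
  job-args.txt: |
    s3a://nyc-tlc/trip data/taxi-data.csv
---
# Relevant excerpt from the SparkApplication spec:
spec:
  mainApplicationFile: s3a://my-bucket/jobs/my-spark-job.jar   # (3) Scala artifact that requires an input argument, placeholder
  args:
    - "--input /arguments/job-args.txt"                        # (5) the expected argument, read from the mounted file
  volumes:
    - name: cm-job-arguments                                   # (4) volume backed by the configuration map
      configMap:
        name: cm-job-arguments
  driver:
    volumeMounts:
      - name: cm-job-arguments                                 # (6) mounted into the driver (and likewise the executor)
        mountPath: /arguments                                  # (7) mount location containing /arguments/job-args.txt
  executor:
    volumeMounts:
      - name: cm-job-arguments
        mountPath: /arguments
----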