
Usage

Create an Apache Spark job

If you followed the installation instructions, you should now have a Stackable Operator for Apache Spark up and running, and you are ready to create your first Apache Spark job on Kubernetes.

The example below creates a job running on Apache Spark 3.2.1, using the Spark-on-Kubernetes paradigm described in the Spark documentation. The application file is itself part of the Spark distribution, and the local:// scheme refers to the path inside the driver and executor pods; there are no external dependencies.

cat <<EOF | kubectl apply -f -
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: spark-clustermode-001
spec:
  version: "1.0"
  mode: cluster
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///stackable/spark/examples/jars/spark-examples_2.12-3.2.1.jar
  sparkImage: docker.stackable.tech/stackable/spark-k8s:3.2.1-hadoop3.2-stackable0.4.0
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
  executor:
    cores: 1
    instances: 3
    memory: "512m"
EOF
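
Once the manifest is applied, the operator launches the job. As a rough sketch of how to check on it (the spark-role labels below are the standard Spark-on-Kubernetes pod labels, an assumption rather than something defined on this page):

# List the SparkApplication resources in the cluster
kubectl get sparkapplications.spark.stackable.tech

# Inspect the pods spawned for the job; Spark-on-Kubernetes normally labels them
# with spark-role=driver and spark-role=executor
kubectl get pods -l spark-role=driver
kubectl get pods -l spark-role=executor

# Read the driver log, e.g. to find the value of Pi computed by SparkPi
kubectl logs -l spark-role=driver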

Examples

The following examples have these spec fields in common:

  • version: the current version is "1.0"

  • sparkImage: the Docker image that will be used by the job, driver and executor pods. This can be provided by the user.

  • mode: only cluster is currently supported

  • mainApplicationFile: the artifact (Java, Scala or Python) that forms the basis of the Spark job.

  • args: the arguments passed directly to the application, e.g. in the examples below the input path for part of the public New York taxi dataset.

  • sparkConf: a list of Spark configuration settings that are passed directly to spark-submit and that are best defined explicitly by the user. Since the SparkApplication "knows" that there is an external dependency (the S3 bucket where the data and/or the application is located) and how that dependency should be treated (i.e. what type of credential checks are required, if any), it is better to have these things declared together.

  • volumes: refers to any volumes needed by the SparkApplication, in this case an underlying PersistentVolumeClaim.

  • driver: driver-specific settings, including any volume mounts.

  • executor: executor-specific settings, including any volume mounts.
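
Taken together, these common fields form the skeleton of a SparkApplication manifest. The sketch below is illustrative only: the artifact location, bucket, volume and path names are placeholders, not values taken from the bundled examples.

apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp                    # placeholder job name
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/spark-k8s:3.2.1-hadoop3.2-stackable0.4.0
  mode: cluster
  mainApplicationFile: s3a://my-bucket/my-job.py        # placeholder artifact location
  args:
    - "s3a://my-bucket/ny-taxi-data/raw/"               # placeholder input path
  sparkConf:
    # e.g. the S3A credentials provider plus any extra class-path entries
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  volumes:
    - name: job-deps                                    # placeholder volume name
      persistentVolumeClaim:
        claimName: my-pvc                               # pre-existing PVC (placeholder)
  driver:
    cores: 1
    memory: "512m"
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies                        # placeholder mount path
  executor:
    cores: 1
    instances: 3
    memory: "512m"
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies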

Job-specific settings are annotated below.

Pyspark: externally located artifact and dataset

link:example$example-sparkapp-external-dependencies.yaml[role=include]
  1. Job Python artifact (external)

  2. Job argument (external)

  3. List of Python job requirements: these will be installed in the pods via pip

  4. Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in S3)

  5. The name of the volume mount backed by a PersistentVolumeClaim that must be pre-existing

  6. The path on the volume mount: this is referenced in the sparkConf section where the extra class path is defined for the driver and executors
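
For orientation, the snippet below sketches how the annotated fields might appear in such a manifest. The artifact, argument, requirement, volume and path values are placeholders rather than the contents of the linked example file; the numbered comments refer to the callouts above.

spec:
  mainApplicationFile: s3a://my-bucket/ny-tlc-report.py              # (1) external Python artifact (placeholder)
  args:
    - "--input 's3a://my-bucket/ny-taxi-data/raw/'"                  # (2) external job argument (placeholder)
  deps:
    requirements:
      - tabulate==0.8.9                                              # (3) installed via pip (placeholder package)
  sparkConf:                                                         # (4)
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
    spark.driver.extraClassPath: "/dependencies/jars/*"              # (6) class path on the mounted volume
    spark.executor.extraClassPath: "/dependencies/jars/*"
  volumes:
    - name: job-deps                                                 # (5) backed by a pre-existing PVC
      persistentVolumeClaim:
        claimName: my-pvc
  driver:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies                                     # (6)
  executor:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies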

Pyspark: externally located dataset, artifact available via PVC/volume mount

link:example$example-sparkapp-image.yaml[role=include]
  1. Job image: this contains the job artifact that will be retrieved from the volume mount backed by the PVC

  2. Job Python artifact (local)

  3. Job argument (external)

  4. List of Python job requirements: these will be installed in the pods via pip

  5. Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in S3)

  6. The name of the volume mount backed by a PersistentVolumeClaim that must be pre-existing

  7. The path on the volume mount: this is referenced in the sparkConf section where the extra class path is defined for the driver and executors
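
The distinguishing fields in this variant are the user-supplied job image and the local:// artifact path. A short sketch with placeholder values (numbers refer to the callouts above):

spec:
  image: docker.stackable.tech/my-org/ny-tlc-report:0.1.0       # (1) user image containing the job artifact (placeholder)
  mainApplicationFile: local:///dependencies/ny-tlc-report.py   # (2) local path on the mounted volume (placeholder)
  args:
    - "--input 's3a://my-bucket/ny-taxi-data/raw/'"             # (3) external dataset (placeholder)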

JVM (Scala): externally located artifact and dataset

link:example$example-sparkapp-pvc.yaml[role=include]
  1. Job artifact located on S3.

  2. Job main class

  3. Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in S3, accessed without credentials)

  4. The name of the volume mount backed by a PersistentVolumeClaim that must be pre-existing

  5. The path on the volume mount: this is referenced in the sparkConf section where the extra class path is defined for the driver and executors
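
For a JVM artifact the manifest additionally names a main class. A minimal sketch of the relevant fields, using placeholder values:

spec:
  mainApplicationFile: s3a://my-bucket/ny-tlc-report-1.0-SNAPSHOT.jar   # (1) job artifact on S3 (placeholder)
  mainClass: tech.stackable.demo.NyTlcReport                            # (2) placeholder main class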

JVM (Scala): externally located artifact accessed with credentials

link:example$example-sparkapp-s3-private.yaml[role=include]
  1. Job artifact (located in S3)

  2. Artifact class

  3. S3 section, specifying the existing secret and S3 end-point (in this case, MinIO)

  4. Credentials secret

  5. Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources…

  6. …in this case, in S3, accessed with the credentials defined in the secret

  7. The name of the volume mount backed by a PersistentVolumeClaim that must be pre-existing

  8. The path on the volume mount: this is referenced in the sparkConf section where the extra class path is defined for the driver and executors
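
The part specific to this example is the s3 block that references an existing credentials secret and the (in this case MinIO) endpoint. A sketch with placeholder names:

spec:
  mainApplicationFile: s3a://my-bucket/spark-examples_2.12-3.2.1.jar   # (1) artifact in a private bucket (placeholder)
  mainClass: org.apache.spark.examples.SparkPi                          # (2) entry point of the artifact
  s3:                                                                   # (3)
    endpoint: http://test-minio:9000                                    # placeholder MinIO endpoint
    credentialsSecret: minio-credentials                                # (4) name of a pre-existing secret (placeholder)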

JVM (Scala): externally located artifact accessed with job arguments provided via a configuration map

link:example$example-configmap.yaml[role=include]
link:example$example-sparkapp-configmap.yaml[role=include]
  1. Name of the configuration map

  2. Argument required by the job

  3. Job Scala artifact that requires an input argument

  4. The volume backed by the configuration map

  5. The expected job argument, accessed via the mounted configuration map file

  6. The name of the volume backed by the configuration map that will be mounted to the driver/executor

  7. The mount location of the volume (this will contain a file /arguments/job-args.txt)
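
A sketch of the two pieces involved, using placeholder names: a ConfigMap carrying the argument file, and the corresponding volume and volume mounts in the SparkApplication (numbers refer to the callouts above).

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-job-arguments                  # (1) placeholder config map name
data:
  job-args.txt: |                         # (2) the argument consumed by the job
    s3a://my-bucket/ny-taxi-data/raw/
---
# ...and inside the SparkApplication spec:
spec:
  volumes:
    - name: cm-job-arguments              # (4)/(6) volume backed by the config map
      configMap:
        name: cm-job-arguments
  driver:
    volumeMounts:
      - name: cm-job-arguments
        mountPath: /arguments             # (7) the job reads /arguments/job-args.txt
  executor:
    volumeMounts:
      - name: cm-job-arguments
        mountPath: /arguments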

CRD argument coverage

Below are listed the CRD fields that can be defined by the user:

CRD field | Remarks

apiVersion | spark.stackable.tech/v1alpha1
kind | SparkApplication
metadata.name | Job name
spec.version | "1.0"
spec.mode | cluster or client. Currently only cluster is supported
spec.image | User-supplied image containing spark-job dependencies that will be copied to the specified volume mount
spec.sparkImage | Spark image which will be deployed to driver and executor pods, and which must contain the Spark environment needed by the job, e.g. docker.stackable.tech/stackable/spark-k8s:3.2.1-hadoop3.2-stackable0.4.0
spec.mainApplicationFile | The actual application file that will be called by spark-submit
spec.mainClass | The main class, i.e. entry point, for JVM artifacts
spec.args | Arguments passed directly to the job artifact
spec.s3.credentialsSecret | Name of the credentials secret for S3 access
spec.s3.endpoint | S3 endpoint
spec.sparkConf | A map of key/value strings that will be passed directly to spark-submit
spec.deps.requirements | A list of Python packages that will be installed via pip
spec.deps.packages | A list of packages that is passed directly to spark-submit
spec.deps.excludePackages | A list of excluded packages that is passed directly to spark-submit
spec.deps.repositories | A list of repositories that is passed directly to spark-submit
spec.volumes | A list of volumes
spec.volumes.name | The volume name
spec.volumes.persistentVolumeClaim.claimName | The persistent volume claim backing the volume
spec.driver.cores | Number of cores used by the driver (only in cluster mode)
spec.driver.coreLimit | Hard limit on the CPU cores for the driver
spec.driver.memory | Memory specified for the driver
spec.driver.volumeMounts | A list of mounted volumes for the driver
spec.driver.volumeMounts.name | Name of the mount
spec.driver.volumeMounts.mountPath | Volume mount path
spec.executor.cores | Number of cores for each executor
spec.executor.instances | Number of executor instances launched for this job
spec.executor.memory | Memory specified for each executor
spec.executor.volumeMounts | A list of mounted volumes for each executor
spec.executor.volumeMounts.name | Name of the mount
spec.executor.volumeMounts.mountPath | Volume mount path