The following examples have these `spec` fields in common:

- `version`: the current version is "1.0"
- `sparkImage`: the docker image that will be used by job, driver and executor pods. This can be provided by the user.
- `mode`: only `cluster` is currently supported
- `mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
- `args`: these are the arguments passed directly to the application. In the examples below it is e.g. the input path for part of the public New York taxi dataset.
- `sparkConf`: these list Spark configuration settings that are passed directly to `spark-submit` and which are best defined explicitly by the user. Since the `SparkApplication` "knows" that there is an external dependency (the S3 bucket where the data and/or the application is located) and how that dependency should be treated (i.e. what type of credential checks are required, if any), it is better to have these things declared together.
- `volumes`: refers to any volumes needed by the `SparkApplication`, in this case an underlying `PersistentVolumeClaim`.
- `driver`: driver-specific settings, including any volume mounts.
- `executor`: executor-specific settings, including any volume mounts.
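Taken together, these common fields form a `SparkApplication` skeleton roughly like the sketch below. All names, the image and the paths are placeholders rather than values taken from the examples:

[source,yaml]
----
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: my-spark-job                                    # placeholder job name
spec:
  version: "1.0"
  sparkImage: my-registry/spark-k8s:latest              # placeholder Spark image
  mode: cluster                                         # only cluster mode is currently supported
  mainApplicationFile: s3a://my-bucket/jobs/my-job.py   # placeholder job artifact
  args:
    - "--input 's3a://my-bucket/data/input.csv'"        # placeholder argument passed to the application
  sparkConf:                                            # settings passed directly to spark-submit
    spark.hadoop.fs.s3a.endpoint: "http://test-minio:9000"
  volumes:
    - name: job-deps                                    # placeholder volume backed by a pre-existing PVC
      persistentVolumeClaim:
        claimName: pvc-job-deps
  driver:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies                        # placeholder mount path
  executor:
    instances: 3
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies
----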
Job-specific settings are annotated below.
link:example$example-sparkapp-external-dependencies.yaml[role=include]
- Job python artifact (external)
- Job argument (external)
- List of python job requirements: these will be installed in the pods via `pip`
- Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in S3)
- The name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
- The path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors (see the sketch below)
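As a rough illustration of how the pip requirements, the credentials provider and the extra class path fit into the `spec`, consider the following fragment. The package, provider and paths are illustrative and not taken from the example file; the class path points at the mount path defined under the driver/executor volume mounts:

[source,yaml]
----
spec:
  deps:
    requirements:
      - tabulate==0.8.9      # illustrative python package, installed in the pods via pip
  sparkConf:
    # credentials provider: the user chooses whichever provider is appropriate for the S3 access
    spark.hadoop.fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
    # extra class path pointing at jars on the mounted volume (illustrative path)
    spark.driver.extraClassPath: /dependencies/jars/*
    spark.executor.extraClassPath: /dependencies/jars/*
----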
link:example$example-sparkapp-image.yaml[role=include]
- Job image: this contains the job artifact that will be retrieved from the volume mount backed by the PVC
- Job python artifact (local)
- Job argument (external)
- List of python job requirements: these will be installed in the pods via `pip`
- Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)
- The name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
- The path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors
link:example$example-sparkapp-pvc.yaml[role=include]
- Job artifact located on S3.
- Job main class
- Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials)
- The name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
- The path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors
link:example$example-sparkapp-s3-private.yaml[role=include]
- Job python artifact (located in an S3 store)
- Artifact class
- S3 section, specifying the existing secret and S3 end-point (in this case, MinIO)
- Credentials referencing a secretClass (not shown in this example)
- Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources…
- …in this case, in an S3 store, accessed with the credentials defined in the secret
- The name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
- The path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors
link:example$example-configmap.yaml[role=include]
link:example$example-sparkapp-configmap.yaml[role=include]
- Name of the configuration map
- Argument required by the job
- Job Scala artifact that requires an input argument
- The volume backed by the configuration map
- The expected job argument, accessed via the mounted configuration map file
- The name of the volume backed by the configuration map that will be mounted to the driver/executor
- The mount location of the volume (this will contain a file `/arguments/job-args.txt`); a sketch of this wiring follows below
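A rough sketch of the two pieces, the ConfigMap and the corresponding volume and volume mount in the `SparkApplication`, is shown below. The ConfigMap name, volume name and argument value are placeholders; the mount path and file name are those mentioned above:

[source,yaml]
----
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-job-arguments          # placeholder ConfigMap name
data:
  # the file appears under the mount path and holds a placeholder job argument
  job-args.txt: |
    s3a://my-bucket/data/input.csv
----

[source,yaml]
----
spec:
  volumes:
    - name: cm-job-arguments      # volume backed by the ConfigMap
      configMap:
        name: cm-job-arguments
  driver:
    volumeMounts:
      - name: cm-job-arguments
        mountPath: /arguments     # the job reads /arguments/job-args.txt
----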
You can specify S3 connection details directly inside the `SparkApplication` specification or by referring to an external `S3Bucket` custom resource.
To specify S3 connection details directly as part of the `SparkApplication` resource, add an inline bucket configuration as shown below.
[source,yaml]
----
s3bucket: # (1)
  inline:
    bucketName: my-bucket # (2)
    connection:
      inline:
        host: test-minio # (3)
        port: 9000 # (4)
        accessStyle: Path
        credentials:
          secretClass: s3-credentials-class # (5)
----
- Entry point for the bucket configuration.
- Bucket name.
- Bucket host.
- Optional bucket port.
- Name of the `Secret` object expected to contain the following keys: `ACCESS_KEY_ID` and `SECRET_ACCESS_KEY`.
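The `secretClass` above refers to a Stackable SecretClass. The following sketch shows what the class and a matching `Secret` might look like, assuming the `k8sSearch` backend of the Stackable secret operator and the key names listed above; the `Secret` name and credential values are placeholders:

[source,yaml]
----
---
apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
  name: s3-credentials-class
spec:
  backend:
    k8sSearch:
      searchNamespace:
        pod: {}                    # look up the Secret in the namespace of the requesting pod
---
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials             # placeholder Secret name
  labels:
    secrets.stackable.tech/class: s3-credentials-class   # associates the Secret with the SecretClass
stringData:
  ACCESS_KEY_ID: my-access-key     # placeholder credentials
  SECRET_ACCESS_KEY: my-secret-key
----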
It is also possible to configure the bucket connection details as a separate Kubernetes resource and only refer to that object from the `SparkApplication`, like this:
[source,yaml]
----
s3bucket:
  reference: my-bucket-resource # (1)
----
- Name of the bucket resource with connection details.
The resource named `my-bucket-resource` is then defined as shown below:
[source,yaml]
----
---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Bucket
metadata:
  name: my-bucket-resource
spec:
  bucketName: my-bucket-name
  connection:
    inline:
      host: test-minio
      port: 9000
      accessStyle: Path
      credentials:
        secretClass: minio-credentials-class
----
This has the advantage that bucket configuration can be shared across `SparkApplication`s and reduces the cost of updating these details.
Below are listed the CRD fields that can be defined by the user (an illustrative fragment using some of these fields follows the table):
|===
|CRD field |Remarks

|`apiVersion`
|`spark.stackable.tech/v1alpha1`

|`kind`
|`SparkApplication`

|`metadata.name`
|Job name

|`spec.version`
|"1.0"

|`spec.mode`
|`cluster` (currently the only supported mode)

|`spec.image`
|User-supplied image containing spark-job dependencies that will be copied to the specified volume mount

|`spec.sparkImage`
|Spark image which will be deployed to driver and executor pods, and which must contain the Spark environment needed by the job (e.g. the Stackable `spark-k8s` image)

|`spec.sparkImagePullPolicy`
|Optional Enum (one of `Always`, `IfNotPresent` or `Never`) that determines the pull policy of the Spark image

|`spec.sparkImagePullSecrets`
|An optional list of references to secrets in the same namespace to use for pulling any of the images used by a `SparkApplication` resource

|`spec.mainApplicationFile`
|The actual application file that will be called by `spark-submit`

|`spec.mainClass`
|The main class, i.e. the entry point for JVM artifacts

|`spec.args`
|Arguments passed directly to the job artifact

|`spec.s3bucket`
|S3 bucket and connection specification. See the S3 bucket specification above for more details.

|`spec.sparkConf`
|A map of key/value strings that will be passed directly to `spark-submit`

|`spec.deps.requirements`
|A list of python packages that will be installed via `pip`

|`spec.deps.packages`
|A list of packages that is passed directly to `spark-submit`

|`spec.deps.excludePackages`
|A list of excluded packages that is passed directly to `spark-submit`

|`spec.deps.repositories`
|A list of repositories that is passed directly to `spark-submit`

|`spec.volumes`
|A list of volumes

|`spec.volumes.name`
|The volume name

|`spec.volumes.persistentVolumeClaim.claimName`
|The persistent volume claim backing the volume

|`spec.driver.cores`
|Number of cores used by the driver (only in cluster mode)

|`spec.driver.coreLimit`
|Total cores for all executors

|`spec.driver.memory`
|Specified memory for the driver

|`spec.driver.volumeMounts`
|A list of mounted volumes for the driver

|`spec.driver.volumeMounts.name`
|Name of mount

|`spec.driver.volumeMounts.mountPath`
|Volume mount path

|`spec.driver.nodeSelector`
|A dictionary of labels to use for node selection when scheduling the driver. N.B. this assumes there are no implicit node dependencies (e.g. `PVC`, `VolumeMount`) defined elsewhere.

|`spec.executor.cores`
|Number of cores for each executor

|`spec.executor.instances`
|Number of executor instances launched for this job

|`spec.executor.memory`
|Memory specified for executor

|`spec.executor.volumeMounts`
|A list of mounted volumes for each executor

|`spec.executor.volumeMounts.name`
|Name of mount

|`spec.executor.volumeMounts.mountPath`
|Volume mount path

|`spec.executor.nodeSelector`
|A dictionary of labels to use for node selection when scheduling the executors. N.B. this assumes there are no implicit node dependencies (e.g. `PVC`, `VolumeMount`) defined elsewhere.
|===
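To illustrate some of the less common fields from the table, the following fragment sketches image pull policy, additional dependencies, resource settings and node selection. All values are placeholders and not taken from the examples above:

[source,yaml]
----
spec:
  sparkImagePullPolicy: IfNotPresent          # one of Always, IfNotPresent or Never
  deps:
    packages:
      - org.apache.hadoop:hadoop-aws:3.3.1    # placeholder Maven coordinate passed to spark-submit
    repositories:
      - https://repo.example.com/maven        # placeholder additional repository
  driver:
    cores: 1
    memory: "512m"
    nodeSelector:
      node-type: spark-driver                 # placeholder node label
  executor:
    cores: 1
    instances: 3
    memory: "512m"
    nodeSelector:
      node-type: spark-executor               # placeholder node label
----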