If you followed the installation instructions, you should now have a Stackable Operator for Apache Spark up and running, and you are ready to create your first Apache Spark application on Kubernetes.

The example below creates a job running on Apache Spark 3.2.1, using the spark-on-kubernetes paradigm described in the Spark documentation. The application file is itself part of the Spark distribution, and `local:` refers to the path on the driver/executor pods; there are no external dependencies.
[source,bash]
----
cat <<EOF | kubectl apply -f -
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: spark-clustermode-001
spec:
  version: 3.2.1-hadoop3.2
  mode: cluster
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///stackable/spark/examples/jars/spark-examples_2.12-3.2.1.jar
  image: 3.2.1-hadoop3.2
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
  executor:
    cores: 1
    instances: 3
    memory: "512m"
EOF
----
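Once applied, you can follow the application's progress with `kubectl`. The commands below are a minimal sketch, assuming the default namespace; the generated driver and executor pod names will vary, so substitute the names that `kubectl get pods` reports:

[source,bash]
----
# Show the SparkApplication resource created above
kubectl get sparkapplications.spark.stackable.tech

# Watch the driver and executor pods being scheduled and completing
kubectl get pods -w

# Inspect the driver log once the driver pod is running (substitute the real pod name)
kubectl logs <driver-pod-name>
----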
The following examples have the following `spec` fields in common (a skeleton manifest combining them is sketched after this list):

- `version`: the current version is "1.0"
- `sparkImage`: the docker image that will be used by job, driver and executor pods. This can be provided by the user.
- `mode`: only `cluster` is currently supported
- `mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
- `args`: these are the arguments passed directly to the application. In the examples below this is e.g. the input path for part of the public New York taxi dataset.
- `sparkConf`: these list Spark configuration settings that are passed directly to `spark-submit` and which are best defined explicitly by the user. Since the `SparkApplication` "knows" that there is an external dependency (the S3 bucket where the data and/or the application is located) and how that dependency should be treated (i.e. what type of credential checks are required, if any), it is better to have these things declared together.
- `volumes`: refers to any volumes needed by the `SparkApplication`, in this case an underlying `PersistentVolumeClaim`.
- `driver`: driver-specific settings, including any volume mounts.
- `executor`: executor-specific settings, including any volume mounts.
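As a rough sketch, and not a manifest to run as-is, the common fields above fit together like this; every value shown (name, image tag, bucket paths, configuration keys) is a placeholder to be replaced with your own:

[source,yaml]
----
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp                                # placeholder job name
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/spark-k8s:3.2.1-hadoop3.2  # placeholder image tag
  mode: cluster
  mainApplicationFile: s3a://my-bucket/jobs/my-job.py   # placeholder artifact location
  args:
    - "s3a://my-bucket/ny-taxi-data/"                   # placeholder job argument
  sparkConf:
    spark.hadoop.fs.s3a.endpoint: http://minio:9000     # placeholder S3 endpoint setting
  driver:
    cores: 1
    memory: "512m"
  executor:
    cores: 1
    instances: 3
    memory: "512m"
----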
Job-specific settings are annotated below.
link:example$example-sparkapp-external-dependencies.yaml[role=include]
- Job python artifact (external)
- Job argument (external)
- List of python job requirements: these will be installed in the pods via `pip`
- Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in S3)
- The name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
- The path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors (a sketch of this pattern follows below)
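The volume-mount pattern referenced in the last two callouts looks roughly like the fragment below; the volume name, claim name and mount path are placeholders, and the `PersistentVolumeClaim` must already exist:

[source,yaml]
----
spec:
  sparkConf:
    # the extra class path entries point at the mount path defined further down
    spark.driver.extraClassPath: /dependencies/jars/*
    spark.executor.extraClassPath: /dependencies/jars/*
  volumes:
    - name: job-deps                  # placeholder volume name
      persistentVolumeClaim:
        claimName: pvc-job-deps       # placeholder, pre-existing claim
  driver:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies      # path referenced by the extra class path above
  executor:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies
----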
link:example$example-sparkapp-image.yaml[role=include]
- Job image: this contains the job artifact that will be retrieved from the volume mount backed by the PVC (see the sketch after this list)
- Job python artifact (local)
- Job argument (external)
- List of python job requirements: these will be installed in the pods via `pip`
- Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in S3)
- The name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
- The path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors
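A rough sketch of the job-image idea described in the first two callouts, with a hypothetical image name and path; `image` supplies the job artifact and `mainApplicationFile` then points to a `local:` path inside the pods:

[source,yaml]
----
spec:
  image: docker.example.com/my-org/my-spark-job:1.0   # placeholder user-supplied job image
  mainApplicationFile: local:///jobs/my_job.py         # placeholder path to the artifact shipped in the image
----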
link:example$example-sparkapp-pvc.yaml[role=include]
- Job artifact located on S3
- Job main class
- Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in S3, accessed without credentials; see the sketch after this list)
- The name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
- The path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors
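For the credentials-free S3 access mentioned in the third callout, the relevant settings look roughly like this. The paths and class name are placeholders, and `org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider` is the stock Hadoop S3A provider for unauthenticated access, used here as an assumption about what the example relies on:

[source,yaml]
----
spec:
  mainApplicationFile: s3a://my-bucket/jobs/my-job.jar   # placeholder: job artifact located on S3
  mainClass: org.example.MyJob                           # placeholder main class
  sparkConf:
    # read the bucket without credentials
    spark.hadoop.fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
----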
link:example$example-sparkapp-s3-private.yaml[role=include]
- Job python artifact (located in S3)
- Artifact class
- S3 section, specifying the existing secret and S3 end-point (in this case, MinIO; see the sketch after this list)
- Credentials secret
- Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources…
- …in this case, in S3, accessed with the credentials defined in the secret
- The name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
- The path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors
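A rough sketch of the S3 section mentioned in the callouts, using the `credentialsSecret` and `endpoint` fields listed in the CRD table further down; the secret name and endpoint are placeholders and the secret must already exist:

[source,yaml]
----
spec:
  s3:
    credentialsSecret: minio-credentials   # placeholder: pre-existing Secret holding the access/secret keys
    endpoint: http://minio:9000            # placeholder: MinIO endpoint
----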
link:example$example-configmap.yaml[role=include]
link:example$example-sparkapp-configmap.yaml[role=include]
- Name of the configuration map
- Argument required by the job
- Job scala artifact that requires an input argument
- The volume backed by the configuration map
- The expected job argument, accessed via the mounted configuration map file
- The name of the volume backed by the configuration map that will be mounted to the driver/executor
- The mount location of the volume (this will contain a file `/arguments/job-args.txt`); a sketch of this mount follows below
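The ConfigMap-backed mount described above follows the standard Kubernetes volume pattern; a minimal sketch with placeholder names, where only the resulting file location `/arguments/job-args.txt` is taken from the callouts:

[source,yaml]
----
spec:
  volumes:
    - name: job-args                  # placeholder volume name
      configMap:
        name: cm-job-arguments        # placeholder ConfigMap containing the key job-args.txt
  driver:
    volumeMounts:
      - name: job-args
        mountPath: /arguments         # the job reads /arguments/job-args.txt
  executor:
    volumeMounts:
      - name: job-args
        mountPath: /arguments
----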
Below are listed the CRD fields that can be defined by the user:
|===
|CRD field |Remarks

|`apiVersion`
|`spark.stackable.tech/v1alpha1`

|`kind`
|`SparkApplication`

|`metadata.name`
|Job name

|`spec.version`
|"1.0"

|`spec.mode`
|Only `cluster` is currently supported

|`spec.image`
|User-supplied image containing spark-job dependencies that will be copied to the specified volume mount

|`spec.sparkImage`
|Spark image which will be deployed to driver and executor pods, and which must contain the Spark environment needed by the job

|`spec.mainApplicationFile`
|The actual application file that will be called by `spark-submit`

|`spec.mainClass`
|The main class, i.e. the entry point for JVM artifacts

|`spec.args`
|Arguments passed directly to the job artifact

|`spec.s3.credentialsSecret`
|Name of the credentials secret for S3 access

|`spec.s3.endpoint`
|S3 endpoint

|`spec.sparkConf`
|A map of key/value strings that will be passed directly to `spark-submit`

|`spec.deps.requirements`
|A list of python packages that will be installed via `pip`

|`spec.deps.packages`
|A list of packages that is passed directly to `spark-submit`

|`spec.deps.excludePackages`
|A list of excluded packages that is passed directly to `spark-submit`

|`spec.deps.repositories`
|A list of repositories that is passed directly to `spark-submit`

|`spec.volumes`
|A list of volumes

|`spec.volumes.name`
|The volume name

|`spec.volumes.persistentVolumeClaim.claimName`
|The persistent volume claim backing the volume

|`spec.driver.cores`
|Number of cores used by the driver (only in cluster mode)

|`spec.driver.coreLimit`
|Total cores for all executors

|`spec.driver.memory`
|Specified memory for the driver

|`spec.driver.volumeMounts`
|A list of mounted volumes for the driver

|`spec.driver.volumeMounts.name`
|Name of mount

|`spec.driver.volumeMounts.mountPath`
|Volume mount path

|`spec.executor.cores`
|Number of cores for each executor

|`spec.executor.instances`
|Number of executor instances launched for this job

|`spec.executor.memory`
|Memory specified for executor

|`spec.executor.volumeMounts`
|A list of mounted volumes for each executor

|`spec.executor.volumeMounts.name`
|Name of mount

|`spec.executor.volumeMounts.mountPath`
|Volume mount path
|===