
Commit 16297d2

razvan and fhennig authored
docs: provision spark dependencies (#409)
* docs: provision spark dependencies
* remove redundant overview text
* remove unused image
* spelling
* Update docs/modules/spark-k8s/pages/usage-guide/job-dependencies.adoc (Co-authored-by: Felix Hennig <[email protected]>)
* Update docs/modules/spark-k8s/pages/usage-guide/job-dependencies.adoc (Co-authored-by: Felix Hennig <[email protected]>)
* Update docs/modules/spark-k8s/pages/usage-guide/job-dependencies.adoc (Co-authored-by: Felix Hennig <[email protected]>)
* Review feedback
* fix language lints

---------

Co-authored-by: Felix Hennig <[email protected]>
1 parent 436ff43 commit 16297d2

File tree

4 files changed (+74 / -79 lines)


docs/modules/spark-k8s/examples/example-encapsulated.yaml

Lines changed: 0 additions & 13 deletions
This file was deleted.

docs/modules/spark-k8s/examples/example-sparkapp-external-dependencies.yaml

Lines changed: 0 additions & 35 deletions
This file was deleted.
Binary file not shown (-206 KB): the deleted `spark-k8s.png` diagram that is no longer referenced from the page.

docs/modules/spark-k8s/pages/usage-guide/job-dependencies.adoc

Lines changed: 74 additions & 31 deletions
@@ -3,33 +3,32 @@

== Overview

-IMPORTANT: With the platform release 23.4.1 (and all previous releases), dynamic provisioning of dependencies using the Spark `packages` field doesn't work. This is a known problem with Spark and is tracked https://github.com/stackabletech/spark-k8s-operator/issues/141[here].
+IMPORTANT: With the platform release 23.4.1 and Apache Spark 3.3.x (and all previous releases), dynamic provisioning of dependencies using the Spark `packages` field doesn't work.
+This is a known problem with Spark and is tracked https://github.com/stackabletech/spark-k8s-operator/issues/141[here].

-The Stackable Spark-on-Kubernetes operator enables users to run Apache Spark workloads in a Kubernetes cluster easily by eliminating the requirement of having a local Spark installation. For this purpose, Stackble provides ready made Docker images with recent versions of Apache Spark and Python - for PySpark jobs - that provide the basis for running those workloads. Users of the Stackable Spark-on-Kubernetes operator can run their workloads on any recent Kubernetes cluster by applying a `SparkApplication` custom resource in which the job code, job dependencies, input and output data locations can be specified. The Stackable operator translates the user's `SparkApplication` manifest into a Kubernetes `Job` object and handles control to the Apache Spark scheduler for Kubernetes to construct the necessary driver and executor `Pods`.
-
-image::spark-k8s.png[Job Flow]
-
-When the job is finished, the `Pods` are terminated and the Kubernetes `Job` is completed.
-
-The base images provided by Stackable contain only the minimum of components to run Spark workloads. This is done mostly for performance and compatibility reasons. Many Spark workloads build on top of third party libraries and frameworks and thus depend on additional packages that are not included in the Stackable images. This guide explains how users can provision their Spark jobs with additional dependencies.
+The container images provided by Stackable include Apache Spark and PySpark applications and libraries.
+In addition, they include commonly used libraries to connect to storage systems supporting the `hdfs://`, `s3a://` and `abfs://` protocols. These systems are commonly used to store data processed by Spark applications.

+Sometimes the applications need to integrate with additional systems or use processing algorithms not included in the Apache Spark distribution.
+This guide explains how you can provision your Spark jobs with additional dependencies to support these requirements.

== Dependency provisioning

There are multiple ways to submit Apache Spark jobs with external dependencies. Each has its own advantages and disadvantages and the choice of one over the other depends on existing technical and managerial constraints.

-To provision job dependencies in their workloads, users have to construct their `SparkApplication` with one of the following dependency specifications:
+To provision job dependencies in Spark workloads, you construct the `SparkApplication` with one of the following dependency specifications:

-- Hardened or encapsulated job images
-- Dependency volumes
-- Spark native package coordinates and Python requirements
+* Custom Spark images
+* Dependency volumes
+* Maven/Java packages
+* Python packages

The following table provides a high level overview of the relevant aspects of each method.

|===
|Dependency specification |Job image size |Reproducibility |Dev-op cost

-|Encapsulated job images
+|Custom Spark images
|Large
|Guaranteed
|Medium to High
@@ -39,30 +38,54 @@ The following table provides a high level overview of the relevant aspects of ea
|Guaranteed
|Small to Medium

-|Spark and Python packages
+|Maven/Java packages
+|Small
+|Not guaranteed
+|Small
+
+|Python packages
|Small
-|Not guranteed
+|Not guaranteed
|Small
|===

-=== Hardened or encapsulated job images
+=== Custom Spark images
+
+With this method, you submit a `SparkApplication` for which the `sparkImage` refers to the full custom image name. It is recommended to start the custom image from one of the Stackable images to ensure compatibility with the Stackable operator.
+
+Below is an example of a custom image that includes a JDBC driver:

-With this method, users submit a `SparkApplication` for which the `sparkImage` refers to a Docker image containing Apache Spark itself, the job code and dependencies required by the job. It is recommended the users base their image on one of the Stackable images to ensure compatibility with the Stackable operator.
+[source, Dockerfile]
+----
+FROM docker.stackable.tech/stackable/spark-k8s:3.5.1-stackable24.3.0 # <1>
+
+RUN curl --fail -o /stackable/spark/jars/postgresql-42.6.0.jar "https://jdbc.postgresql.org/download/postgresql-42.6.0.jar"
+----

-Since all packages required to run the Spark job are bundled in the image, the size of this image tends to get very large while at the same time guaranteeing reproducibility between submissions.
+<1> Start from an existing Stackable image.

-Example:
+And the following snippet showcases an application that uses the custom image:

[source, yaml]
----
-include::example$example-encapsulated.yaml[]
+apiVersion: spark.stackable.tech/v1alpha1
+kind: SparkApplication
+metadata:
+  name: spark-jdbc
+spec:
+  sparkImage:
+    custom: "docker.stackable.tech/sandbox/spark-k8s:3.5.1-stackable0.0.0-dev" # <1>
+    productVersion: "3.5.1" # <2>
+    pullPolicy: IfNotPresent # <3>
+...
----
-<1> Name of the encapsulated image.
-<2> Name of the Spark job to run.
+<1> Name of the custom image.
+<2> Apache Spark version. Needed for the operator to take the correct actions.
+<3> Optional. Defaults to `Always`.

=== Dependency volumes

-With this method, the user provisions the job dependencies from a `PersistentVolume` as shown in this example:
+With this method, the job dependencies are provisioned from a `PersistentVolume` as shown in this example:

[source,yaml]
----
@@ -74,7 +97,7 @@ include::example$example-sparkapp-pvc.yaml[]
<4> the name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
<5> the path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors
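The example file `example-sparkapp-pvc.yaml` referenced by the include above is not part of this diff. For orientation only, the fields that callouts <4> and <5> describe might look roughly like the sketch below; the volume name, claim name, mount path and the exact nesting of `volumeMounts` under the driver and executor sections are assumptions, not the actual contents of that file.

[source,yaml]
----
# Illustrative sketch only: names, paths and field locations are assumptions,
# not the contents of example-sparkapp-pvc.yaml.
spec:
  sparkConf:
    spark.driver.extraClassPath: /dependencies/jars/*    # extra class path for the driver
    spark.executor.extraClassPath: /dependencies/jars/*  # extra class path for the executors
  volumes:
    - name: job-deps                 # volume backed by a pre-existing PersistentVolumeClaim
      persistentVolumeClaim:
        claimName: pvc-ksv
  driver:
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies   # the path referenced by the extra class path above
  executor:
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies
----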

-NOTE: The Spark operator has no control over the contents of the dependency volume. It is the responsibility of the user to make sure all required dependencies are installed in the correct versions.
+NOTE: The Spark operator has no control over the contents of the dependency volume. It is your responsibility to make sure all required dependencies are installed in the correct versions.

A `PersistentVolumeClaim` and the associated `PersistentVolume` can be defined like this:

@@ -88,14 +111,18 @@ include::example$example-pvc.yaml[]
<4> Defines the `VolumeMount` that is used by the Custom Resource
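The included `example-pvc.yaml` is likewise not shown in this diff. A `PersistentVolume`/`PersistentVolumeClaim` pair of the kind the text describes could be sketched as follows; the resource names, capacity and `hostPath` backing are placeholders rather than the actual example shipped with the docs.

[source,yaml]
----
# Illustrative sketch only: not the contents of example-pvc.yaml.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-ksv                # placeholder name
spec:
  capacity:
    storage: 1Gi              # assumed size
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /some/host/path     # placeholder backing store holding the dependency jars
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-ksv               # claim name referenced by the SparkApplication volume
spec:
  volumeName: pv-ksv
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
----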


-=== Spark native package coordinates and Python requirements
+=== Maven packages

The last and most flexible way to provision dependencies is to use the built-in `spark-submit` support for Maven package coordinates.

The snippet below showcases how to add Apache Iceberg support to a Spark (version 3.4.x) application.

[source,yaml]
----
+apiVersion: spark.stackable.tech/v1alpha1
+kind: SparkApplication
+metadata:
+  name: spark-iceberg
spec:
  sparkConf:
    spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
@@ -106,20 +133,36 @@ spec:
    spark.sql.catalog.local.warehouse: /tmp/warehouse
  deps:
    packages:
-      - org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1
+      - org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 # <1>
+...
----

-IMPORTANT: Currently it's not possible to provision dependencies that are loaded by the JVM's (system class loader)[https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/ClassLoader.html#getSystemClassLoader()]. Such dependencies include JDBC drivers. If you need access to JDBC sources from your Spark application, consider building your own custom Spark image.
+<1> Maven package coordinates for Apache Iceberg. This is downloaded from the Maven repository and made available to the Spark application.

-IMPORTANT: Spark version 3.3.x has a https://issues.apache.org/jira/browse/SPARK-35084[known bug] that prevents this mechanism to work.
+IMPORTANT: Currently it's not possible to provision dependencies that are loaded by the JVM's https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/ClassLoader.html#getSystemClassLoader()[system class loader].
+Such dependencies include JDBC drivers.
+If you need access to JDBC sources from your Spark application, consider building your own custom Spark image as shown above.

-When submitting PySpark jobs, users can specify `pip` requirements that are installed before the driver and executor pods are created.
+=== Python packages
+
+When submitting PySpark jobs, users can specify additional Python requirements that are installed before the driver and executor pods are created.

Here is an example:

[source,yaml]
----
-include::example$example-sparkapp-external-dependencies.yaml[]
+apiVersion: spark.stackable.tech/v1alpha1
+kind: SparkApplication
+metadata:
+  name: pyspark-report
+spec:
+  mainApplicationFile: /app/run.py # <1>
+  deps:
+    requirements:
+      - tabulate==0.8.9 # <2>
+...
----

-Note the section `requirements`. Also note that in this case, a `sparkImage` that bundles Python has to be provisioned.
+<1> The main application file. In this example it is assumed that the file is part of a custom image.
+<2> A Python package that is used by the application and installed when the application is submitted.
+