docs/modules/spark-k8s/pages/usage-guide/job-dependencies.adoc
Lines changed: 74 additions & 31 deletions
Original file line number
Diff line number
Diff line change
@@ -3,33 +3,32 @@
3
3
4
4
== Overview
5
5
6
-
IMPORTANT: With the platform release 23.4.1 (and all previous releases), dynamic provisioning of dependencies using the Spark `packages` field doesn't work. This is a known problem with Spark and is tracked https://github.com/stackabletech/spark-k8s-operator/issues/141[here].
6
+
IMPORTANT: With the platform release 23.4.1 and Apache Spark 3.3.x (and all previous releases), dynamic provisioning of dependencies using the Spark `packages` field doesn't work.
7
+
This is a known problem with Spark and is tracked https://github.com/stackabletech/spark-k8s-operator/issues/141[here].
7
8
8
-
The Stackable Spark-on-Kubernetes operator enables users to run Apache Spark workloads in a Kubernetes cluster easily by eliminating the requirement of having a local Spark installation. For this purpose, Stackble provides ready made Docker images with recent versions of Apache Spark and Python - for PySpark jobs - that provide the basis for running those workloads. Users of the Stackable Spark-on-Kubernetes operator can run their workloads on any recent Kubernetes cluster by applying a `SparkApplication` custom resource in which the job code, job dependencies, input and output data locations can be specified. The Stackable operator translates the user's `SparkApplication` manifest into a Kubernetes `Job` object and handles control to the Apache Spark scheduler for Kubernetes to construct the necessary driver and executor `Pods`.
9
-
10
-
image::spark-k8s.png[Job Flow]
11
-
12
-
When the job is finished, the `Pods` are terminated and the Kubernetes `Job` is completed.
13
-
14
-
The base images provided by Stackable contain only the minimum of components to run Spark workloads. This is done mostly for performance and compatibility reasons. Many Spark workloads build on top of third party libraries and frameworks and thus depend on additional packages that are not included in the Stackable images. This guide explains how users can provision their Spark jobs with additional dependencies.
9
+
The container images provided by Stackable include Apache Spark and PySpark applications and libraries.
10
+
In addition, they include commonly used libraries to connect to storage systems supporting the `hdfs://`, `s3a://` and `abfs://` protocols. These systems are commonly used to store data processed by Spark applications.
15
11
12
+
Sometimes the applications need to integrate with additional systems or use processing algorithms not included in the Apache Spark distribution.
13
+
This guide explains how you can provision your Spark jobs with additional dependencies to support these requirements.
16
14
17
15
== Dependency provisioning
18
16
19
17
There are multiple ways to submit Apache Spark jobs with external dependencies. Each has its own advantages and disadvantages and the choice of one over the other depends on existing technical and managerial constraints.
20
18
21
-
To provision job dependencies in their workloads, users have to construct their `SparkApplication` with one of the following dependency specifications:
19
+
To provision job dependencies in Spark workloads, you construct the `SparkApplication` with one of the following dependency specifications:
22
20
23
-
- Hardened or encapsulated job images
24
-
- Dependency volumes
25
-
- Spark native package coordinates and Python requirements
21
+
* Custom Spark images
22
+
* Dependency volumes
23
+
* Maven/Java packages
24
+
* Python packages
26
25
27
26
The following table provides a high level overview of the relevant aspects of each method.
@@ -39,30 +38,54 @@ The following table provides a high level overview of the relevant aspects of ea
39
38
|Guaranteed
40
39
|Small to Medium
41
40
42
-
|Spark and Python packages
41
+
|Maven/Java packages
42
+
|Small
43
+
|Not guaranteed
44
+
|Small
45
+
46
+
|Python packages
43
47
|Small
44
-
|Not guranteed
48
+
|Not guaranteed
45
49
|Small
46
50
|===
47
51
48
-
=== Hardened or encapsulated job images
52
+
=== Custom Spark images
53
+
54
+
With this method, you submit a `SparkApplication` for which the `sparkImage` refers to the full custom image name. It is recommended to start the custom image from one of the Stackable images to ensure compatibility with the Stackable operator.
55
+
56
+
Below is an example of a custom image that includes a JDBC driver:
49
57
50
-
With this method, users submit a `SparkApplication` for which the `sparkImage` refers to a Docker image containing Apache Spark itself, the job code and dependencies required by the job. It is recommended the users base their image on one of the Stackable images to ensure compatibility with the Stackable operator.
58
+
[source, Dockerfile]
59
+
----
60
+
FROM docker.stackable.tech/stackable/spark-k8s:3.5.1-stackable24.3.0 # <1>
61
+
62
+
RUN curl --fail -o /stackable/spark/jars/postgresql-42.6.0.jar "https://jdbc.postgresql.org/download/postgresql-42.6.0.jar"
63
+
----
51
64
52
-
Since all packages required to run the Spark job are bundled in the image, the size of this image tends to get very large while at the same time guaranteeing reproducibility between submissions.
65
+
<1> Start from an existing Stackable image.
53
66
54
-
Example:
67
+
And the following snippet showcases an application that uses the custom image:
<4> the name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
75
98
<5> the path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors
76
99
77
-
NOTE: The Spark operator has no control over the contents of the dependency volume. It is the responsibility of the user to make sure all required dependencies are installed in the correct versions.
100
+
NOTE: The Spark operator has no control over the contents of the dependency volume. It is your responsibility to make sure all required dependencies are installed in the correct versions.
78
101
79
102
A `PersistentVolumeClaim` and the associated `PersistentVolume` can be defined like this:
IMPORTANT: Currently it's not possible to provision dependencies that are loaded by the JVM's (system class loader)[https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/ClassLoader.html#getSystemClassLoader()]. Such dependencies include JDBC drivers. If you need access to JDBC sources from your Spark application, consider building your own custom Spark image.
140
+
<1> Maven package coordinates for Apache Iceberg. The package is downloaded from the Maven repository and made available to the Spark application.
113
141
114
-
IMPORTANT: Spark version 3.3.x has a https://issues.apache.org/jira/browse/SPARK-35084[known bug] that prevents this mechanism to work.
142
+
IMPORTANT: Currently it's not possible to provision dependencies that are loaded by the JVM's https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/ClassLoader.html#getSystemClassLoader()[system class loader].
143
+
Such dependencies include JDBC drivers.
144
+
If you need access to JDBC sources from your Spark application, consider building your own custom Spark image as shown above.
115
145
116
-
When submitting PySpark jobs, users can specify `pip` requirements that are installed before the driver and executor pods are created.
146
+
=== Python packages
147
+
148
+
When submitting PySpark jobs, you can specify additional Python requirements that are installed before the driver and executor pods are created.
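
As a rough illustration of the Python requirements mechanism described above, a `SparkApplication` might declare its requirements along the following lines. This is a minimal sketch, not taken from this change: the `spec.deps.requirements` field layout is assumed from the operator's custom resource definition and should be checked against the CRD reference for your release, and the application name, file path and package pin are hypothetical placeholders.

[source,yaml]
----
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-with-requirements # hypothetical application name
spec:
  sparkImage:
    productVersion: "3.5.1" # assumed to match the Stackable image version used elsewhere in this guide
  mode: cluster
  mainApplicationFile: local:///stackable/spark/jobs/report.py # hypothetical job location
  deps:
    requirements: # assumed field: a list of pip-style requirement specifiers
      - tabulate==0.8.9 # illustrative package pin
----

Pinning an exact version, as in the sketch, helps narrow the reproducibility gap that the overview table flags as "Not guaranteed" for this method, because packages are resolved at submission time rather than baked into the image.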