
Commit 397ae52

Author: Felix Hennig
Commit message: usage guide restructuring
1 parent bd19c4a

File tree: 12 files changed, +318 -352 lines
Lines changed: 106 additions & 0 deletions (new file)

= CRD reference

The following CRD fields can be defined by the user:

|===
|CRD field |Remarks

|`apiVersion`
|`spark.stackable.tech/v1alpha1`

|`kind`
|`SparkApplication`

|`metadata.name`
|Job name

|`spec.version`
|"1.0"

|`spec.mode`
|`cluster` or `client`. Currently only `cluster` is supported.

|`spec.image`
|User-supplied image containing spark-job dependencies that will be copied to the specified volume mount.

|`spec.sparkImage`
|Spark image which will be deployed to driver and executor pods, and which must contain the Spark environment needed by the job, e.g. `docker.stackable.tech/stackable/spark-k8s:3.3.0-stackable0.3.0`.

|`spec.sparkImagePullPolicy`
|Optional enum (one of `Always`, `IfNotPresent` or `Never`) that determines the pull policy of the Spark job image.

|`spec.sparkImagePullSecrets`
|An optional list of references to secrets in the same namespace to use for pulling any of the images used by a `SparkApplication` resource. Each reference has a single property (`name`) that must contain a reference to a valid secret.

|`spec.mainApplicationFile`
|The actual application file that will be called by `spark-submit`.

|`spec.mainClass`
|The main class, i.e. the entry point for JVM artifacts.

|`spec.args`
|Arguments passed directly to the job artifact.

|`spec.s3connection`
|S3 connection specification. See the <<S3 bucket specification>> for more details.

|`spec.sparkConf`
|A map of key/value strings that will be passed directly to `spark-submit`.

|`spec.deps.requirements`
|A list of python packages that will be installed via `pip`.

|`spec.deps.packages`
|A list of packages that is passed directly to `spark-submit`.

|`spec.deps.excludePackages`
|A list of excluded packages that is passed directly to `spark-submit`.

|`spec.deps.repositories`
|A list of repositories that is passed directly to `spark-submit`.

|`spec.volumes`
|A list of volumes.

|`spec.volumes.name`
|The volume name.

|`spec.volumes.persistentVolumeClaim.claimName`
|The persistent volume claim backing the volume.

|`spec.job.resources`
|Resources specification for the initiating Job.

|`spec.driver.resources`
|Resources specification for the driver Pod.

|`spec.driver.volumeMounts`
|A list of mounted volumes for the driver.

|`spec.driver.volumeMounts.name`
|Name of mount.

|`spec.driver.volumeMounts.mountPath`
|Volume mount path.

|`spec.driver.nodeSelector`
|A dictionary of labels to use for node selection when scheduling the driver. N.B. this assumes there are no implicit node dependencies (e.g. `PVC`, `VolumeMount`) defined elsewhere.

|`spec.executor.resources`
|Resources specification for the executor Pods.

|`spec.executor.instances`
|Number of executor instances launched for this job.

|`spec.executor.volumeMounts`
|A list of mounted volumes for each executor.

|`spec.executor.volumeMounts.name`
|Name of mount.

|`spec.executor.volumeMounts.mountPath`
|Volume mount path.

|`spec.executor.nodeSelector`
|A dictionary of labels to use for node selection when scheduling the executors. N.B. this assumes there are no implicit node dependencies (e.g. `PVC`, `VolumeMount`) defined elsewhere.
|===
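
For orientation, here is a minimal `SparkApplication` manifest sketch that combines the fields listed above. The job name, artifact location, main class and arguments are hypothetical placeholders; the Spark image is the example value from the table.

[source,yaml]
----
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: my-spark-job                                      # placeholder job name
spec:
  version: "1.0"
  mode: cluster                                           # only cluster mode is currently supported
  sparkImage: docker.stackable.tech/stackable/spark-k8s:3.3.0-stackable0.3.0
  mainApplicationFile: s3a://my-bucket/jobs/my-job.jar    # hypothetical artifact location
  mainClass: org.example.MyJob                            # hypothetical entry point
  args:
    - "--input"
    - "s3a://my-bucket/data/"                             # hypothetical job argument
  sparkConf:
    spark.kubernetes.submission.waitAppCompletion: "false"  # map of key/value strings passed to spark-submit
  executor:
    instances: 3
----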

docs/modules/spark-k8s/pages/getting_started/installation.adoc

Lines changed: 1 addition & 1 deletion

@@ -8,7 +8,7 @@ Spark applications almost always require dependencies like database drivers, RES

 More information about the different ways to define Spark jobs and their dependencies is given on the following pages:

-- xref:usage.adoc[]
+- xref:usage-guide/index.adoc[]
 - xref:job_dependencies.adoc[]

 == Stackable Operators

docs/modules/spark-k8s/pages/index.adoc

Lines changed: 29 additions & 10 deletions

@@ -1,18 +1,37 @@
-= Stackable Operator for Apache Spark on Kubernetes
+= Stackable Operator for Apache Spark
+:description: The Stackable Operator for Apache Spark is a Kubernetes operator that can manage Apache Spark clusters. Learn about its features, resources, dependencies and demos, and see the list of supported Spark versions.
+:keywords: Stackable Operator, Apache Spark, Kubernetes, operator, data science, engineer, big data, CRD, StatefulSet, ConfigMap, Service, S3, demo, version

-This is an operator for Kubernetes that can manage https://spark.apache.org/[Apache Spark] kubernetes clusters.
+This is an operator for Kubernetes that can manage https://spark.apache.org/[Apache Spark] Kubernetes clusters. Apache Spark is a powerful open-source big data processing framework that allows for efficient and flexible distributed computing. Its in-memory processing and fault-tolerant architecture make it ideal for a variety of use cases, including batch processing, real-time streaming, machine learning, and graph processing.

-WARNING: This operator only works with images from the https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%2Fspark[Stackable] repository
+== Getting Started
+
+Follow the xref:getting_started/index.adoc[] guide to get started with Apache Spark using the Stackable Operator. The guide will lead you through the installation of the Operator and running your first Spark job on Kubernetes.
+
+== RBAC
+
+The https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac[Spark-Kubernetes RBAC documentation] describes what is needed for `spark-submit` jobs to run successfully: minimally, a role/cluster-role to allow the driver pod to create and manage executor pods.
+
+However, to add security, each `spark-submit` job launched by the spark-k8s operator is assigned its own service account.
+
+When the spark-k8s operator is installed via Helm, a cluster role named `spark-k8s-clusterrole` is created with pre-defined permissions.
+
+When a new Spark application is submitted, the operator creates a new service account with the same name as the application and binds this account to the cluster role `spark-k8s-clusterrole` created by Helm.
+
+== Integrations
+
+- Kafka
+- S3
+- loading custom dependencies
+
+== Demos
+
+The xref:stackablectl::demos/data-lakehouse-iceberg-trino-spark.adoc[] demo connects multiple components and datasets into a data Lakehouse. A Spark application with https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html[structured streaming] is used to stream data from Apache Kafka into the Lakehouse.
+
+In the xref:stackablectl::demos/spark-k8s-anomaly-detection-taxi-data.adoc[] demo, Spark is used to read training data from S3 and train an anomaly detection model on the data. The model is then stored in a Trino table.

 == Supported Versions

 The Stackable Operator for Apache Spark on Kubernetes currently supports the following versions of Spark:

 include::partial$supported-versions.adoc[]
-
-== Getting the Docker image
-
-[source]
-----
-docker pull docker.stackable.tech/stackable/spark-k8s:<version>
-----
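
As a rough illustration of the RBAC setup described in the new index page (these resources are not taken from this commit), a per-application service account bound to the Helm-created cluster role could look roughly like the sketch below. Names and namespace are placeholders, and whether the operator uses a `RoleBinding` or a `ClusterRoleBinding` is an assumption here.

[source,yaml]
----
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-spark-app            # the operator names this after the SparkApplication
  namespace: default            # placeholder namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding               # assumption; a ClusterRoleBinding would also work
metadata:
  name: my-spark-app
  namespace: default
subjects:
  - kind: ServiceAccount
    name: my-spark-app
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: spark-k8s-clusterrole   # created by the Helm installation
----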

docs/modules/spark-k8s/pages/rbac.adoc

Lines changed: 0 additions & 11 deletions
This file was deleted.

docs/modules/spark-k8s/pages/history_server.adoc renamed to docs/modules/spark-k8s/pages/usage-guide/history-server.adoc

Lines changed: 1 addition & 17 deletions

@@ -1,4 +1,5 @@
 = Spark History Server
+:page-aliases: history_server.adoc

 == Overview

@@ -48,23 +49,6 @@ include::example$example-history-app.yaml[]
 <6> Credentials used to write event logs. These can, of course, differ from the credentials used to process data.


-== Log aggregation
-
-The logs can be forwarded to a Vector log aggregator by providing a discovery
-ConfigMap for the aggregator and by enabling the log agent:
-
-[source,yaml]
-----
-spec:
-  vectorAggregatorConfigMapName: vector-aggregator-discovery
-  nodes:
-    config:
-      logging:
-        enableVectorAgent: true
-----
-
-Further information on how to configure logging, can be found in
-xref:home:concepts:logging.adoc[].

 == History Web UI

Lines changed: 89 additions & 0 deletions (new file)

= Usage guide

== Examples

The following examples share these `spec` fields:

- `version`: the current version is "1.0"
- `sparkImage`: the Docker image that will be used by job, driver and executor pods. This can be provided by the user.
- `mode`: only `cluster` is currently supported
- `mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
- `args`: the arguments passed directly to the application. In the examples below it is e.g. the input path for part of the public New York taxi dataset.
- `sparkConf`: Spark configuration settings that are passed directly to `spark-submit` and which are best defined explicitly by the user. Since the `SparkApplication` "knows" that there is an external dependency (the S3 bucket where the data and/or the application is located) and how that dependency should be treated (i.e. what type of credential checks are required, if any), it is better to have these things declared together.
- `volumes`: refers to any volumes needed by the `SparkApplication`, in this case an underlying `PersistentVolumeClaim`.
- `driver`: driver-specific settings, including any volume mounts.
- `executor`: executor-specific settings, including any volume mounts.

Job-specific settings are annotated below.

=== Pyspark: externally located artifact and dataset

[source,yaml]
----
include::example$example-sparkapp-external-dependencies.yaml[]
----

<1> Job python artifact (external)
<2> Job argument (external)
<3> List of python job requirements: these will be installed in the pods via `pip`
<4> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in S3)
<5> The name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
<6> The path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors

=== Pyspark: externally located dataset, artifact available via PVC/volume mount

[source,yaml]
----
include::example$example-sparkapp-image.yaml[]
----

<1> Job image: this contains the job artifact that will be retrieved from the volume mount backed by the PVC
<2> Job python artifact (local)
<3> Job argument (external)
<4> List of python job requirements: these will be installed in the pods via `pip`
<5> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)

=== JVM (Scala): externally located artifact and dataset

[source,yaml]
----
include::example$example-sparkapp-pvc.yaml[]
----

<1> Job artifact located on S3.
<2> Job main class
<3> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials)
<4> The name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
<5> The path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors

=== JVM (Scala): externally located artifact accessed with credentials

[source,yaml]
----
include::example$example-sparkapp-s3-private.yaml[]
----

<1> Job artifact (located in an S3 store)
<2> Artifact class
<3> S3 section, specifying the existing secret and S3 end-point (in this case, MinIO)
<4> Credentials referencing a secretClass (not shown in this example)
<5> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources...
<6> ...in this case, in an S3 store, accessed with the credentials defined in the secret

=== JVM (Scala): externally located artifact accessed with job arguments provided via configuration map

[source,yaml]
----
include::example$example-configmap.yaml[]
----
[source,yaml]
----
include::example$example-sparkapp-configmap.yaml[]
----
<1> Name of the configuration map
<2> Argument required by the job
<3> Job Scala artifact that requires an input argument
<4> The volume backed by the configuration map
<5> The expected job argument, accessed via the mounted configuration map file
<6> The name of the volume backed by the configuration map that will be mounted to the driver/executor
<7> The mount location of the volume (this will contain a file `/arguments/job-args.txt`)
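
The example files above are pulled in via `include::` directives and are not reproduced in this diff. As a hedged sketch of the pattern the callouts describe (pre-existing PVC, volume mounts on driver and executors, extra class path in `sparkConf`), a fragment of a `SparkApplication` spec might look as follows; the claim name and mount path are placeholders.

[source,yaml]
----
spec:
  sparkConf:
    spark.driver.extraClassPath: /dependencies/jars/*      # must match the mount path below
    spark.executor.extraClassPath: /dependencies/jars/*
  volumes:
    - name: job-deps
      persistentVolumeClaim:
        claimName: pvc-job-deps       # pre-existing PVC (placeholder name)
  driver:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies      # placeholder mount path
  executor:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies
----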

docs/modules/spark-k8s/pages/job_dependencies.adoc renamed to docs/modules/spark-k8s/pages/usage-guide/job-dependencies.adoc

Lines changed: 1 addition & 0 deletions

@@ -1,4 +1,5 @@
 = Job Dependencies
+:page-aliases: job_dependencies.adoc

 == Overview

Lines changed: 36 additions & 0 deletions (new file)

= Resource Requests

include::home:concepts:stackable_resource_requests.adoc[]

If no resources are configured explicitly, the operator uses the following defaults:

[source,yaml]
----
job:
  resources:
    cpu:
      min: '500m'
      max: "1"
    memory:
      limit: '1Gi'
driver:
  resources:
    cpu:
      min: '1'
      max: "2"
    memory:
      limit: '2Gi'
executor:
  resources:
    cpu:
      min: '1'
      max: "4"
    memory:
      limit: '4Gi'
----
WARNING: The default values are _most likely_ not sufficient to run a proper cluster in production. Please adapt according to your requirements.
For more details regarding Kubernetes CPU limits see: https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/[Assign CPU Resources to Containers and Pods].

Spark allocates a default amount of non-heap memory based on the type of job (JVM or non-JVM). This is taken into account when defining memory settings based exclusively on the resource limits, so that the "declared" value is the actual total value (i.e. including memory overhead). This may result in minor deviations from the stated resource value due to rounding differences.

NOTE: It is possible to define Spark resources either directly by setting configuration properties listed under `sparkConf`, or by using resource limits. If both are used, then `sparkConf` properties take precedence. It is recommended for the sake of clarity to use *_either_* one *_or_* the other.
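
To make the note above concrete, the same driver and executor sizing could instead be expressed through `spark-submit` properties under `sparkConf`. This is only an indicative sketch that mirrors the defaults shown above; since `sparkConf` takes precedence, pick one of the two styles rather than mixing them.

[source,yaml]
----
spec:
  sparkConf:
    spark.driver.cores: "1"
    spark.driver.memory: "2g"
    spark.executor.cores: "4"
    spark.executor.memory: "4g"
----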
Lines changed: 48 additions & 0 deletions (new file)

= S3 bucket specification

You can specify S3 connection details directly inside the `SparkApplication` specification or by referring to an external `S3Connection` custom resource.

To specify S3 connection details directly as part of the `SparkApplication` resource, add an inline connection configuration as shown below.

[source,yaml]
----
s3connection: # <1>
  inline:
    host: test-minio # <2>
    port: 9000 # <3>
    accessStyle: Path
    credentials:
      secretClass: s3-credentials-class # <4>
----
<1> Entry point for the S3 connection configuration.
<2> Connection host.
<3> Optional connection port.
<4> Name of the `Secret` object expected to contain the following keys: `ACCESS_KEY_ID` and `SECRET_ACCESS_KEY`.

It is also possible to configure the connection details as a separate Kubernetes resource and only refer to that object from the `SparkApplication`, like this:

[source,yaml]
----
s3connection:
  reference: s3-connection-resource # <1>
----
<1> Name of the connection resource with connection details.

The resource named `s3-connection-resource` is then defined as shown below:

[source,yaml]
----
---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Connection
metadata:
  name: s3-connection-resource
spec:
  host: test-minio
  port: 9000
  accessStyle: Path
  credentials:
    secretClass: minio-credentials-class
----

This has the advantage that one connection configuration can be shared across `SparkApplication` resources and reduces the cost of updating these details.
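
The `secretClass` entries above ultimately resolve to S3 credentials. As a hedged sketch (the exact wiring through the Stackable secret-operator is outside the scope of this page), a backing `Secret` carrying the two expected keys might look like this; the name and values are placeholders.

[source,yaml]
----
---
apiVersion: v1
kind: Secret
metadata:
  name: minio-credentials            # placeholder name
stringData:
  ACCESS_KEY_ID: my-access-key       # placeholder value
  SECRET_ACCESS_KEY: my-secret-key   # placeholder value
----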
