
Commit 87fc2b1

adwk67 and razvan committed

Getting start docs (#114)

# Description

Addition of "getting started" documentation

Co-authored-by: Razvan-Daniel Mihai <[email protected]>

1 parent 8d58660 commit 87fc2b1

18 files changed (+354 -80 lines)

CHANGELOG.md (+8)

@@ -4,11 +4,19 @@ All notable changes to this project will be documented in this file.
 
 ## [Unreleased]
 
+### Added
+
+- Add Getting Started documentation ([#114]).
+
+[#114]: https://github.com/stackabletech/spark-k8s-operator/pull/114
+
 ### Fixed
 
 - Add missing role to read S3Connection and S3Bucket objects ([#112]).
+- Update annotation due to update to rust version ([#114]).
 
 [#112]: https://github.com/stackabletech/spark-k8s-operator/pull/112
+[#114]: https://github.com/stackabletech/spark-k8s-operator/pull/114
 
 ## [0.4.0] - 2022-08-03

docs/antora.yml (+1)

@@ -3,5 +3,6 @@ name: spark-k8s
 version: "nightly"
 title: Stackable Operator for Apache Spark on Kubernetes
 nav:
+  - modules/getting_started/nav.adoc
   - modules/ROOT/nav.adoc
 prerelease: true

docs/modules/ROOT/nav.adoc (+1 -1)

@@ -1,4 +1,4 @@
-* xref:installation.adoc[]
+* xref:configuration.adoc[]
 * xref:usage.adoc[]
 * xref:job_dependencies.adoc[]
 * xref:rbac.adoc[]

docs/modules/ROOT/pages/installation.adoc (-51)

This file was deleted.

docs/modules/ROOT/pages/usage.adoc (-27)

@@ -1,32 +1,5 @@
 = Usage
 
-== Create an Apache Spark job
-
-If you followed the installation instructions, you should now have a Stackable Operator for Apache Spark up and running, and you are ready to create your first Apache Spark kubernetes cluster.
-
-The example below creates a job running on Apache Spark 3.3.0, using the spark-on-kubernetes paradigm described in the spark documentation. The application file is itself part of the spark distribution and `local` refers to the path on the driver/executors; there are no external dependencies.
-
-cat <<EOF | kubectl apply -f -
-apiVersion: spark.stackable.tech/v1alpha1
-kind: SparkApplication
-metadata:
-  name: spark-clustermode-001
-spec:
-  version: 1.0
-  mode: cluster
-  mainClass: org.apache.spark.examples.SparkPi
-  mainApplicationFile: local:///stackable/spark/examples/jars/spark-examples_2.12-3.3.0.jar
-  image: 3.3.0-stackable0.1.0
-  driver:
-    cores: 1
-    coreLimit: "1200m"
-    memory: "512m"
-  executor:
-    cores: 1
-    instances: 3
-    memory: "512m"
-EOF
-
 == Examples
 
 The following examples have the following `spec` fields in common:

New file (+64)

@@ -0,0 +1,64 @@
#!/usr/bin/env bash
set -euo pipefail

# This script contains all the code snippets from the guide, as well as some assert tests
# to test if the instructions in the guide work. The user *could* use it, but it is intended
# for testing only.
# The script will install the operators, create a Spark application and wait for the job
# to complete, then check the driver log for the expected result.
# No running processes are left behind.

if [ $# -eq 0 ]
then
  echo "Installation method argument ('helm' or 'stackablectl') required."
  exit 1
fi

case "$1" in
"helm")
  echo "Adding 'stackable-dev' Helm Chart repository"
  # tag::helm-add-repo[]
  helm repo add stackable-dev https://repo.stackable.tech/repository/helm-dev/
  # end::helm-add-repo[]
  echo "Installing Operators with Helm"
  # tag::helm-install-operators[]
  helm install --wait commons-operator stackable-dev/commons-operator --version 0.3.0-nightly
  helm install --wait secret-operator stackable-dev/secret-operator --version 0.6.0-nightly
  helm install --wait spark-k8s-operator stackable-dev/spark-k8s-operator --version 0.5.0-nightly
  # end::helm-install-operators[]
  ;;
"stackablectl")
  echo "Installing Operators with stackablectl"
  # tag::stackablectl-install-operators[]
  stackablectl operator install \
    commons=0.3.0-nightly \
    secret=0.6.0-nightly \
    spark-k8s=0.5.0-nightly
  # end::stackablectl-install-operators[]
  ;;
*)
  echo "Need to give 'helm' or 'stackablectl' as an argument for which installation method to use!"
  exit 1
  ;;
esac

echo "Creating a Spark Application..."
# tag::install-sparkapp[]
kubectl apply -f pyspark-pi.yaml
# end::install-sparkapp[]

echo "Waiting for job to complete ..."
# tag::wait-for-job[]
kubectl wait pods -l 'job-name=pyspark-pi' \
  --for jsonpath='{.status.phase}'=Succeeded \
  --timeout 300s
# end::wait-for-job[]

result=$(kubectl logs -l 'spark-role=driver' --tail=-1 | grep "Pi is roughly" || true)  # '|| true': don't abort under 'set -e' when no result line is found

if [ "$result" == "" ]; then
  echo "Log result was not found!"
  exit 1
else
  echo "Job result: $result"
fi
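
For reference, the script above expects the installation method as its only argument. A local run would look roughly like this (assuming the file is saved as `getting_started.sh`, the name used by the include directives further down in this commit):

[source,bash]
----
# make the script executable and pick one of the two supported installation methods
chmod +x getting_started.sh
./getting_started.sh stackablectl   # or: ./getting_started.sh helm
----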

New file (+64)

@@ -0,0 +1,64 @@
#!/usr/bin/env bash
set -euo pipefail

# This script contains all the code snippets from the guide, as well as some assert tests
# to test if the instructions in the guide work. The user *could* use it, but it is intended
# for testing only.
# The script will install the operators, create a Spark application and wait for the job
# to complete, then check the driver log for the expected result.
# No running processes are left behind.

if [ $# -eq 0 ]
then
  echo "Installation method argument ('helm' or 'stackablectl') required."
  exit 1
fi

case "$1" in
"helm")
  echo "Adding 'stackable-dev' Helm Chart repository"
  # tag::helm-add-repo[]
  helm repo add stackable-dev https://repo.stackable.tech/repository/helm-dev/
  # end::helm-add-repo[]
  echo "Installing Operators with Helm"
  # tag::helm-install-operators[]
  helm install --wait commons-operator stackable-dev/commons-operator --version {{ versions.commons }}
  helm install --wait secret-operator stackable-dev/secret-operator --version {{ versions.secret }}
  helm install --wait spark-k8s-operator stackable-dev/spark-k8s-operator --version {{ versions.spark }}
  # end::helm-install-operators[]
  ;;
"stackablectl")
  echo "Installing Operators with stackablectl"
  # tag::stackablectl-install-operators[]
  stackablectl operator install \
    commons={{ versions.commons }} \
    secret={{ versions.secret }} \
    spark-k8s={{ versions.spark }}
  # end::stackablectl-install-operators[]
  ;;
*)
  echo "Need to give 'helm' or 'stackablectl' as an argument for which installation method to use!"
  exit 1
  ;;
esac

echo "Creating a Spark Application..."
# tag::install-sparkapp[]
kubectl apply -f pyspark-pi.yaml
# end::install-sparkapp[]

echo "Waiting for job to complete ..."
# tag::wait-for-job[]
kubectl wait pods -l 'job-name=pyspark-pi' \
  --for jsonpath='{.status.phase}'=Succeeded \
  --timeout 300s
# end::wait-for-job[]

result=$(kubectl logs -l 'spark-role=driver' --tail=-1 | grep "Pi is roughly" || true)  # '|| true': don't abort under 'set -e' when no result line is found

if [ "$result" == "" ]; then
  echo "Log result was not found!"
  exit 1
else
  echo "Job result:" "$result"
fi

New file (+19)

@@ -0,0 +1,19 @@
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-pi
  namespace: default
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/pyspark-k8s:3.3.0-stackable0.1.0
  mode: cluster
  mainApplicationFile: local:///stackable/spark/examples/src/main/python/pi.py
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
  executor:
    cores: 1
    instances: 3
    memory: "512m"
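
After this manifest is applied, the resulting custom resource can be inspected with `kubectl`. This is a minimal sketch, not part of the commit; the plural resource name `sparkapplications` is an assumption based on the `SparkApplication` kind above:

[source,bash]
----
# list SparkApplication resources in the default namespace (resource name assumed)
kubectl get sparkapplications -n default
# watch the job, driver and executor pods appear and complete
kubectl get pods -n default --watch
----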

docs/modules/getting_started/nav.adoc (+3)

@@ -0,0 +1,3 @@
* xref:index.adoc[]
** xref:installation.adoc[]
** xref:first_steps.adoc[]

New file (+65)

@@ -0,0 +1,65 @@
= First steps

Once you have followed the steps in the xref:installation.adoc[] section to install the operator and its dependencies, you can now create a Spark job. Afterwards you can <<_verify_that_it_works, verify that it works>> by looking at the logs from the driver pod.

== Starting a Spark job

A Spark application is made up of three components:

- Job: builds a `spark-submit` command from the resource and passes it to internal Spark code, together with templates for building the driver and executor pods
- Driver: the driver starts the designated number of executors and removes them when the job is completed.
- Executor(s): responsible for executing the job itself

Create a file named `pyspark-pi.yaml` with the following contents:

[source,yaml]
----
include::example$code/pyspark-pi.yaml[]
----

And apply it:

----
include::example$code/getting_started.sh[tag=install-sparkapp]
----

Where:

- `metadata.name` contains the name of the SparkApplication
- `spec.version`: the current version is "1.0"
- `spec.sparkImage`: the docker image that will be used by job, driver and executor pods. This can be provided by the user.
- `spec.mode`: only `cluster` is currently supported
- `spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job. This path is relative to the image, so in this case we are running an example Python script (that calculates the value of pi): it is bundled with the Spark code and therefore already present in the job image
- `spec.driver`: driver-specific settings.
- `spec.executor`: executor-specific settings.


NOTE: If you are using Stackable image versions, please note that the version you need to specify for `spec.sparkImage` is not only the version of Spark which you want to roll out, but has to be amended with a Stackable version as shown. This Stackable version is the version of the underlying container image which is used to execute the processes. For a list of available versions please check our
https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%2Fspark-k8s%2Ftags[image registry].
It should generally be safe to simply use the latest image version that is available.

This will create the `SparkApplication` that in turn creates the Spark job.

== Verify that it works

As mentioned above, the `SparkApplication` that has just been created will build a `spark-submit` command and pass it to the driver pod, which in turn will create executor pods that run for the duration of the job before being cleaned up. A running process will look like this:

image::spark_running.png[Spark job]

- `pyspark-pi-xxxx`: this is the initialising job that creates the spark-submit command (named as `metadata.name` with a unique suffix)
- `pyspark-pi-xxxxxxx-driver`: the driver pod that drives the execution
- `pythonpi-xxxxxxxxx-exec-x`: the set of executors started by the driver (in our example `spec.executor.instances` was set to 3, which is why we have 3 executors)

Job progress can be followed by issuing this command:

----
include::example$code/getting_started.sh[tag=wait-for-job]
----

When the job completes, the driver cleans up the executors. The initial job is persisted for several minutes before being removed. The completed state will look like this:

image::spark_complete.png[Completed job]

The driver logs can be inspected for more information about the results of the job. In this case we expect to find the results of our (approximate!) pi calculation:

image::spark_log.png[Driver log]
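
For readers who prefer the command line to the screenshots referenced above, the driver log can also be checked directly. The command below is lifted from the test script earlier in this commit, which selects the driver pod via the `spark-role=driver` label:

[source,bash]
----
# print the driver log and pick out the result line
kubectl logs -l 'spark-role=driver' --tail=-1 | grep "Pi is roughly"
----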

New file (+18)

@@ -0,0 +1,18 @@
= Getting started

This guide will get you started with Spark using the Stackable Operator. It will guide you through the installation of the Operator and its dependencies, executing your first Spark job and reviewing its result.

== Prerequisites

You will need:

* a Kubernetes cluster
* kubectl
* Helm

== What's next

The guide is divided into two steps:

* xref:installation.adoc[Installing the Operators].
* xref:first_steps.adoc[Starting a Spark job].
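
As a quick sanity check of the prerequisites listed above (a sketch, not part of the guide itself), the client tools and cluster connectivity can be verified like this:

[source,bash]
----
# confirm the client tools are installed and the cluster is reachable
kubectl version --client
helm version
kubectl cluster-info
----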
