diff --git a/CHANGELOG.md b/CHANGELOG.md index 830e5be0..7e8f594e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,11 +4,19 @@ All notable changes to this project will be documented in this file. ## [Unreleased] +### Added + +- Add Getting Started documentation ([#114]). + +[#114]: https://github.com/stackabletech/spark-k8s-operator/pull/114 + ### Fixed - Add missing role to read S3Connection and S3Bucket objects ([#112]). +- Update annotation due to update to rust version ([#114]). [#112]: https://github.com/stackabletech/spark-k8s-operator/pull/112 +[#114]: https://github.com/stackabletech/spark-k8s-operator/pull/114 ## [0.4.0] - 2022-08-03 diff --git a/docs/antora.yml b/docs/antora.yml index 0daba371..ebdcc6eb 100644 --- a/docs/antora.yml +++ b/docs/antora.yml @@ -3,5 +3,6 @@ name: spark-k8s version: "nightly" title: Stackable Operator for Apache Spark on Kubernetes nav: + - modules/getting_started/nav.adoc - modules/ROOT/nav.adoc prerelease: true diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc index 9ad2189a..cca2c4a9 100644 --- a/docs/modules/ROOT/nav.adoc +++ b/docs/modules/ROOT/nav.adoc @@ -1,4 +1,4 @@ -* xref:installation.adoc[] +* xref:configuration.adoc[] * xref:usage.adoc[] * xref:job_dependencies.adoc[] * xref:rbac.adoc[] diff --git a/docs/modules/ROOT/pages/installation.adoc b/docs/modules/ROOT/pages/installation.adoc deleted file mode 100644 index d2bfb8f1..00000000 --- a/docs/modules/ROOT/pages/installation.adoc +++ /dev/null @@ -1,51 +0,0 @@ -= Installation - -There are three ways to run the Spark Operator: - -1. Helm managed Docker container deployment on Kubernetes - -2. Build from source - -== Helm - -Helm allows you to download and deploy Stackable operators on Kubernetes and is by far the easiest -installation method. First ensure that you have installed the Stackable Operators Helm repository: - -[source,bash] ----- -helm repo add stackable https://repo.stackable.tech/repository/helm-stable/ ----- - -Then install the Stackable Operator for Apache Spark -[source,bash] ----- -helm install spark-k8s-operator stackable/spark-k8s-operator ----- - -Helm will deploy the operator in a Kubernetes container and apply the CRDs for the Apache Spark -service. You are now ready to deploy Apache Spark in Kubernetes. - -== Building the operator from source - -To run it from your local machine - usually for development purposes - you need to install the required manifest files. - -[source,bash] ----- -make renenerate-charts -kubectl create -f deploy/manifests ----- - -Then, start the operator: - -[source,bash] ----- -cargo run -- run ----- - -== Additional/Optional components - -The above describes the installation of the operator alone and is sufficient for spark jobs that do not require any external dependencies. In practice, this is often not the case and spark- and/or job-dependencies will be required. These can be made available in different ways - e.g. by including them in the spark images used by `spark-submit`, reading them external repositories or by using local external storage such as Kuberentes persistent volumes. See the <> page for detailed information. - -== Examples - -The examples provided with the operator code show different ways of combining these elements. 
diff --git a/docs/modules/ROOT/pages/usage.adoc b/docs/modules/ROOT/pages/usage.adoc index fe534593..fe0b5c6c 100644 --- a/docs/modules/ROOT/pages/usage.adoc +++ b/docs/modules/ROOT/pages/usage.adoc @@ -1,32 +1,5 @@ = Usage -== Create an Apache Spark job -If you followed the installation instructions, you should now have a Stackable Operator for Apache Spark up and running, and you are ready to create your first Apache Spark kubernetes cluster. -The example below creates a job running on Apache Spark 3.3.0, using the spark-on-kubernetes paradigm described in the spark documentation. The application file is itself part of the spark distribution and `local` refers to the path on the driver/executors; there are no external dependencies. - cat <> by looking at the logs from the driver pod. + +== Starting a Spark job + +A Spark application is made up of three components: + +- Job: this builds a `spark-submit` command from the resource and passes it to internal Spark code, together with templates for building the driver and executor pods. +- Driver: the driver starts the designated number of executors and removes them when the job is completed. +- Executor(s): the executors are responsible for executing the job itself. + +Create a file named `pyspark-pi.yaml` with the following contents: + +[source,yaml] +---- +include::example$code/pyspark-pi.yaml[] +---- + +And apply it: + +---- +include::example$code/getting_started.sh[tag=install-sparkapp] +---- + +Where: + +- `metadata.name` contains the name of the SparkApplication +- `spec.version`: the current version is "1.0" +- `spec.sparkImage`: the Docker image that will be used by the job, driver and executor pods. This can be provided by the user. +- `spec.mode`: only `cluster` is currently supported +- `spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job. This path is relative to the image, so in this case we are running an example Python script (that calculates the value of pi): it is bundled with the Spark code and therefore already present in the job image +- `spec.driver`: driver-specific settings. +- `spec.executor`: executor-specific settings. + + +NOTE: If using Stackable image versions, please note that the version you need to specify for `spec.version` is not only the version of Spark which you want to roll out, but has to be amended with a Stackable version as shown. This Stackable version is the version of the underlying container image which is used to execute the processes. For a list of available versions please check our +https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%2Fspark-k8s%2Ftags[image registry]. +It should generally be safe to simply use the latest image version that is available. + +This will create the `SparkApplication` that in turn creates the Spark job; a sketch of what such a manifest can look like is shown further down. + +== Verify that it works + +As mentioned above, the `SparkApplication` that has just been created will build a `spark-submit` command and pass it to the driver pod, which in turn will create executor pods that run for the duration of the job before being cleaned up.
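+
+For reference, a manifest with the fields described above might look roughly like the following sketch. It is an illustration only, assuming the `spark.stackable.tech/v1alpha1` API group and typical values for the image tag and application path; the authoritative example is the included `pyspark-pi.yaml`:
+
+[source,yaml]
+----
+---
+apiVersion: spark.stackable.tech/v1alpha1  # assumed API version of the SparkApplication CRD
+kind: SparkApplication
+metadata:
+  name: pyspark-pi  # becomes the prefix of the job, driver and executor pod names
+spec:
+  version: "1.0"
+  sparkImage: docker.stackable.tech/stackable/pyspark-k8s:3.3.0-stackable0.1.0  # assumed image tag
+  mode: cluster
+  mainApplicationFile: local:///stackable/spark/examples/src/main/python/pi.py  # assumed path to the bundled pi example
+  driver:
+    cores: 1
+    memory: "512m"
+  executor:
+    cores: 1
+    instances: 3
+    memory: "512m"
+----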
A running process will look like this: + +image::spark_running.png[Spark job] + +- `pyspark-pi-xxxx`: this is the initialising job that creates the spark-submit command (named as `metadata.name` with a unique suffix) +- `pyspark-pi-xxxxxxx-driver`: the driver pod that drives the execution +- `pythonpi-xxxxxxxxx-exec-x`: the set of executors started by the driver (in our example `spec.executor.instances` was set to 3, which is why we have 3 executors) + +Job progress can be followed by issuing this command: + +---- +include::example$code/getting_started.sh[tag=wait-for-job] +---- + +When the job completes, the driver cleans up the executor pods. The initial job is persisted for several minutes before being removed. The completed state will look like this: + +image::spark_complete.png[Completed job] + +The driver logs can be inspected for more information about the results of the job. In this case we expect to find the results of our (approximate!) pi calculation: + +image::spark_log.png[Driver log] \ No newline at end of file diff --git a/docs/modules/getting_started/pages/index.adoc b/docs/modules/getting_started/pages/index.adoc new file mode 100644 index 00000000..baf0c658 --- /dev/null +++ b/docs/modules/getting_started/pages/index.adoc @@ -0,0 +1,18 @@ += Getting started + +This guide will get you started with Spark using the Stackable Operator. It walks you through installing the operator and its dependencies, executing your first Spark job and reviewing the result. + +== Prerequisites + +You will need: + +* a Kubernetes cluster +* kubectl +* Helm + +== What's next + +The guide is divided into two steps: + +* xref:installation.adoc[Installing the Operators]. +* xref:first_steps.adoc[Starting a Spark job]. \ No newline at end of file diff --git a/docs/modules/getting_started/pages/installation.adoc b/docs/modules/getting_started/pages/installation.adoc new file mode 100644 index 00000000..eafa80c5 --- /dev/null +++ b/docs/modules/getting_started/pages/installation.adoc @@ -0,0 +1,62 @@ += Installation + +On this page you will install the Stackable Spark-on-Kubernetes operator as well as the Commons and Secret operators, which are required by all Stackable operators. + +== Dependencies + +Spark applications almost always require dependencies like database drivers, REST API clients and many others. These dependencies must be available on the `classpath` of each executor (and in some cases of the driver, too). There are multiple ways to provision Spark jobs with such dependencies: some are built into Spark itself while others are implemented at the operator level. In this guide we are going to keep things simple and look at executing a Spark job that has a minimum of dependencies. + +More information about the different ways to define Spark jobs and their dependencies is given on the following pages: + +- xref:ROOT:usage.adoc[] +- xref:ROOT:job_dependencies.adoc[] + +== Stackable Operators + +There are two ways to install Stackable operators: + +1. Using xref:stackablectl::index.adoc[] + +2. Using a Helm chart + +=== stackablectl + +`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install operators. +Follow the xref:stackablectl::installation.adoc[installation steps] for your platform.
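+
+The snippet included below comes from a templated script, so the operator versions are filled in from `docs/templating_vars.yaml`. Purely as an illustration (an assumption, not the authoritative command, which is the included snippet), the rendered command can be expected to look something like this:
+
+[source,bash]
+----
+# hypothetical rendered form of the stackablectl-install-operators snippet
+stackablectl operator install commons=0.3.0-nightly secret=0.6.0-nightly spark-k8s=0.5.0-nightly
+----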
+ +After you have installed `stackablectl`, run the following command to install the Spark-k8s operator: + +[source,bash] +---- +include::example$code/getting_started.sh[tag=stackablectl-install-operators] +---- + +The tool will show: + +---- +[INFO ] Installing commons operator +[INFO ] Installing secret operator +[INFO ] Installing spark-k8s operator +---- + +TIP: Consult the xref:stackablectl::quickstart.adoc[] to learn more about how to use `stackablectl`. For example, you can use the `-k` flag to create a Kubernetes cluster with link:https://kind.sigs.k8s.io/[kind]. + +=== Helm + +You can also use Helm to install the operators. Add the Stackable Helm repository: +[source,bash] +---- +include::example$code/getting_started.sh[tag=helm-add-repo] +---- + +Then install the Stackable Operators: +[source,bash] +---- +include::example$code/getting_started.sh[tag=helm-install-operators] +---- + +Helm will deploy the operators in Kubernetes Deployments and apply the CRDs for the `SparkApplication` (as well as the CRDs for the required operators). You are now ready to create a Spark job. + +== What's next + +xref:first_steps.adoc[Execute a Spark Job] and xref:first_steps.adoc#_verify_that_it_works[verify that it works] by inspecting the pod logs. \ No newline at end of file diff --git a/docs/templating_vars.yaml b/docs/templating_vars.yaml new file mode 100644 index 00000000..8cabc9d8 --- /dev/null +++ b/docs/templating_vars.yaml @@ -0,0 +1,8 @@ +--- +helm: + repo_name: stackable-dev + repo_url: https://repo.stackable.tech/repository/helm-dev/ +versions: + commons: 0.3.0-nightly + secret: 0.6.0-nightly + spark: 0.5.0-nightly diff --git a/rust/crd/src/lib.rs b/rust/crd/src/lib.rs index 0b3ecd2e..5fdf07d8 100644 --- a/rust/crd/src/lib.rs +++ b/rust/crd/src/lib.rs @@ -109,7 +109,7 @@ pub enum ImagePullPolicy { Never, } -#[derive(Clone, Debug, Default, Deserialize, JsonSchema, PartialEq, Serialize)] +#[derive(Clone, Debug, Default, Deserialize, JsonSchema, PartialEq, Eq, Serialize)] #[serde(rename_all = "camelCase")] pub struct JobDependencies { #[serde(default, skip_serializing_if = "Option::is_none")] diff --git a/scripts/docs_templating.sh b/scripts/docs_templating.sh new file mode 100755 index 00000000..66910d9f --- /dev/null +++ b/scripts/docs_templating.sh @@ -0,0 +1,40 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Reads a file with variables to insert into templates, and templates all .*.j2 files +# in the 'docs' directory. +# +# dependencies +# pip install jinja2-cli + +docs_dir=../docs +templating_vars_file=$docs_dir/templating_vars.yaml + +# Check if files need templating +if [[ $(find "$docs_dir" | grep --count .j2\$) -eq "0" ]]; +then + echo "No files need templating, exiting." + exit +fi + +# Check if jinja2 is there +if ! command -v jinja2 &> /dev/null +then + echo "jinja2 could not be found. Use 'pip install jinja2-cli' to install it." + exit 1 +fi + +# Check if templating vars file exists +if [[ ! -f "$templating_vars_file" ]]; +then + echo "$templating_vars_file does not exist, cannot start templating." + exit 1 +fi + +for file in $(find "$docs_dir" | grep .j2\$) +do + new_file_name=$(echo "$file" | sed 's/\(.*\).j2/\1/g') # cut off the '.j2' + echo "templating $new_file_name" + jinja2 "$file" "$templating_vars_file" -o "$new_file_name" +done + +echo "done"
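+
+# Illustration (an assumption, not a file added by this change): a templated file under
+# 'docs' can reference the variables from templating_vars.yaml using standard Jinja2
+# syntax, e.g. a hypothetical getting_started.sh.j2 snippet could contain lines such as:
+#
+#   helm repo add {{ helm.repo_name }} {{ helm.repo_url }}
+#   helm install spark-k8s-operator {{ helm.repo_name }}/spark-k8s-operator --version {{ versions.spark }}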