
Commit e3d877f

fhennig and maltesander authored
docs: wording improvements, better install instructions (#470)
* ~
* fix whitespace
* Apply suggestions from code review

---------

Co-authored-by: Malte Sander <[email protected]>
1 parent c94300f commit e3d877f

13 files changed: +85 -63 lines changed

docs/modules/spark-k8s/pages/getting_started/first_steps.adoc

+8 -7

@@ -8,7 +8,7 @@ Afterwards you can <<_verify_that_it_works, verify that it works>> by looking at
 
 A Spark application is made up of three components:
 
-* Job: this will build a `spark-submit` command from the resource, passing this to internal spark code together with templates for building the driver and executor pods
+* Job: this builds a `spark-submit` command from the resource, passing it to internal Spark code together with templates for building the driver and executor pods
 * Driver: the driver starts the designated number of executors and removes them when the job is completed.
 * Executor(s): responsible for executing the job itself
 
@@ -25,20 +25,21 @@ Where:
 * `spec.version`: SparkApplication version (1.0). This can be freely set by the user and is added by the operator as a label to all workload resources created by the application.
 * `spec.sparkImage`: the image used by the job, driver and executor pods. This can be a custom image built by the user or an official Stackable image. Available official images are listed in the Stackable https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%spark-k8s%2Ftags[image registry].
 * `spec.mode`: only `cluster` is currently supported
-* `spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job. This path is relative to the image, so in this case we are running an example python script (that calculates the value of pi): it is bundled with the Spark code and therefore already present in the job image
+* `spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
+This path is relative to the image, so in this case an example Python script (which calculates the value of pi) is run: it is bundled with the Spark code and therefore already present in the job image
 * `spec.driver`: driver-specific settings.
 * `spec.executor`: executor-specific settings.
 
 == Verify that it works
 
-As mentioned above, the SparkApplication that has just been created will build a `spark-submit` command and pass it to the driver Pod, which in turn will create executor Pods that run for the duration of the job before being clean up.
-A running process will look like this:
+As mentioned above, the SparkApplication that has just been created builds a `spark-submit` command and passes it to the driver Pod, which in turn creates executor Pods that run for the duration of the job before being cleaned up.
+A running process looks like this:
 
 image::getting_started/spark_running.png[Spark job]
 
 * `pyspark-pi-xxxx`: this is the initializing job that creates the spark-submit command (named as `metadata.name` with a unique suffix)
 * `pyspark-pi-xxxxxxx-driver`: the driver pod that drives the execution
-* `pythonpi-xxxxxxxxx-exec-x`: the set of executors started by the driver (in our example `spec.executor.instances` was set to 3 which is why we have 3 executors)
+* `pythonpi-xxxxxxxxx-exec-x`: the set of executors started by the driver (in the example `spec.executor.instances` was set to 3, which is why 3 executors are running)
 
 Job progress can be followed by issuing this command:
 
@@ -48,11 +49,11 @@ include::example$getting_started/getting_started.sh[tag=wait-for-job]
 
 When the job completes, the driver cleans up the executor.
 The initial job is persisted for several minutes before being removed.
-The completed state will look like this:
+The completed state looks like this:
 
 image::getting_started/spark_complete.png[Completed job]
 
 The driver logs can be inspected for more information about the results of the job.
-In this case we expect to find the results of our (approximate!) pi calculation:
+In this case the result of our (approximate!) pi calculation can be found:
 
 image::getting_started/spark_log.png[Driver log]
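
For orientation, a minimal SparkApplication manifest covering the fields listed above might look like the sketch below. Only the field names (`spec.version`, `spec.sparkImage`, `spec.mode`, `spec.mainApplicationFile`, `spec.executor.instances`) come from the documentation; the API version, image tag and script path are illustrative assumptions, not content of this commit.

[source,yaml]
----
---
apiVersion: spark.stackable.tech/v1alpha1   # assumed API group/version of the SparkApplication CRD
kind: SparkApplication
metadata:
  name: pyspark-pi
  namespace: default
spec:
  version: "1.0"                            # freely chosen, added as a label to all workload resources
  sparkImage: docker.stackable.tech/stackable/spark-k8s:<version>   # placeholder; pick an official tag from the image registry
  mode: cluster
  # Path inside the image; the bundled pi example is assumed to live here
  mainApplicationFile: local:///stackable/spark/examples/src/main/python/pi.py
  executor:
    instances: 3                            # three executors, as in the screenshots above
----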

docs/modules/spark-k8s/pages/getting_started/index.adoc

+3 -3

@@ -1,11 +1,11 @@
 = Getting started
 
-This guide will get you started with Spark using the Stackable Operator for Apache Spark.
-It will guide you through the installation of the Operator and its dependencies, executing your first Spark job and reviewing its result.
+This guide gets you started with Spark using the Stackable operator for Apache Spark.
+It guides you through the installation of the operator and its dependencies, executing your first Spark job and reviewing its result.
 
 == Prerequisites
 
-You will need:
+You need:
 
 * a Kubernetes cluster
 * kubectl
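
If no cluster is at hand, a local throwaway cluster is one way to satisfy these prerequisites; the sketch below assumes https://kind.sigs.k8s.io/[kind] (the tool also referenced on the installation page) and an arbitrary cluster name.

[source,bash]
----
# Create a local test cluster (assumes kind and a container runtime are installed)
kind create cluster --name spark-getting-started

# Confirm that kubectl can reach the new cluster
kubectl cluster-info
----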

docs/modules/spark-k8s/pages/getting_started/installation.adoc

+19 -15

@@ -1,7 +1,7 @@
 = Installation
 :description: Learn how to set up Spark with the Stackable Operator, from installation to running your first job, including prerequisites and resource recommendations.
 
-On this page you will install the Stackable Spark-on-Kubernetes operator as well as the commons, secret and listener operators
+Install the Stackable Spark operator as well as the commons, secret and listener operators
 which are required by all Stackable operators.
 
 == Dependencies
@@ -18,24 +18,26 @@ More information about the different ways to define Spark jobs and their depende
 
 == Stackable Operators
 
-There are 2 ways to install Stackable operators
+There are multiple ways to install the Stackable Operator for Apache Spark.
+xref:management:stackablectl:index.adoc[] is the preferred way, but Helm is also supported.
+OpenShift users may prefer installing the operator from the Red Hat Certified Operator catalog using the OpenShift web console.
 
-. Using xref:management:stackablectl:index.adoc[]
-. Using a Helm chart
-
-=== stackablectl
-
-`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install Operators.
+[tabs]
+====
+stackablectl::
++
+--
+`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install operators.
 Follow the xref:management:stackablectl:installation.adoc[installation steps] for your platform.
 
-After you have installed `stackablectl` run the following command to install the Spark-k8s operator:
+After you have installed `stackablectl`, run the following command to install the Spark operator:
 
 [source,bash]
 ----
 include::example$getting_started/getting_started.sh[tag=stackablectl-install-operators]
 ----
 
-The tool will show
+The tool shows
 
 [source]
 ----
@@ -44,24 +46,26 @@ include::example$getting_started/install_output.txt[]
 
 TIP: Consult the xref:management:stackablectl:quickstart.adoc[] to learn more about how to use stackablectl.
 For example, you can use the `--cluster kind` flag to create a Kubernetes cluster with link:https://kind.sigs.k8s.io/[kind].
+--
 
-=== Helm
-
-You can also use Helm to install the operator.
+Helm::
++
+--
 Add the Stackable Helm repository:
 [source,bash]
 ----
 include::example$getting_started/getting_started.sh[tag=helm-add-repo]
 ----
 
-Then install the Stackable Operators:
+Install the Stackable Operators:
 [source,bash]
 ----
 include::example$getting_started/getting_started.sh[tag=helm-install-operators]
 ----
 
 Helm will deploy the operators in a Kubernetes Deployment and apply the CRDs for the SparkApplication (as well as the CRDs for the required operators).
-You are now ready to create a Spark job.
+--
+====
 
 == What's next
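
The `include::` directives above pull the actual commands from the getting-started script, which is not part of this diff. As a rough, hedged sketch of what the two installation paths typically look like (operator names, repository URL and chart names are assumptions based on the standard Stackable setup):

[source,bash]
----
# stackablectl path
stackablectl operator install commons secret listener spark-k8s

# Helm path
helm repo add stackable-stable https://repo.stackable.tech/repository/helm-stable/
helm repo update
helm install --wait commons-operator stackable-stable/commons-operator
helm install --wait secret-operator stackable-stable/secret-operator
helm install --wait listener-operator stackable-stable/listener-operator
helm install --wait spark-k8s-operator stackable-stable/spark-k8s-operator
----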

docs/modules/spark-k8s/pages/index.adoc

+2 -2

@@ -22,7 +22,7 @@ Its in-memory processing and fault-tolerant architecture make it ideal for a var
 == Getting started
 
 Follow the xref:getting_started/index.adoc[] guide to get started with Apache Spark using the Stackable operator.
-The guide will lead you through the installation of the operator and running your first Spark application on Kubernetes.
+The guide leads you through the installation of the operator and running your first Spark application on Kubernetes.
 
 == How the operator works
 
@@ -62,7 +62,7 @@ A ConfigMap supplies the necessary configuration, and there is a service to conn
 The {spark-rbac}[Spark-Kubernetes RBAC documentation] describes what is needed for `spark-submit` jobs to run successfully:
 minimally a role/cluster-role to allow the driver pod to create and manage executor pods.
 
-However, to add security each `spark-submit` job launched by the operator will be assigned its own ServiceAccount.
+However, to add security, each `spark-submit` job launched by the operator is assigned its own ServiceAccount.
 
 During the operator installation, a cluster role named `spark-k8s-clusterrole` is created with pre-defined permissions.
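
Nothing here has to be applied by hand; the operator creates the ServiceAccount and its binding automatically. The hypothetical sketch below only illustrates the shape of that per-job RBAC, with every name except `spark-k8s-clusterrole` made up for illustration.

[source,yaml]
----
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pyspark-pi-sa            # hypothetical; one ServiceAccount per spark-submit job
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pyspark-pi-rb            # hypothetical name
subjects:
  - kind: ServiceAccount
    name: pyspark-pi-sa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: spark-k8s-clusterrole    # the cluster role created during operator installation
----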

docs/modules/spark-k8s/pages/reference/commandline-parameters.adoc

+1 -1

@@ -10,7 +10,7 @@ This operator accepts the following command line parameters:
 
 *Multiple values:* false
 
-The operator will **only** watch for resources in the provided namespace `test`:
+The operator **only** watches for resources in the provided namespace `test`:
 
 [source]
 ----

docs/modules/spark-k8s/pages/reference/environment-variables.adoc

+1 -1

@@ -10,7 +10,7 @@ This operator accepts the following environment variables:
 
 *Multiple values:* false
 
-The operator will **only** watch for resources in the provided namespace `test`:
+The operator **only** watches for resources in the provided namespace `test`:
 
 [source]
 ----

docs/modules/spark-k8s/pages/usage-guide/examples.adoc

+6 -6

@@ -4,7 +4,8 @@
 The following examples have the following `spec` fields in common:
 
 * `version`: the current version is "1.0"
-* `sparkImage`: the docker image that will be used by job, driver and executor pods. This can be provided by the user.
+* `sparkImage`: the docker image that is used by job, driver and executor pods.
+This can be provided by the user.
 * `mode`: only `cluster` is currently supported
 * `mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
 * `args`: these are the arguments passed directly to the application. In the examples below it is e.g. the input path for part of the public New York taxi dataset.
@@ -22,10 +23,10 @@ Job-specific settings are annotated below.
 include::example$example-sparkapp-image.yaml[]
 ----
 
-<1> Job image: this contains the job artifact that will be retrieved from the volume mount backed by the PVC
+<1> Job image: this contains the job artifact that is retrieved from the volume mount backed by the PVC
 <2> Job python artifact (local)
 <3> Job argument (external)
-<4> List of python job requirements: these will be installed in the pods via `pip`
+<4> List of python job requirements: these are installed in the Pods via `pip`.
 <5> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)
 
 == JVM (Scala): externally located artifact and dataset
@@ -34,7 +35,6 @@ include::example$example-sparkapp-image.yaml[]
 ----
 include::example$example-sparkapp-pvc.yaml[]
 ----
-
 <1> Job artifact located on S3.
 <2> Job main class
 <3> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials)
@@ -70,5 +70,5 @@ include::example$example-sparkapp-configmap.yaml[]
 <3> Job scala artifact that requires an input argument
 <4> The volume backed by the configuration map
 <5> The expected job argument, accessed via the mounted configuration map file
-<6> The name of the volume backed by the configuration map that will be mounted to the driver/executor
-<7> The mount location of the volume (this will contain a file `/arguments/job-args.txt`)
+<6> The name of the volume backed by the configuration map that is mounted to the driver/executor
+<7> The mount location of the volume (this contains a file `/arguments/job-args.txt`)
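
Because the ConfigMap example itself is only pulled in via `include::`, the callouts <4> to <7> above can be hard to picture. The fragment below is a rough, hypothetical reconstruction of that shape; the resource names, the argument and the exact field placement are assumptions, not the contents of `example-sparkapp-configmap.yaml`.

[source,yaml]
----
spec:
  args:
    - "--input /arguments/job-args.txt"   # argument read from the mounted file
  volumes:
    - name: cm-job-arguments              # volume backed by the ConfigMap
      configMap:
        name: cm-job-arguments
  driver:
    volumeMounts:
      - name: cm-job-arguments
        mountPath: /arguments             # mount location; contains /arguments/job-args.txt
  executor:
    volumeMounts:
      - name: cm-job-arguments
        mountPath: /arguments
----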

docs/modules/spark-k8s/pages/usage-guide/history-server.adoc

+14 -11

@@ -2,10 +2,8 @@
 :description: Set up Spark History Server on Kubernetes to access Spark logs via S3, with configuration for cleanups and web UI access details.
 :page-aliases: history_server.adoc
 
-== Overview
-
 The Stackable Spark-on-Kubernetes operator runs Apache Spark workloads in a Kubernetes cluster, whereby driver- and executor-pods are created for the duration of the job and then terminated.
-One or more Spark History Server instances can be deployed independently of SparkApplication jobs and used as an end-point for spark logging, so that job information can be viewed once the job pods are no longer available.
+One or more Spark History Server instances can be deployed independently of SparkApplication jobs and used as an endpoint for Spark logging, so that job information can be viewed once the job pods are no longer available.
 
 == Deployment
 
@@ -14,25 +12,30 @@ The event logs are loaded from an S3 bucket named `spark-logs` and the folder `e
 The credentials for this bucket are provided by the secret class `s3-credentials-class`.
 For more details on how the Stackable Data Platform manages S3 resources see the xref:concepts:s3.adoc[S3 resources] page.
 
-
 [source,yaml]
 ----
 include::example$example-history-server.yaml[]
 ----
 
-<1> The location of the event logs. Must be a S3 bucket. Future implementations might add support for other shared filesystems such as HDFS.
-<2> Folder within the S3 bucket where the log files are located. This folder is required and must exist before setting up the history server.
+<1> The location of the event logs.
+Must be an S3 bucket.
+Future implementations might add support for other shared filesystems such as HDFS.
+<2> Directory within the S3 bucket where the log files are located.
+This directory is required and must exist before setting up the history server.
 <3> The S3 bucket definition, here provided in-line.
-<4> Additional history server configuration properties can be provided here as a map. For possible properties see: https://spark.apache.org/docs/latest/monitoring.html#spark-history-server-configuration-options
-<5> This deployment has only one Pod. Multiple history servers can be started, all reading the same event logs by increasing the replica count.
-<6> This history server will automatically clean up old log files by using default properties. You can change any of these by using the `sparkConf` map.
+<4> Additional history server configuration properties can be provided here as a map.
+For possible properties see: https://spark.apache.org/docs/latest/monitoring.html#spark-history-server-configuration-options
+<5> This deployment has only one Pod.
+Multiple history servers can be started, all reading the same event logs by increasing the replica count.
+<6> This history server automatically cleans up old log files by using default properties.
+Change any of these by using the `sparkConf` map.
 
 NOTE: Only one role group can have scheduled cleanups enabled (`cleaner: true`) and this role group cannot have more than 1 replica.
 
 The secret with S3 credentials must contain at least the following two keys:
 
-* `accessKey` - the access key of a user with read and write access to the event log bucket.
-* `secretKey` - the secret key of a user with read and write access to the event log bucket.
+* `accessKey` -- the access key of a user with read and write access to the event log bucket.
+* `secretKey` -- the secret key of a user with read and write access to the event log bucket.
 
 Any other entries of the Secret are ignored by the operator.
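
A Secret that satisfies this requirement could look like the sketch below. The Secret name and key values are placeholders, and the `secrets.stackable.tech/class` label tying it to `s3-credentials-class` is an assumption based on how Stackable secret classes are commonly referenced.

[source,yaml]
----
---
apiVersion: v1
kind: Secret
metadata:
  name: history-credentials            # placeholder name
  labels:
    secrets.stackable.tech/class: s3-credentials-class   # assumed label linking the Secret to the secret class
stringData:
  accessKey: YOUR_ACCESS_KEY           # read/write access to the spark-logs bucket
  secretKey: YOUR_SECRET_KEY
----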

docs/modules/spark-k8s/pages/usage-guide/listenerclass.adoc

+5 -2

@@ -1,8 +1,11 @@
 = Service exposition with ListenerClasses
 
-The Spark Operator deploys SparkApplications, and does not offer a UI or other API, so no services are exposed. However, the Operator can also deploy HistoryServers, which do offer a UI and API. The Operator deploys a service called `<name>-historyserver` (where `<name>` is the name of the HistoryServer) through which HistoryServer can be reached.
+The Spark operator deploys SparkApplications, and does not offer a UI or other API, so no services are exposed.
+However, the operator can also deploy HistoryServers, which do offer a UI and API.
+The operator deploys a service called `<name>-historyserver` (where `<name>` is the name of the spark application) through which the HistoryServer can be reached.
 
-This service can have three different types: `cluster-internal`, `external-unstable` and `external-stable`. Read more about the types in the xref:concepts:service-exposition.adoc[service exposition] documentation at platform level.
+This service can have three different types: `cluster-internal`, `external-unstable` and `external-stable`.
+Read more about the types in the xref:concepts:service-exposition.adoc[service exposition] documentation at platform level.
 
 This is how the ListenerClass is configured:
 
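
The snippet referred to by "This is how the ListenerClass is configured:" is not included in this diff. As a very rough sketch only, and assuming the history server exposes the setting as a `listenerClass` field (the exact nesting in the spec may differ), the configuration could look like:

[source,yaml]
----
spec:
  clusterConfig:
    listenerClass: external-unstable   # assumed field placement; one of the three types above
----
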
docs/modules/spark-k8s/pages/usage-guide/logging.adoc

+6 -4

@@ -1,6 +1,8 @@
 = Logging
 
-The Spark operator installs a https://vector.dev/docs/setup/deployment/roles/#agent[vector agent] as a side-car container in every application Pod except the `job` Pod that runs `spark-submit`. It also configures the logging framework to output logs in XML format. This is the same https://logging.apache.org/log4j/2.x/manual/layouts.html#XMLLayout[format] used across all Stackable products and it enables the https://vector.dev/docs/setup/deployment/roles/#aggregator[vector aggregator] to collect logs across the entire platform.
+The Spark operator installs a https://vector.dev/docs/setup/deployment/roles/#agent[vector agent] as a side-car container in every application Pod except the `job` Pod that runs `spark-submit`.
+It also configures the logging framework to output logs in XML format.
+This is the same https://logging.apache.org/log4j/2.x/manual/layouts.html#XMLLayout[format] used across all Stackable products, and it enables the https://vector.dev/docs/setup/deployment/roles/#aggregator[vector aggregator] to collect logs across the entire platform.
 
 It is the user's responsibility to install and configure the vector aggregator, but the agents can discover the aggregator automatically using a discovery ConfigMap as described in the xref:concepts:logging.adoc[logging concepts].
 
@@ -35,12 +37,12 @@ spec:
 level: INFO
 ...
 ----
-<1> Name of a ConfigMap that referenced the vector aggregator. See example below.
+<1> Name of a ConfigMap that references the vector aggregator.
+See example below.
 <2> Enable the vector agent in the history pod.
 <3> Configure log levels for file and console outputs.
 
-Example vector aggregator configuration.
-
+.Example vector aggregator configuration
 [source,yaml]
 ----
 ---
@@ -1,7 +1,10 @@
1-
= Spark Applications
1+
= Spark applications
22

3-
Spark applications are submitted to the Spark Operator as SparkApplication resources. These resources are used to define the configuration of the Spark job, including the image to use, the main application file, and the number of executors to start.
3+
Spark applications are submitted to the Spark Operator as SparkApplication resources.
4+
These resources are used to define the configuration of the Spark job, including the image to use, the main application file, and the number of executors to start.
45

5-
Upon creation, the application's status set to `Unknown`. As the operator creates the necessary resources, the status of the application transitions through different phases that reflect the phase of the driver Pod. A successful application will eventually reach the `Succeeded` phase.
6+
Upon creation, the application's status set to `Unknown`.
7+
As the operator creates the necessary resources, the status of the application transitions through different phases that reflect the phase of the driver Pod. A successful application eventually reaches the `Succeeded` phase.
68

7-
NOTE: The operator will never reconcile an application once it has been created. To resubmit an application, a new SparkApplication resource must be created.
9+
NOTE: The operator never reconciles an application once it has been created.
10+
To resubmit an application, a new SparkApplication resource must be created.
