
Commit e3d877f

fhennig and maltesander authored
docs: wording improvements, better install instructions (#470)
* ~
* fix whitespace
* Apply suggestions from code review

---------

Co-authored-by: Malte Sander <[email protected]>
1 parent c94300f commit e3d877f

13 files changed: +85 -63 lines changed

docs/modules/spark-k8s/pages/getting_started/first_steps.adoc

+8 -7

@@ -8,7 +8,7 @@ Afterwards you can <<_verify_that_it_works, verify that it works>> by looking at
 
 A Spark application is made up of three components:
 
-* Job: this will build a `spark-submit` command from the resource, passing this to internal spark code together with templates for building the driver and executor pods
+* Job: this builds a `spark-submit` command from the resource, passing it to internal Spark code together with templates for building the driver and executor pods
 * Driver: the driver starts the designated number of executors and removes them when the job is completed.
 * Executor(s): responsible for executing the job itself
 
@@ -25,20 +25,21 @@ Where:
 * `spec.version`: SparkApplication version (1.0). This can be freely set by the user and is added by the operator as a label to all workload resources created by the application.
 * `spec.sparkImage`: the image used by the job, driver and executor pods. This can be a custom image built by the user or an official Stackable image. Available official images are listed in the Stackable https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%spark-k8s%2Ftags[image registry].
 * `spec.mode`: only `cluster` is currently supported
-* `spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job. This path is relative to the image, so in this case we are running an example python script (that calculates the value of pi): it is bundled with the Spark code and therefore already present in the job image
+* `spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
+This path is relative to the image, so in this case an example Python script (which calculates the value of pi) is run: it is bundled with the Spark code and therefore already present in the job image
 * `spec.driver`: driver-specific settings.
 * `spec.executor`: executor-specific settings.
 
 == Verify that it works
 
-As mentioned above, the SparkApplication that has just been created will build a `spark-submit` command and pass it to the driver Pod, which in turn will create executor Pods that run for the duration of the job before being clean up.
-A running process will look like this:
+As mentioned above, the SparkApplication that has just been created builds a `spark-submit` command and passes it to the driver Pod, which in turn creates executor Pods that run for the duration of the job before being cleaned up.
+A running process looks like this:
 
 image::getting_started/spark_running.png[Spark job]
 
 * `pyspark-pi-xxxx`: this is the initializing job that creates the spark-submit command (named as `metadata.name` with a unique suffix)
 * `pyspark-pi-xxxxxxx-driver`: the driver pod that drives the execution
-* `pythonpi-xxxxxxxxx-exec-x`: the set of executors started by the driver (in our example `spec.executor.instances` was set to 3 which is why we have 3 executors)
+* `pythonpi-xxxxxxxxx-exec-x`: the set of executors started by the driver (in the example `spec.executor.instances` was set to 3, which is why 3 executors are running)
 
 Job progress can be followed by issuing this command:
 
@@ -48,11 +49,11 @@ include::example$getting_started/getting_started.sh[tag=wait-for-job]
 
 When the job completes, the driver cleans up the executor.
 The initial job is persisted for several minutes before being removed.
-The completed state will look like this:
+The completed state looks like this:
 
 image::getting_started/spark_complete.png[Completed job]
 
 The driver logs can be inspected for more information about the results of the job.
-In this case we expect to find the results of our (approximate!) pi calculation:
+In this case the result of our (approximate!) pi calculation can be found:
 
 image::getting_started/spark_log.png[Driver log]
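
For orientation, a minimal SparkApplication manifest covering the fields listed above might look like the sketch below. Only the field names (`spec.version`, `spec.sparkImage`, `spec.mode`, `spec.mainApplicationFile`, `spec.executor.instances`) come from the documentation; the API version, image tag and script path are illustrative assumptions, not content of this commit.

[source,yaml]
----
---
apiVersion: spark.stackable.tech/v1alpha1   # assumed API group/version of the SparkApplication CRD
kind: SparkApplication
metadata:
  name: pyspark-pi
  namespace: default
spec:
  version: "1.0"                            # freely chosen, added as a label to all workload resources
  sparkImage: docker.stackable.tech/stackable/spark-k8s:<version>   # placeholder; pick an official tag from the image registry
  mode: cluster
  # Path inside the image; the bundled pi example is assumed to live here
  mainApplicationFile: local:///stackable/spark/examples/src/main/python/pi.py
  executor:
    instances: 3                            # three executors, as in the screenshots above
----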

docs/modules/spark-k8s/pages/getting_started/index.adoc

+3 -3

@@ -1,11 +1,11 @@
 = Getting started
 
-This guide will get you started with Spark using the Stackable Operator for Apache Spark.
-It will guide you through the installation of the Operator and its dependencies, executing your first Spark job and reviewing its result.
+This guide gets you started with Spark using the Stackable operator for Apache Spark.
+It guides you through the installation of the operator and its dependencies, executing your first Spark job and reviewing its result.
 
 == Prerequisites
 
-You will need:
+You need:
 
 * a Kubernetes cluster
 * kubectl
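
If no cluster is at hand, a local throwaway cluster is one way to satisfy these prerequisites; the sketch below assumes https://kind.sigs.k8s.io/[kind] (the tool also referenced on the installation page) and an arbitrary cluster name.

[source,bash]
----
# Create a local test cluster (assumes kind and a container runtime are installed)
kind create cluster --name spark-getting-started

# Confirm that kubectl can reach the new cluster
kubectl cluster-info
----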

docs/modules/spark-k8s/pages/getting_started/installation.adoc

+19 -15

@@ -1,7 +1,7 @@
 = Installation
 :description: Learn how to set up Spark with the Stackable Operator, from installation to running your first job, including prerequisites and resource recommendations.
 
-On this page you will install the Stackable Spark-on-Kubernetes operator as well as the commons, secret and listener operators
+Install the Stackable Spark operator as well as the commons, secret and listener operators
 which are required by all Stackable operators.
 
 == Dependencies
@@ -18,24 +18,26 @@ More information about the different ways to define Spark jobs and their depende
 
 == Stackable Operators
 
-There are 2 ways to install Stackable operators
+There are multiple ways to install the Stackable Operator for Apache Spark.
+xref:management:stackablectl:index.adoc[] is the preferred way, but Helm is also supported.
+OpenShift users may prefer installing the operator from the Red Hat Certified Operator catalog using the OpenShift web console.
 
-. Using xref:management:stackablectl:index.adoc[]
-. Using a Helm chart
-
-=== stackablectl
-
-`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install Operators.
+[tabs]
+====
+stackablectl::
++
+--
+`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install operators.
 Follow the xref:management:stackablectl:installation.adoc[installation steps] for your platform.
 
-After you have installed `stackablectl` run the following command to install the Spark-k8s operator:
+After you have installed `stackablectl`, run the following command to install the Spark operator:
 
 [source,bash]
 ----
 include::example$getting_started/getting_started.sh[tag=stackablectl-install-operators]
 ----
 
-The tool will show
+The tool shows
 
 [source]
 ----
@@ -44,24 +46,26 @@ include::example$getting_started/install_output.txt[]
 
 TIP: Consult the xref:management:stackablectl:quickstart.adoc[] to learn more about how to use stackablectl.
 For example, you can use the `--cluster kind` flag to create a Kubernetes cluster with link:https://kind.sigs.k8s.io/[kind].
+--
 
-=== Helm
-
-You can also use Helm to install the operator.
+Helm::
++
+--
 Add the Stackable Helm repository:
 [source,bash]
 ----
 include::example$getting_started/getting_started.sh[tag=helm-add-repo]
 ----
 
-Then install the Stackable Operators:
+Install the Stackable Operators:
 [source,bash]
 ----
 include::example$getting_started/getting_started.sh[tag=helm-install-operators]
 ----
 
 Helm will deploy the operators in a Kubernetes Deployment and apply the CRDs for the SparkApplication (as well as the CRDs for the required operators).
-You are now ready to create a Spark job.
+--
+====
 
 == What's next
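
The `include::` directives above pull the actual commands from the getting-started script, which is not part of this diff. As a rough, hedged sketch of what the two installation paths typically look like (operator names, repository URL and chart names are assumptions based on the standard Stackable setup):

[source,bash]
----
# stackablectl path
stackablectl operator install commons secret listener spark-k8s

# Helm path
helm repo add stackable-stable https://repo.stackable.tech/repository/helm-stable/
helm repo update
helm install --wait commons-operator stackable-stable/commons-operator
helm install --wait secret-operator stackable-stable/secret-operator
helm install --wait listener-operator stackable-stable/listener-operator
helm install --wait spark-k8s-operator stackable-stable/spark-k8s-operator
----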

docs/modules/spark-k8s/pages/index.adoc

+2 -2

@@ -22,7 +22,7 @@ Its in-memory processing and fault-tolerant architecture make it ideal for a var
 == Getting started
 
 Follow the xref:getting_started/index.adoc[] guide to get started with Apache Spark using the Stackable operator.
-The guide will lead you through the installation of the operator and running your first Spark application on Kubernetes.
+The guide leads you through the installation of the operator and running your first Spark application on Kubernetes.
 
 == How the operator works
 
@@ -62,7 +62,7 @@ A ConfigMap supplies the necessary configuration, and there is a service to conn
 The {spark-rbac}[Spark-Kubernetes RBAC documentation] describes what is needed for `spark-submit` jobs to run successfully:
 minimally a role/cluster-role to allow the driver pod to create and manage executor pods.
 
-However, to add security each `spark-submit` job launched by the operator will be assigned its own ServiceAccount.
+However, to add security, each `spark-submit` job launched by the operator is assigned its own ServiceAccount.
 
 During the operator installation, a cluster role named `spark-k8s-clusterrole` is created with pre-defined permissions.
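
Nothing here has to be applied by hand; the operator creates the ServiceAccount and its binding automatically. The hypothetical sketch below only illustrates the shape of that per-job RBAC, with every name except `spark-k8s-clusterrole` made up for illustration.

[source,yaml]
----
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pyspark-pi-sa            # hypothetical; one ServiceAccount per spark-submit job
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pyspark-pi-rb            # hypothetical name
subjects:
  - kind: ServiceAccount
    name: pyspark-pi-sa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: spark-k8s-clusterrole    # the cluster role created during operator installation
----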

docs/modules/spark-k8s/pages/reference/commandline-parameters.adoc

+1 -1

@@ -10,7 +10,7 @@ This operator accepts the following command line parameters:
 
 *Multiple values:* false
 
-The operator will **only** watch for resources in the provided namespace `test`:
+The operator **only** watches for resources in the provided namespace `test`:
 
 [source]
 ----

docs/modules/spark-k8s/pages/reference/environment-variables.adoc

+1 -1

@@ -10,7 +10,7 @@ This operator accepts the following environment variables:
 
 *Multiple values:* false
 
-The operator will **only** watch for resources in the provided namespace `test`:
+The operator **only** watches for resources in the provided namespace `test`:
 
 [source]
 ----

docs/modules/spark-k8s/pages/usage-guide/examples.adoc

+6 -6

@@ -4,7 +4,8 @@
 The following examples have the following `spec` fields in common:
 
 * `version`: the current version is "1.0"
-* `sparkImage`: the docker image that will be used by job, driver and executor pods. This can be provided by the user.
+* `sparkImage`: the docker image that is used by job, driver and executor pods.
+This can be provided by the user.
 * `mode`: only `cluster` is currently supported
 * `mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
 * `args`: these are the arguments passed directly to the application. In the examples below it is e.g. the input path for part of the public New York taxi dataset.
@@ -22,10 +23,10 @@ Job-specific settings are annotated below.
 include::example$example-sparkapp-image.yaml[]
 ----
 
-<1> Job image: this contains the job artifact that will be retrieved from the volume mount backed by the PVC
+<1> Job image: this contains the job artifact that is retrieved from the volume mount backed by the PVC
 <2> Job python artifact (local)
 <3> Job argument (external)
-<4> List of python job requirements: these will be installed in the pods via `pip`
+<4> List of python job requirements: these are installed in the Pods via `pip`.
 <5> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)
 
 == JVM (Scala): externally located artifact and dataset
@@ -34,7 +35,6 @@ include::example$example-sparkapp-image.yaml[]
 ----
 include::example$example-sparkapp-pvc.yaml[]
 ----
-
 <1> Job artifact located on S3.
 <2> Job main class
 <3> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials)
@@ -70,5 +70,5 @@ include::example$example-sparkapp-configmap.yaml[]
 <3> Job scala artifact that requires an input argument
 <4> The volume backed by the configuration map
 <5> The expected job argument, accessed via the mounted configuration map file
-<6> The name of the volume backed by the configuration map that will be mounted to the driver/executor
-<7> The mount location of the volume (this will contain a file `/arguments/job-args.txt`)
+<6> The name of the volume backed by the configuration map that is mounted to the driver/executor
+<7> The mount location of the volume (this contains a file `/arguments/job-args.txt`)
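
Because the ConfigMap example itself is only pulled in via `include::`, the callouts <4> to <7> above can be hard to picture. The fragment below is a rough, hypothetical reconstruction of that shape; the resource names, the argument and the exact field placement are assumptions, not the contents of `example-sparkapp-configmap.yaml`.

[source,yaml]
----
spec:
  args:
    - "--input /arguments/job-args.txt"   # argument read from the mounted file
  volumes:
    - name: cm-job-arguments              # volume backed by the ConfigMap
      configMap:
        name: cm-job-arguments
  driver:
    volumeMounts:
      - name: cm-job-arguments
        mountPath: /arguments             # mount location; contains /arguments/job-args.txt
  executor:
    volumeMounts:
      - name: cm-job-arguments
        mountPath: /arguments
----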

docs/modules/spark-k8s/pages/usage-guide/history-server.adoc

+14 -11

@@ -2,10 +2,8 @@
 :description: Set up Spark History Server on Kubernetes to access Spark logs via S3, with configuration for cleanups and web UI access details.
 :page-aliases: history_server.adoc
 
-== Overview
-
 The Stackable Spark-on-Kubernetes operator runs Apache Spark workloads in a Kubernetes cluster, whereby driver- and executor-pods are created for the duration of the job and then terminated.
-One or more Spark History Server instances can be deployed independently of SparkApplication jobs and used as an end-point for spark logging, so that job information can be viewed once the job pods are no longer available.
+One or more Spark History Server instances can be deployed independently of SparkApplication jobs and used as an endpoint for Spark logging, so that job information can be viewed once the job pods are no longer available.
 
 == Deployment
 
@@ -14,25 +12,30 @@ The event logs are loaded from an S3 bucket named `spark-logs` and the folder `e
 The credentials for this bucket are provided by the secret class `s3-credentials-class`.
 For more details on how the Stackable Data Platform manages S3 resources see the xref:concepts:s3.adoc[S3 resources] page.
 
-
 [source,yaml]
 ----
 include::example$example-history-server.yaml[]
 ----
 
-<1> The location of the event logs. Must be a S3 bucket. Future implementations might add support for other shared filesystems such as HDFS.
-<2> Folder within the S3 bucket where the log files are located. This folder is required and must exist before setting up the history server.
+<1> The location of the event logs.
+Must be an S3 bucket.
+Future implementations might add support for other shared filesystems such as HDFS.
+<2> Directory within the S3 bucket where the log files are located.
+This directory is required and must exist before setting up the history server.
 <3> The S3 bucket definition, here provided in-line.
-<4> Additional history server configuration properties can be provided here as a map. For possible properties see: https://spark.apache.org/docs/latest/monitoring.html#spark-history-server-configuration-options
-<5> This deployment has only one Pod. Multiple history servers can be started, all reading the same event logs by increasing the replica count.
-<6> This history server will automatically clean up old log files by using default properties. You can change any of these by using the `sparkConf` map.
+<4> Additional history server configuration properties can be provided here as a map.
+For possible properties see: https://spark.apache.org/docs/latest/monitoring.html#spark-history-server-configuration-options
+<5> This deployment has only one Pod.
+Multiple history servers can be started, all reading the same event logs by increasing the replica count.
+<6> This history server automatically cleans up old log files by using default properties.
+Change any of these by using the `sparkConf` map.
 
 NOTE: Only one role group can have scheduled cleanups enabled (`cleaner: true`) and this role group cannot have more than 1 replica.
 
 The secret with S3 credentials must contain at least the following two keys:
 
-* `accessKey` - the access key of a user with read and write access to the event log bucket.
-* `secretKey` - the secret key of a user with read and write access to the event log bucket.
+* `accessKey` -- the access key of a user with read and write access to the event log bucket.
+* `secretKey` -- the secret key of a user with read and write access to the event log bucket.
 
 Any other entries of the Secret are ignored by the operator.
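
A Secret that satisfies this requirement could look like the sketch below. The Secret name and key values are placeholders, and the `secrets.stackable.tech/class` label tying it to `s3-credentials-class` is an assumption based on how Stackable secret classes are commonly referenced.

[source,yaml]
----
---
apiVersion: v1
kind: Secret
metadata:
  name: history-credentials            # placeholder name
  labels:
    secrets.stackable.tech/class: s3-credentials-class   # assumed label linking the Secret to the secret class
stringData:
  accessKey: YOUR_ACCESS_KEY           # read/write access to the spark-logs bucket
  secretKey: YOUR_SECRET_KEY
----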

docs/modules/spark-k8s/pages/usage-guide/listenerclass.adoc

+5 -2

@@ -1,8 +1,11 @@
 = Service exposition with ListenerClasses
 
-The Spark Operator deploys SparkApplications, and does not offer a UI or other API, so no services are exposed. However, the Operator can also deploy HistoryServers, which do offer a UI and API. The Operator deploys a service called `<name>-historyserver` (where `<name>` is the name of the HistoryServer) through which HistoryServer can be reached.
+The Spark operator deploys SparkApplications, and does not offer a UI or other API, so no services are exposed.
+However, the operator can also deploy HistoryServers, which do offer a UI and API.
+The operator deploys a service called `<name>-historyserver` (where `<name>` is the name of the spark application) through which the HistoryServer can be reached.
 
-This service can have three different types: `cluster-internal`, `external-unstable` and `external-stable`. Read more about the types in the xref:concepts:service-exposition.adoc[service exposition] documentation at platform level.
+This service can have three different types: `cluster-internal`, `external-unstable` and `external-stable`.
+Read more about the types in the xref:concepts:service-exposition.adoc[service exposition] documentation at platform level.
 
 This is how the ListenerClass is configured:
 
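
The snippet referred to by "This is how the ListenerClass is configured:" is not included in this diff. As a very rough sketch only, and assuming the history server exposes the setting as a `listenerClass` field (the exact nesting in the spec may differ), the configuration could look like:

[source,yaml]
----
spec:
  clusterConfig:
    listenerClass: external-unstable   # assumed field placement; one of the three types above
----
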
docs/modules/spark-k8s/pages/usage-guide/logging.adoc

+6 -4

@@ -1,6 +1,8 @@
 = Logging
 
-The Spark operator installs a https://vector.dev/docs/setup/deployment/roles/#agent[vector agent] as a side-car container in every application Pod except the `job` Pod that runs `spark-submit`. It also configures the logging framework to output logs in XML format. This is the same https://logging.apache.org/log4j/2.x/manual/layouts.html#XMLLayout[format] used across all Stackable products and it enables the https://vector.dev/docs/setup/deployment/roles/#aggregator[vector aggregator] to collect logs across the entire platform.
+The Spark operator installs a https://vector.dev/docs/setup/deployment/roles/#agent[vector agent] as a side-car container in every application Pod except the `job` Pod that runs `spark-submit`.
+It also configures the logging framework to output logs in XML format.
+This is the same https://logging.apache.org/log4j/2.x/manual/layouts.html#XMLLayout[format] used across all Stackable products, and it enables the https://vector.dev/docs/setup/deployment/roles/#aggregator[vector aggregator] to collect logs across the entire platform.
 
 It is the user's responsibility to install and configure the vector aggregator, but the agents can discover the aggregator automatically using a discovery ConfigMap as described in the xref:concepts:logging.adoc[logging concepts].
 
@@ -35,12 +37,12 @@ spec:
 level: INFO
 ...
 ----
-<1> Name of a ConfigMap that referenced the vector aggregator. See example below.
+<1> Name of a ConfigMap that references the vector aggregator.
+See example below.
 <2> Enable the vector agent in the history pod.
 <3> Configure log levels for file and console outputs.
 
-Example vector aggregator configuration.
-
+.Example vector aggregator configuration
 [source,yaml]
 ----
 ---
@@ -1,7 +1,10 @@
1-
= Spark Applications
1+
= Spark applications
22

3-
Spark applications are submitted to the Spark Operator as SparkApplication resources. These resources are used to define the configuration of the Spark job, including the image to use, the main application file, and the number of executors to start.
3+
Spark applications are submitted to the Spark Operator as SparkApplication resources.
4+
These resources are used to define the configuration of the Spark job, including the image to use, the main application file, and the number of executors to start.
45

5-
Upon creation, the application's status set to `Unknown`. As the operator creates the necessary resources, the status of the application transitions through different phases that reflect the phase of the driver Pod. A successful application will eventually reach the `Succeeded` phase.
6+
Upon creation, the application's status set to `Unknown`.
7+
As the operator creates the necessary resources, the status of the application transitions through different phases that reflect the phase of the driver Pod. A successful application eventually reaches the `Succeeded` phase.
68

7-
NOTE: The operator will never reconcile an application once it has been created. To resubmit an application, a new SparkApplication resource must be created.
9+
NOTE: The operator never reconciles an application once it has been created.
10+
To resubmit an application, a new SparkApplication resource must be created.
