docs/modules/spark-k8s/pages/getting_started/first_steps.adoc
@@ -8,7 +8,7 @@ Afterwards you can <<_verify_that_it_works, verify that it works>> by looking at
A Spark application is made up of three components:

-* Job: this will build a `spark-submit` command from the resource, passing this to internal spark code together with templates for building the driver and executor pods
+* Job: this builds a `spark-submit` command from the resource, passing it to internal Spark code together with templates for building the driver and executor Pods
* Driver: the driver starts the designated number of executors and removes them when the job is completed.
* Executor(s): responsible for executing the job itself
@@ -25,20 +25,21 @@ Where:
* `spec.version`: SparkApplication version (1.0). This can be freely set by the users and is added by the operator as a label to all workload resources created by the application.
* `spec.sparkImage`: the image used by the job, driver and executor pods. This can be a custom image built by the user or an official Stackable image. Available official images are listed in the Stackable https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%2Fspark-k8s%2Ftags[image registry].
* `spec.mode`: only `cluster` is currently supported
-* `spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job. This path is relative to the image, so in this case we are running an example python script (that calculates the value of pi): it is bundled with the Spark code and therefore already present in the job image
+* `spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
+This path is relative to the image, so in this case an example Python script (which calculates the value of pi) is run: it is bundled with the Spark code and therefore already present in the job image
* `spec.driver`: driver-specific settings.
* `spec.executor`: executor-specific settings.
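Putting the fields above together, a minimal SparkApplication could look like the following sketch. The image tag and the script path are illustrative assumptions only (the page states the pi example is bundled in the job image, but does not give its path); the executor count of 3 matches the example discussed below.

```yaml
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-pi
spec:
  version: "1.0"
  # illustrative image reference; use a custom or official Stackable image
  sparkImage: docker.stackable.tech/stackable/spark-k8s:3.5.0-stackable0.0.0-dev
  mode: cluster
  # assumed path to the bundled pi example script inside the image
  mainApplicationFile: local:///stackable/spark/examples/src/main/python/pi.py
  driver: {}
  executor:
    instances: 3
```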
== Verify that it works
-As mentioned above, the SparkApplication that has just been created will build a `spark-submit` command and pass it to the driver Pod, which in turn will create executor Pods that run for the duration of the job before being clean up.
-A running process will look like this:
+As mentioned above, the SparkApplication that has just been created builds a `spark-submit` command and passes it to the driver Pod, which in turn creates executor Pods that run for the duration of the job before being cleaned up.
+A running process looks like this:
* `pyspark-pi-xxxx`: this is the initializing job that creates the spark-submit command (named as `metadata.name` with a unique suffix)
* `pyspark-pi-xxxxxxx-driver`: the driver pod that drives the execution
-* `pythonpi-xxxxxxxxx-exec-x`: the set of executors started by the driver (in our example `spec.executor.instances` was set to 3 which is why we have 3 executors)
+* `pythonpi-xxxxxxxxx-exec-x`: the set of executors started by the driver (in the example `spec.executor.instances` was set to 3, which is why 3 executors are running)
Job progress can be followed by issuing this command:
docs/modules/spark-k8s/pages/getting_started/installation.adoc
@@ -1,7 +1,7 @@
= Installation
:description: Learn how to set up Spark with the Stackable Operator, from installation to running your first job, including prerequisites and resource recommendations.

-On this page you will install the Stackable Spark-on-Kubernetes operator as well as the commons, secret and listener operators
+Install the Stackable Spark operator as well as the commons, secret and listener operators
which are required by all Stackable operators.
== Dependencies
@@ -18,24 +18,26 @@ More information about the different ways to define Spark jobs and their depende
== Stackable Operators
-There are 2 ways to install Stackable operators
+There are multiple ways to install the Stackable Operator for Apache Spark.
+xref:management:stackablectl:index.adoc[] is the preferred way, but Helm is also supported.
+OpenShift users may prefer installing the operator from the RedHat Certified Operator catalog using the OpenShift web console.

-. Using xref:management:stackablectl:index.adoc[]
-. Using a Helm chart
-
-=== stackablectl
-
-`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install Operators.
+[tabs]
+====
+stackablectl::
++
+--
+`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install operators.
Follow the xref:management:stackablectl:installation.adoc[installation steps] for your platform.

-After you have installed `stackablectl` run the following command to install the Spark-k8s operator:
+After you have installed `stackablectl`, run the following command to install the Spark operator:
Helm will deploy the operators in a Kubernetes Deployment and apply the CRDs for the SparkApplication (as well as the CRDs for the required operators).
docs/modules/spark-k8s/pages/usage-guide/examples.adoc
@@ -4,7 +4,8 @@
The following examples have these `spec` fields in common:

* `version`: the current version is "1.0"
-* `sparkImage`: the docker image that will be used by job, driver and executor pods. This can be provided by the user.
+* `sparkImage`: the docker image that is used by job, driver and executor pods.
+This can be provided by the user.
* `mode`: only `cluster` is currently supported
* `mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
* `args`: these are the arguments passed directly to the application. In the examples below it is e.g. the input path for part of the public New York taxi dataset.
@@ -22,10 +23,10 @@ Job-specific settings are annotated below.
include::example$example-sparkapp-image.yaml[]
----

-<1> Job image: this contains the job artifact that will be retrieved from the volume mount backed by the PVC
+<1> Job image: this contains the job artifact that is retrieved from the volume mount backed by the PVC
<2> Job python artifact (local)
<3> Job argument (external)
-<4> List of python job requirements: these will be installed in the pods via `pip`
+<4> List of python job requirements: these are installed in the Pods via `pip`.
<5> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)
== JVM (Scala): externally located artifact and dataset
<3> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials)
docs/modules/spark-k8s/pages/usage-guide/history-server.adoc
@@ -2,10 +2,8 @@
:description: Set up Spark History Server on Kubernetes to access Spark logs via S3, with configuration for cleanups and web UI access details.
:page-aliases: history_server.adoc

-== Overview
-
The Stackable Spark-on-Kubernetes operator runs Apache Spark workloads in a Kubernetes cluster, whereby driver- and executor-pods are created for the duration of the job and then terminated.
-One or more Spark History Server instances can be deployed independently of SparkApplication jobs and used as an end-point for spark logging, so that job information can be viewed once the job pods are no longer available.
+One or more Spark History Server instances can be deployed independently of SparkApplication jobs and used as an endpoint for Spark logging, so that job information can be viewed once the job pods are no longer available.
== Deployment
@@ -14,25 +12,30 @@ The event logs are loaded from an S3 bucket named `spark-logs` and the folder `e
The credentials for this bucket are provided by the secret class `s3-credentials-class`.
For more details on how the Stackable Data Platform manages S3 resources see the xref:concepts:s3.adoc[S3 resources] page.

-
[source,yaml]
----
include::example$example-history-server.yaml[]
----

-<1> The location of the event logs. Must be a S3 bucket. Future implementations might add support for other shared filesystems such as HDFS.
-<2> Folder within the S3 bucket where the log files are located. This folder is required and must exist before setting up the history server.
+<1> The location of the event logs.
+Must be an S3 bucket.
+Future implementations might add support for other shared filesystems such as HDFS.
+<2> Directory within the S3 bucket where the log files are located.
+This directory is required and must exist before setting up the history server.
<3> The S3 bucket definition, here provided in-line.
-<4> Additional history server configuration properties can be provided here as a map. For possible properties see: https://spark.apache.org/docs/latest/monitoring.html#spark-history-server-configuration-options
-<5> This deployment has only one Pod. Multiple history servers can be started, all reading the same event logs by increasing the replica count.
-<6> This history server will automatically clean up old log files by using default properties. You can change any of these by using the `sparkConf` map.
+<4> Additional history server configuration properties can be provided here as a map.
+For possible properties see: https://spark.apache.org/docs/latest/monitoring.html#spark-history-server-configuration-options
+<5> This deployment has only one Pod.
+Multiple history servers, all reading the same event logs, can be started by increasing the replica count.
+<6> This history server automatically cleans up old log files by using default properties.
+Change any of these by using the `sparkConf` map.
NOTE: Only one role group can have scheduled cleanups enabled (`cleaner: true`) and this role group cannot have more than 1 replica.
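As one illustration of callout <6>, the cleanup defaults could be overridden through the `sparkConf` map. The property names are standard Spark history server options (see the monitoring page linked above); the values here are examples only:

```yaml
sparkConf:
  spark.history.fs.cleaner.enabled: "true"
  spark.history.fs.cleaner.maxAge: "7d"    # example: delete event logs older than 7 days
  spark.history.fs.cleaner.interval: "1d"  # example: run the cleaner once a day
```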
The secret with S3 credentials must contain at least the following two keys:
-* `accessKey` - the access key of a user with read and write access to the event log bucket.
-* `secretKey` - the secret key of a user with read and write access to the event log bucket.
+* `accessKey` -- the access key of a user with read and write access to the event log bucket.
+* `secretKey` -- the secret key of a user with read and write access to the event log bucket.
Any other entries of the Secret are ignored by the operator.
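A matching Secret could look like the following sketch. The Secret name and the credential values are placeholders; the `secrets.stackable.tech/class` label, which ties the Secret to the secret class referenced above, follows the usual Stackable secret-operator convention:

```yaml
---
apiVersion: v1
kind: Secret
metadata:
  name: history-credentials  # placeholder name
  labels:
    secrets.stackable.tech/class: s3-credentials-class  # match the referenced secret class
stringData:
  accessKey: minio-user  # placeholder
  secretKey: minio-pass  # placeholder
```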
docs/modules/spark-k8s/pages/usage-guide/listenerclass.adoc
@@ -1,8 +1,11 @@
= Service exposition with ListenerClasses

-The Spark Operator deploys SparkApplications, and does not offer a UI or other API, so no services are exposed. However, the Operator can also deploy HistoryServers, which do offer a UI and API. The Operator deploys a service called `<name>-historyserver` (where `<name>` is the name of the HistoryServer) through which HistoryServer can be reached.
+The Spark operator deploys SparkApplications, and does not offer a UI or other API, so no services are exposed.
+However, the operator can also deploy HistoryServers, which do offer a UI and API.
+The operator deploys a service called `<name>-historyserver` (where `<name>` is the name of the HistoryServer) through which the HistoryServer can be reached.

-This service can have three different types: `cluster-internal`, `external-unstable` and `external-stable`. Read more about the types in the xref:concepts:service-exposition.adoc[service exposition] documentation at platform level.
+This service can have three different types: `cluster-internal`, `external-unstable` and `external-stable`.
+Read more about the types in the xref:concepts:service-exposition.adoc[service exposition] documentation at platform level.
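As a rough sketch, the service type is typically selected through a `listenerClass` field on the history server's `nodes` role. The exact field placement below is an assumption, not taken from this page; verify it against the SparkHistoryServer CRD reference before use:

```yaml
# hypothetical placement of listenerClass; check the CRD reference
spec:
  nodes:
    config:
      listenerClass: external-unstable  # one of: cluster-internal, external-unstable, external-stable
```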
docs/modules/spark-k8s/pages/usage-guide/logging.adoc
@@ -1,6 +1,8 @@
= Logging

-The Spark operator installs a https://vector.dev/docs/setup/deployment/roles/#agent[vector agent] as a side-car container in every application Pod except the `job` Pod that runs `spark-submit`. It also configures the logging framework to output logs in XML format. This is the same https://logging.apache.org/log4j/2.x/manual/layouts.html#XMLLayout[format] used across all Stackable products and it enables the https://vector.dev/docs/setup/deployment/roles/#aggregator[vector aggregator] to collect logs across the entire platform.
+The Spark operator installs a https://vector.dev/docs/setup/deployment/roles/#agent[vector agent] as a side-car container in every application Pod except the `job` Pod that runs `spark-submit`.
+It also configures the logging framework to output logs in XML format.
+This is the same https://logging.apache.org/log4j/2.x/manual/layouts.html#XMLLayout[format] used across all Stackable products, and it enables the https://vector.dev/docs/setup/deployment/roles/#aggregator[vector aggregator] to collect logs across the entire platform.

It is the user's responsibility to install and configure the vector aggregator, but the agents can discover the aggregator automatically using a discovery ConfigMap as described in the xref:concepts:logging.adoc[logging concepts].
@@ -35,12 +37,12 @@ spec:
level: INFO
...
----
-<1> Name of a ConfigMap that referenced the vector aggregator. See example below.
+<1> Name of a ConfigMap that references the vector aggregator.
+See example below.
<2> Enable the vector agent in the history pod.
<3> Configure log levels for file and console outputs.
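For orientation, the discovery ConfigMap referenced in callout <1> commonly carries the aggregator's address under an `ADDRESS` key, per the Stackable logging convention. The ConfigMap name and the address below are placeholders:

```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-vector-aggregator-discovery  # placeholder; referenced from the SparkApplication
data:
  ADDRESS: vector-aggregator:6000  # placeholder address of the vector aggregator
```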
-Spark applications are submitted to the Spark Operator as SparkApplication resources. These resources are used to define the configuration of the Spark job, including the image to use, the main application file, and the number of executors to start.
+Spark applications are submitted to the Spark Operator as SparkApplication resources.
+These resources are used to define the configuration of the Spark job, including the image to use, the main application file, and the number of executors to start.

-Upon creation, the application's status set to `Unknown`. As the operator creates the necessary resources, the status of the application transitions through different phases that reflect the phase of the driver Pod. A successful application will eventually reach the `Succeeded` phase.
+Upon creation, the application's status is set to `Unknown`.
+As the operator creates the necessary resources, the status of the application transitions through different phases that reflect the phase of the driver Pod. A successful application eventually reaches the `Succeeded` phase.

-NOTE: The operator will never reconcile an application once it has been created. To resubmit an application, a new SparkApplication resource must be created.
+NOTE: The operator never reconciles an application once it has been created.
+To resubmit an application, a new SparkApplication resource must be created.