
Commit ab9cecc

Merge branch 'main' into refactor/config-overrides

2 parents: 90c6e8a + f50ad32

2 files changed: +58 / -25 lines

docs/modules/spark-k8s/pages/getting_started/installation.adoc

Lines changed: 17 additions & 10 deletions
@@ -1,10 +1,15 @@
= Installation

-On this page you will install the Stackable Spark-on-Kubernetes operator as well as the Commons and Secret operators which are required by all Stackable operators.
+On this page you will install the Stackable Spark-on-Kubernetes operator as well as the Commons and Secret operators
+which are required by all Stackable operators.

== Dependencies

-Spark applications almost always require dependencies like database drivers, REST api clients and many others. These dependencies must be available on the `classpath` of each executor (and in some cases of the driver, too). There are multiple ways to provision Spark jobs with such dependencies: some are built into Spark itself while others are implemented at the operator level. In this guide we are going to keep things simple and look at executing a Spark job that has a minimum of dependencies.
+Spark applications almost always require dependencies like database drivers, REST API clients and many others. These
+dependencies must be available on the `classpath` of each executor (and in some cases of the driver, too). There are
+multiple ways to provision Spark jobs with such dependencies: some are built into Spark itself while others are
+implemented at the operator level. In this guide we are going to keep things simple and look at executing a Spark job
+that has a minimum of dependencies.

More information about the different ways to define Spark jobs and their dependencies is given on the following pages:

@@ -15,14 +20,13 @@ More information about the different ways to define Spark jobs and their depende

There are 2 ways to install Stackable operators

-1. Using xref:stackablectl::index.adoc[]
-
-2. Using a Helm chart
+. Using xref:management:stackablectl:index.adoc[]
+. Using a Helm chart

=== stackablectl

-`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install Operators.
-Follow the xref:stackablectl::installation.adoc[installation steps] for your platform.
+`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install
+Operators. Follow the xref:management:stackablectl:installation.adoc[installation steps] for your platform.

After you have installed `stackablectl` run the following command to install the Spark-k8s operator:
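The install command itself is pulled in from the `getting_started.sh` script referenced in this guide and is not expanded in this diff. As a rough, illustrative sketch only (assuming current `stackablectl` syntax, and that the Commons and Secret operators are installed alongside), the invocation looks something like:

[source,bash]
----
# Illustrative sketch -- the authoritative command lives in the included
# getting_started.sh script. Installs the Spark-k8s operator together with the
# Commons and Secret operators that all Stackable operators require.
stackablectl operator install commons secret spark-k8s
----

The `--cluster kind` flag mentioned in the TIP below can be appended to have stackablectl create a local link:https://kind.sigs.k8s.io/[kind] cluster first.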

@@ -39,7 +43,8 @@ The tool will show
[INFO ] Installing spark-k8s operator
----

-TIP: Consult the xref:stackablectl::quickstart.adoc[] to learn more about how to use stackablectl. For example, you can use the `-k` flag to create a Kubernetes cluster with link:https://kind.sigs.k8s.io/[kind].
+TIP: Consult the xref:management:stackablectl:quickstart.adoc[] to learn more about how to use stackablectl. For
+example, you can use the `--cluster kind` flag to create a Kubernetes cluster with link:https://kind.sigs.k8s.io/[kind].

=== Helm

@@ -55,8 +60,10 @@ Then install the Stackable Operators:
include::example$getting_started/getting_started.sh[tag=helm-install-operators]
----
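The chart installation included above (tag `helm-install-operators`) is likewise not expanded in this diff. A sketch of what such commands typically look like, assuming the standard Stackable Helm repository URL and chart names (verify against the included script before use):

[source,bash]
----
# Assumed repository URL and chart names -- check getting_started.sh for the
# authoritative versions. Installs the required Commons and Secret operators
# alongside the Spark-k8s operator.
helm repo add stackable-stable https://repo.stackable.tech/repository/helm-stable/
helm repo update
helm install commons-operator stackable-stable/commons-operator
helm install secret-operator stackable-stable/secret-operator
helm install spark-k8s-operator stackable-stable/spark-k8s-operator
----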

-Helm will deploy the operators in a Kubernetes Deployment and apply the CRDs for the `SparkApplication` (as well as the CRDs for the required operators). You are now ready to create a Spark job.
+Helm will deploy the operators in a Kubernetes Deployment and apply the CRDs for the `SparkApplication` (as well as the
+CRDs for the required operators). You are now ready to create a Spark job.

== What's next

-xref:getting_started/first_steps.adoc[Execute a Spark Job] and xref:getting_started/first_steps.adoc#_verify_that_it_works[verify that it works] by inspecting the pod logs.
+xref:getting_started/first_steps.adoc[Execute a Spark Job] and
+xref:getting_started/first_steps.adoc#_verify_that_it_works[verify that it works] by inspecting the pod logs.

docs/modules/spark-k8s/pages/index.adoc

Lines changed: 41 additions & 15 deletions
@@ -2,55 +2,81 @@
:description: The Stackable Operator for Apache Spark is a Kubernetes operator that can manage Apache Spark clusters. Learn about its features, resources, dependencies and demos, and see the list of supported Spark versions.
:keywords: Stackable Operator, Apache Spark, Kubernetes, operator, data science, engineer, big data, CRD, StatefulSet, ConfigMap, Service, S3, demo, version

-This is an operator manages https://spark.apache.org/[Apache Spark] on Kubernetes clusters. Apache Spark is a powerful open-source big data processing framework that allows for efficient and flexible distributed computing. Its in-memory processing and fault-tolerant architecture make it ideal for a variety of use cases, including batch processing, real-time streaming, machine learning, and graph processing.
+:structured-streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
+
+This operator manages https://spark.apache.org/[Apache Spark] on Kubernetes clusters. Apache Spark is a powerful
+open-source big data processing framework that allows for efficient and flexible distributed computing. Its in-memory
+processing and fault-tolerant architecture make it ideal for a variety of use cases, including batch processing,
+real-time streaming, machine learning, and graph processing.

== Getting Started

-Follow the xref:getting_started/index.adoc[] guide to get started with Apache Spark using the Stackable Operator. The guide will lead you through the installation of the Operator and running your first Spark application on Kubernetes.
+Follow the xref:getting_started/index.adoc[] guide to get started with Apache Spark using the Stackable Operator. The
+guide will lead you through the installation of the Operator and running your first Spark application on Kubernetes.

== How the Operator works

-The Stackable Operator for Apache Spark reads a _SparkApplication custom resource_ which you use to define your spark job/application. The Operator creates the relevant Kubernetes resources for the job to run.
+The Stackable Operator for Apache Spark reads a _SparkApplication custom resource_ which you use to define your Spark
+job/application. The Operator creates the relevant Kubernetes resources for the job to run.

=== Custom resources

The Operator manages two custom resource kinds: The _SparkApplication_ and the _SparkHistoryServer_.

-The SparkApplication resource is the main point of interaction with the Operator. Unlike other Stackable Operator custom resources, the SparkApplication does not have xref:concepts:roles-and-role-groups.adoc[roles]. An exhaustive list of options is given on the xref:crd-reference.adoc[] page.
+The SparkApplication resource is the main point of interaction with the Operator. Unlike other Stackable Operator custom
+resources, the SparkApplication does not have xref:concepts:roles-and-role-groups.adoc[roles]. An exhaustive list of
+options is given on the xref:crd-reference.adoc[] page.
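To make the SparkApplication resource concrete, here is a deliberately minimal, hypothetical example. The field names and values shown are assumptions and vary between operator versions; the xref:crd-reference.adoc[] page is authoritative.

[source,bash]
----
# Hypothetical minimal SparkApplication -- the name, image version and the
# application file path are assumptions; consult the CRD reference for the
# fields supported by your operator version.
kubectl apply -f - <<EOF
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-pi
spec:
  mode: cluster
  mainApplicationFile: local:///stackable/spark/examples/src/main/python/pi.py
  sparkImage:
    productVersion: 3.5.0
EOF
----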

-The xref:usage-guide/history-server.adoc[SparkHistoryServer] does have a single `node` role. It is used to deploy a https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact[Spark history server]. It reads data from an S3 bucket that you configure. Your applications need to write their logs to the same bucket.
+The xref:usage-guide/history-server.adoc[SparkHistoryServer] does have a single `node` role. It is used to deploy a
+https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact[Spark history server]. It reads data from an
+S3 bucket that you configure. Your applications need to write their logs to the same bucket.

=== Kubernetes resources

For every SparkApplication deployed to the cluster the Operator creates a Job, a ServiceAccount and a few ConfigMaps.

image::spark_overview.drawio.svg[A diagram depicting the Kubernetes resources created by the operator]

-The Job runs `spark-submit` in a Pod which then creates a Spark driver Pod. The driver creates its own Executors based on the configuration in the SparkApplication. The Job, driver and executors all use the same image, which is configured in the SparkApplication resource.
+The Job runs `spark-submit` in a Pod which then creates a Spark driver Pod. The driver creates its own Executors based
+on the configuration in the SparkApplication. The Job, driver and executors all use the same image, which is configured
+in the SparkApplication resource.

-The two main ConfigMaps are the `<name>-driver-pod-template` and `<name>-executor-pod-template` which define how the driver and executor Pods should be created.
+The two main ConfigMaps are the `<name>-driver-pod-template` and `<name>-executor-pod-template` which define how the
+driver and executor Pods should be created.
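One way to see these resources after submitting an application (reusing the hypothetical `pyspark-pi` name from the sketch above) is to list them directly:

[source,bash]
----
# The Job that runs spark-submit, plus the driver and executor Pods it spawns
kubectl get jobs,pods

# The generated pod templates, following the <name>-driver-pod-template /
# <name>-executor-pod-template naming pattern described above
kubectl get configmap pyspark-pi-driver-pod-template pyspark-pi-executor-pod-template -o yaml
----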

-The Spark history server deploys like other Stackable-supported applications: A Statefulset is created for every role group. A role group can have multiple replicas (Pods). A ConfigMap supplies the necessary configuration, and there is a service to connect to.
+The Spark history server deploys like other Stackable-supported applications: A Statefulset is created for every role
+group. A role group can have multiple replicas (Pods). A ConfigMap supplies the necessary configuration, and there is a
+service to connect to.

=== RBAC

-The https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac[Spark-Kubernetes RBAC documentation] describes what is needed for `spark-submit` jobs to run successfully: minimally a role/cluster-role to allow the driver pod to create and manage executor pods.
+The https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac[Spark-Kubernetes RBAC documentation] describes
+what is needed for `spark-submit` jobs to run successfully: minimally a role/cluster-role to allow the driver pod to
+create and manage executor pods.

-However, to add security, each `spark-submit` job launched by the spark-k8s operator will be assigned its own ServiceAccount.
+However, to add security, each `spark-submit` job launched by the spark-k8s operator will be assigned its own
+ServiceAccount.

-When the spark-k8s operator is installed via Helm, a cluster role named `spark-k8s-clusterrole` is created with pre-defined permissions.
+When the spark-k8s operator is installed via Helm, a cluster role named `spark-k8s-clusterrole` is created with
+pre-defined permissions.

-When a new Spark application is submitted, the operator creates a new service account with the same name as the application and binds this account to the cluster role `spark-k8s-clusterrole` created by Helm.
+When a new Spark application is submitted, the operator creates a new service account with the same name as the
+application and binds this account to the cluster role `spark-k8s-clusterrole` created by Helm.
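A quick way to verify this wiring (again assuming a hypothetical application named `pyspark-pi`):

[source,bash]
----
# Per-application ServiceAccount, named after the application
kubectl get serviceaccount pyspark-pi

# The ClusterRole created by the Helm installation, and the binding to it
kubectl get clusterrole spark-k8s-clusterrole
kubectl get rolebindings,clusterrolebindings | grep spark-k8s
----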

== Integrations

-You can read and write data from xref:usage-guide/s3.adoc[s3 buckets], load xref:usage-guide/job-dependencies[custom job dependencies]. Spark also supports easy integration with Apache Kafka which is also supported xref:kafka:index.adoc[on the Stackable Data Platform]. Have a look at the demos below to see it in action.
+You can read and write data from xref:usage-guide/s3.adoc[S3 buckets] and load xref:usage-guide/job-dependencies[custom
+job dependencies]. Spark also integrates easily with Apache Kafka, which is likewise supported xref:kafka:index.adoc[on
+the Stackable Data Platform]. Have a look at the demos below to see it in action.

== [[demos]]Demos

-The xref:stackablectl::demos/data-lakehouse-iceberg-trino-spark.adoc[] demo connects multiple components and datasets into a data Lakehouse. A Spark application with https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html[structured streaming] is used to stream data from Apache Kafka into the Lakehouse.
+The xref:demos:data-lakehouse-iceberg-trino-spark.adoc[] demo connects multiple components and datasets into a data
+Lakehouse. A Spark application with {structured-streaming}[structured streaming] is used to stream data from Apache
+Kafka into the Lakehouse.

-In the xref:stackablectl::demos/spark-k8s-anomaly-detection-taxi-data.adoc[] demo Spark is used to read training data from S3 and train an anomaly detection model on the data. The model is then stored in a Trino table.
+In the xref:demos:spark-k8s-anomaly-detection-taxi-data.adoc[] demo Spark is used to read training data from S3 and
+train an anomaly detection model on the data. The model is then stored in a Trino table.

== Supported Versions