:description: The Stackable Operator for Apache Spark is a Kubernetes operator that can manage Apache Spark clusters. Learn about its features, resources, dependencies and demos, and see the list of supported Spark versions.
:keywords: Stackable Operator, Apache Spark, Kubernetes, operator, data science, engineer, big data, CRD, StatefulSet, ConfigMap, Service, S3, demo, version

:structured-streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

This operator manages https://spark.apache.org/[Apache Spark] on Kubernetes clusters. Apache Spark is a powerful
open-source big data processing framework that allows for efficient and flexible distributed computing. Its in-memory
processing and fault-tolerant architecture make it ideal for a variety of use cases, including batch processing,
real-time streaming, machine learning, and graph processing.

== Getting Started

Follow the xref:getting_started/index.adoc[] guide to get started with Apache Spark using the Stackable Operator. The
guide will lead you through the installation of the Operator and running your first Spark application on Kubernetes.

== How the Operator works

The Stackable Operator for Apache Spark reads a _SparkApplication custom resource_ which you use to define your Spark
job or application. The Operator creates the relevant Kubernetes resources for the job to run.

=== Custom resources

The Operator manages two custom resource kinds: the _SparkApplication_ and the _SparkHistoryServer_.

The SparkApplication resource is the main point of interaction with the Operator. Unlike other Stackable Operator custom
resources, the SparkApplication does not have xref:concepts:roles-and-role-groups.adoc[roles]. An exhaustive list of
options is given on the xref:crd-reference.adoc[] page.
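
For orientation, a SparkApplication might look roughly like the sketch below. The name, image tag and application file
are placeholders rather than recommendations, and the exact schema depends on the operator version; the
xref:crd-reference.adoc[] page is authoritative.

[source,yaml]
----
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-pi  # example name
spec:
  # The shape of the image, driver and executor settings varies between operator
  # versions; consult the CRD reference for the fields your version supports.
  sparkImage: docker.stackable.tech/stackable/pyspark-k8s:3.3.0-stackable0.3.0  # placeholder image tag
  mode: cluster
  mainApplicationFile: local:///stackable/spark/examples/src/main/python/pi.py  # example bundled with the Spark image
----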

The xref:usage-guide/history-server.adoc[SparkHistoryServer] does have a single `node` role. It is used to deploy a
https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact[Spark history server]. It reads data from an
S3 bucket that you configure. Your applications need to write their logs to the same bucket.
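
The following is only a sketch of what such a resource can look like; the bucket reference, prefix and role group
settings are placeholders, and the exact field names for your operator version are shown in the
xref:usage-guide/history-server.adoc[] guide.

[source,yaml]
----
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkHistoryServer
metadata:
  name: spark-history  # example name
spec:
  image:
    productVersion: 3.5.0  # placeholder Spark version
  logFileDirectory:
    s3:
      prefix: eventlogs/  # your applications must write their event logs under the same prefix
      bucket:
        reference: spark-history-s3-bucket  # placeholder reference to an S3 bucket definition
  nodes:
    roleGroups:
      default:
        replicas: 1
----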

=== Kubernetes resources

For every SparkApplication deployed to the cluster, the Operator creates a Job, a ServiceAccount and a few ConfigMaps.

image::spark_overview.drawio.svg[A diagram depicting the Kubernetes resources created by the operator]

The Job runs `spark-submit` in a Pod which then creates a Spark driver Pod. The driver creates its own Executors based
on the configuration in the SparkApplication. The Job, driver and executors all use the same image, which is configured
in the SparkApplication resource.

The two main ConfigMaps are the `<name>-driver-pod-template` and `<name>-executor-pod-template` which define how the
driver and executor Pods should be created.
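
To make this more tangible, the sketch below shows the general shape of a Spark pod template: it is simply a partial Pod
manifest that Spark merges into the driver (or executor) Pod it creates. The label and volume here are invented for
illustration; the real ConfigMap contents are generated by the operator from your SparkApplication.

[source,yaml]
----
# Illustrative pod template only; the operator generates the real one.
apiVersion: v1
kind: Pod
metadata:
  labels:
    app.kubernetes.io/instance: pyspark-pi  # hypothetical label
spec:
  containers:
    # "spark-kubernetes-driver" is the container name Spark looks for by default
    # in driver pod templates ("spark-kubernetes-executor" for executors).
    - name: spark-kubernetes-driver
      volumeMounts:
        - name: job-deps  # hypothetical volume carrying job dependencies
          mountPath: /dependencies
  volumes:
    - name: job-deps
      emptyDir: {}
----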

The Spark history server deploys like other Stackable-supported applications: a StatefulSet is created for every role
group. A role group can have multiple replicas (Pods). A ConfigMap supplies the necessary configuration, and there is a
Service to connect to.

=== RBAC

The https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac[Spark-Kubernetes RBAC documentation] describes
what is needed for `spark-submit` jobs to run successfully: at minimum, a Role or ClusterRole that allows the driver Pod
to create and manage executor Pods.
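
As an illustration of the kind of permissions this implies (not the exact ClusterRole shipped with the Helm chart), a
minimal Role for a driver could look like this:

[source,yaml]
----
# Minimal, illustrative Role: lets a driver ServiceAccount manage executor Pods
# and related resources in its own namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-driver-role  # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["pods", "configmaps", "services"]
    verbs: ["create", "get", "list", "watch", "delete"]
----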

However, to add security, each `spark-submit` job launched by the spark-k8s operator will be assigned its own
ServiceAccount.

When the spark-k8s operator is installed via Helm, a ClusterRole named `spark-k8s-clusterrole` is created with
pre-defined permissions.

When a new Spark application is submitted, the operator creates a new ServiceAccount with the same name as the
application and binds this account to the ClusterRole `spark-k8s-clusterrole` created by Helm.

== Integrations

You can read and write data from xref:usage-guide/s3.adoc[S3 buckets] and load
xref:usage-guide/job-dependencies.adoc[custom job dependencies]. Spark also supports easy integration with Apache Kafka,
which is likewise supported xref:kafka:index.adoc[on the Stackable Data Platform]. Have a look at the demos below to see
it in action.
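
As a rough sketch of the S3 configuration (host, port and secret names below are placeholders; the
xref:usage-guide/s3.adoc[S3 guide] documents the exact structure for your operator version), an S3 connection can, for
example, be declared inline in the SparkApplication:

[source,yaml]
----
spec:
  s3connection:
    inline:
      host: minio.example.svc.cluster.local  # placeholder endpoint
      port: 9000
      accessStyle: Path
      credentials:
        secretClass: spark-s3-credentials  # placeholder SecretClass holding the access and secret keys
----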

== [[demos]]Demos

The xref:demos:data-lakehouse-iceberg-trino-spark.adoc[] demo connects multiple components and datasets into a data
Lakehouse. A Spark application with {structured-streaming}[structured streaming] is used to stream data from Apache
Kafka into the Lakehouse.

In the xref:demos:spark-k8s-anomaly-detection-taxi-data.adoc[] demo, Spark is used to read training data from S3 and
train an anomaly detection model on the data. The model is then stored in a Trino table.

== Supported Versions