[Merged by Bors] - Docs: new usage guide and index page #229
Commits
397ae52 usage guide restructuring
50cad07 Added dummy image
a9b6997 updates
513a418 Merge branch 'main' into docs/new-usage-and-index
fhennig 14e9a1d Update the CRD reference.
razvan 071480e Update docs/modules/spark-k8s/pages/index.adoc
fhennig 9b45161 Update docs/modules/spark-k8s/pages/usage-guide/index.adoc
fhennig 0114054 Update docs/modules/spark-k8s/pages/index.adoc
fhennig 44f055e Removed some clutter from the diagram
aa27408 fixed links
e783cfc Updated diagram to include Spark History Server
7bda95e Updated spark history text
@@ -0,0 +1,106 @@
= CRD reference

Below are listed the CRD fields that can be defined by the user:

|===
|CRD field |Remarks

|`apiVersion`
|`spark.stackable.tech/v1alpha1`

|`kind`
|`SparkApplication`

|`metadata.name`
|Job name

|`spec.version`
|"1.0"

|`spec.mode`
|`cluster` or `client`. Currently only `cluster` is supported

|`spec.image`
|User-supplied image containing spark-job dependencies that will be copied to the specified volume mount

|`spec.sparkImage`
|Spark image which will be deployed to driver and executor pods, which must contain the Spark environment needed by the job, e.g. `docker.stackable.tech/stackable/spark-k8s:3.3.0-stackable0.3.0`

|`spec.sparkImagePullPolicy`
|Optional enum (one of `Always`, `IfNotPresent` or `Never`) that determines the pull policy of the Spark job image

|`spec.sparkImagePullSecrets`
|An optional list of references to secrets in the same namespace to use for pulling any of the images used by a `SparkApplication` resource. Each reference has a single property (`name`) that must contain a reference to a valid secret

|`spec.mainApplicationFile`
|The actual application file that will be called by `spark-submit`

|`spec.mainClass`
|The main class, i.e. the entry point for JVM artifacts

|`spec.args`
|Arguments passed directly to the job artifact

|`spec.s3connection`
|S3 connection specification. See the <<S3 bucket specification>> for more details.

|`spec.sparkConf`
|A map of key/value strings that will be passed directly to `spark-submit`

|`spec.deps.requirements`
|A list of Python packages that will be installed via `pip`

|`spec.deps.packages`
|A list of packages that is passed directly to `spark-submit`

|`spec.deps.excludePackages`
|A list of excluded packages that is passed directly to `spark-submit`

|`spec.deps.repositories`
|A list of repositories that is passed directly to `spark-submit`

|`spec.volumes`
|A list of volumes

|`spec.volumes.name`
|The volume name

|`spec.volumes.persistentVolumeClaim.claimName`
|The persistent volume claim backing the volume

|`spec.job.resources`
|Resources specification for the initiating Job

|`spec.driver.resources`
|Resources specification for the driver Pod

|`spec.driver.volumeMounts`
|A list of mounted volumes for the driver

|`spec.driver.volumeMounts.name`
|Name of mount

|`spec.driver.volumeMounts.mountPath`
|Volume mount path

|`spec.driver.nodeSelector`
|A dictionary of labels to use for node selection when scheduling the driver. N.B. this assumes there are no implicit node dependencies (e.g. `PVC`, `VolumeMount`) defined elsewhere.

|`spec.executor.resources`
|Resources specification for the executor Pods

|`spec.executor.instances`
|Number of executor instances launched for this job

|`spec.executor.volumeMounts`
|A list of mounted volumes for each executor

|`spec.executor.volumeMounts.name`
|Name of mount

|`spec.executor.volumeMounts.mountPath`
|Volume mount path

|`spec.executor.nodeSelector`
|A dictionary of labels to use for node selection when scheduling the executors. N.B. this assumes there are no implicit node dependencies (e.g. `PVC`, `VolumeMount`) defined elsewhere.
|===
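
For orientation, a minimal manifest that exercises a handful of the fields above could look like the following sketch. The name, application file and main class are placeholders chosen for illustration, not values prescribed by the CRD; only the field structure and the `apiVersion`/`kind` values come from the table.

[source,yaml]
----
# Minimal sketch: the field structure follows the table above,
# the concrete values are placeholders.
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: my-spark-job                # placeholder job name
spec:
  version: "1.0"
  mode: cluster
  sparkImage: docker.stackable.tech/stackable/spark-k8s:3.3.0-stackable0.3.0
  mainApplicationFile: local:///path/to/my-job.jar   # placeholder artifact
  mainClass: org.example.MyJob                       # placeholder entry point
  executor:
    instances: 2
----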
@@ -1,18 +1,55 @@
= Stackable Operator for Apache Spark on Kubernetes
= Stackable Operator for Apache Spark
:description: The Stackable Operator for Apache Spark is a Kubernetes operator that can manage Apache Spark clusters. Learn about its features, resources, dependencies and demos, and see the list of supported Spark versions.
:keywords: Stackable Operator, Apache Spark, Kubernetes, operator, data science, engineer, big data, CRD, StatefulSet, ConfigMap, Service, S3, demo, version

This is an operator for Kubernetes that can manage https://spark.apache.org/[Apache Spark] kubernetes clusters.
This is an operator for Kubernetes that can manage https://spark.apache.org/[Apache Spark] Kubernetes clusters. Apache Spark is a powerful open-source big data processing framework that allows for efficient and flexible distributed computing. Its in-memory processing and fault-tolerant architecture make it ideal for a variety of use cases, including batch processing, real-time streaming, machine learning, and graph processing.

WARNING: This operator only works with images from the https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%2Fspark[Stackable] repository
== Getting Started

Follow the xref:getting_started/index.adoc[] guide to get started with Apache Spark using the Stackable Operator. The guide will lead you through the installation of the Operator and running your first Spark job on Kubernetes.

== How the Operator works

The Stackable Operator for Apache Spark reads a _SparkApplication custom resource_ which you use to define your Spark job/application. The Operator creates the relevant Kubernetes resources for the job to run.

=== SparkApplication custom resource

The SparkApplication resource is the main point of interaction with the Operator. An exhaustive list of options is given on the xref:crd-reference.adoc[] page.

Unlike other Operators, the Spark Operator does not have xref:concepts:roles-and-role-groups.adoc[roles].

=== Kubernetes resources

For every SparkApplication deployed to the cluster, the Operator creates a Job, a ServiceAccount and a few ConfigMaps.

image::spark_overview.drawio.svg[A diagram depicting the Kubernetes resources created by the operator]

The Job runs `spark-submit` in a Pod which then creates a Spark driver Pod. The driver creates its own executors based on the configuration in the SparkApplication. The Job, driver and executors all use the same image, which is configured in the SparkApplication resource.

The two main ConfigMaps are the `<name>-driver-pod-template` and `<name>-executor-pod-template` which define how the driver and executor Pods should be created.
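
For orientation only, a driver pod template ConfigMap could look roughly like the sketch below. This is a hypothetical illustration: the data key, labels and template fields are assumptions rather than the operator's exact output; only the `<name>-driver-pod-template` naming comes from the description above.

[source,yaml]
----
# Hypothetical sketch: the data key and template fields are assumptions,
# not the operator's exact output.
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-spark-app-driver-pod-template   # <name>-driver-pod-template
data:
  driver-pod-template.yaml: |
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        app: my-spark-app
    spec:
      containers:
        - name: spark-driver
          image: docker.stackable.tech/stackable/spark-k8s:3.3.0-stackable0.3.0
----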

=== RBAC

The https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac[Spark-Kubernetes RBAC documentation] describes what is needed for `spark-submit` jobs to run successfully: minimally a role/cluster-role to allow the driver pod to create and manage executor pods.

However, to add security, each `spark-submit` job launched by the spark-k8s operator will be assigned its own ServiceAccount.

When the spark-k8s operator is installed via Helm, a cluster role named `spark-k8s-clusterrole` is created with pre-defined permissions.

When a new Spark application is submitted, the operator creates a new service account with the same name as the application and binds this account to the cluster role `spark-k8s-clusterrole` created by Helm.
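
To make the wiring concrete, the RBAC objects for an application named, say, `my-spark-app` would look roughly like the sketch below. This is an illustrative approximation: the kind of binding, the namespace and all object names other than `spark-k8s-clusterrole` are assumptions; only the cluster role name and the per-application ServiceAccount naming come from the description above.

[source,yaml]
----
# Illustrative sketch of the per-application RBAC wiring.
# The binding name and namespace are assumptions.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-spark-app           # same name as the SparkApplication
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-spark-app
  namespace: default
subjects:
  - kind: ServiceAccount
    name: my-spark-app
    namespace: default
roleRef:
  kind: ClusterRole
  name: spark-k8s-clusterrole  # created by the Helm installation
  apiGroup: rbac.authorization.k8s.io
----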

== Integrations

You can read and write data from xref:usage-guide/s3.adoc[S3 buckets] and load xref:usage-guide/job-dependencies[custom job dependencies]. Spark also supports easy integration with Apache Kafka, which is also supported xref:kafka:index.adoc[on the Stackable Data Platform]. Have a look at the demos below to see it in action.

== [[demos]]Demos

The xref:stackablectl::demos/data-lakehouse-iceberg-trino-spark.adoc[] demo connects multiple components and datasets into a data Lakehouse. A Spark application with https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html[structured streaming] is used to stream data from Apache Kafka into the Lakehouse.

In the xref:stackablectl::demos/spark-k8s-anomaly-detection-taxi-data.adoc[] demo, Spark is used to read training data from S3 and train an anomaly detection model on the data. The model is then stored in a Trino table.

== Supported Versions

The Stackable Operator for Apache Spark on Kubernetes currently supports the following versions of Spark:

include::partial$supported-versions.adoc[]

== Getting the Docker image

[source]
----
docker pull docker.stackable.tech/stackable/spark-k8s:<version>
----
This file was deleted.
@@ -0,0 +1,87 @@
= Examples

The following examples have the following `spec` fields in common:

- `version`: the current version is "1.0"
- `sparkImage`: the docker image that will be used by job, driver and executor pods. This can be provided by the user.
- `mode`: only `cluster` is currently supported
- `mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
- `args`: these are the arguments passed directly to the application. In the examples below it is e.g. the input path for part of the public New York taxi dataset.
- `sparkConf`: these list spark configuration settings that are passed directly to `spark-submit` and which are best defined explicitly by the user. Since the `SparkApplication` "knows" that there is an external dependency (the s3 bucket where the data and/or the application is located) and how that dependency should be treated (i.e. what type of credential checks are required, if any), it is better to have these things declared together.
- `volumes`: refers to any volumes needed by the `SparkApplication`, in this case an underlying `PersistentVolumeClaim`.
- `driver`: driver-specific settings, including any volume mounts.
- `executor`: executor-specific settings, including any volume mounts.

Job-specific settings are annotated below.
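
Before the individual examples, here is a skeletal and entirely hypothetical SparkApplication that shows how these common fields fit together. All names, paths, the image tag and the configuration values are placeholders and are not taken from the example files included below.

[source,yaml]
----
# Skeletal sketch only: every value is a placeholder.
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/spark-k8s:3.3.0-stackable0.3.0
  mode: cluster
  mainApplicationFile: s3a://my-bucket/my-job.py      # placeholder artifact
  args:
    - "s3a://my-bucket/input/taxi-data.csv"           # placeholder input path
  sparkConf:
    spark.driver.extraClassPath: /dependencies/jars/*
    spark.executor.extraClassPath: /dependencies/jars/*
  volumes:
    - name: job-deps
      persistentVolumeClaim:
        claimName: my-pvc                             # pre-existing PVC, placeholder name
  driver:
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies
  executor:
    instances: 3
    volumeMounts:
      - name: job-deps
        mountPath: /dependencies
----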

== Pyspark: externally located artifact and dataset

[source,yaml]
----
include::example$example-sparkapp-external-dependencies.yaml[]
----

<1> Job python artifact (external)
<2> Job argument (external)
<3> List of python job requirements: these will be installed in the pods via `pip`
<4> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in s3)
<5> the name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
<6> the path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors

== Pyspark: externally located dataset, artifact available via PVC/volume mount

[source,yaml]
----
include::example$example-sparkapp-image.yaml[]
----

<1> Job image: this contains the job artifact that will be retrieved from the volume mount backed by the PVC
<2> Job python artifact (local)
<3> Job argument (external)
<4> List of python job requirements: these will be installed in the pods via `pip`
<5> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)

== JVM (Scala): externally located artifact and dataset

[source,yaml]
----
include::example$example-sparkapp-pvc.yaml[]
----

<1> Job artifact located on S3.
<2> Job main class
<3> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials)
<4> the name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
<5> the path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors

== JVM (Scala): externally located artifact accessed with credentials

[source,yaml]
----
include::example$example-sparkapp-s3-private.yaml[]
----

<1> Job artifact (located in an S3 store)
<2> Artifact class
<3> S3 section, specifying the existing secret and S3 end-point (in this case, MinIO)
<4> Credentials referencing a secretClass (not shown in this example)
<5> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources...
<6> ...in this case, in an S3 store, accessed with the credentials defined in the secret

== JVM (Scala): externally located artifact accessed with job arguments provided via configuration map

[source,yaml]
----
include::example$example-configmap.yaml[]
----
[source,yaml]
----
include::example$example-sparkapp-configmap.yaml[]
----
<1> Name of the configuration map
<2> Argument required by the job
<3> Job scala artifact that requires an input argument
<4> The volume backed by the configuration map
<5> The expected job argument, accessed via the mounted configuration map file
<6> The name of the volume backed by the configuration map that will be mounted to the driver/executor
<7> The mount location of the volume (this will contain a file `/arguments/job-args.txt`)
@@ -0,0 +1,3 @@
= Usage guide

Learn how to load your own xref:usage-guide/job-dependencies.adoc[] or configure an xref:usage-guide/s3.adoc[S3 connection]. Have a look at the xref:usage-guide/examples.adoc[] to learn more about different operating modes.
1 change: 1 addition & 0 deletions
...les/spark-k8s/pages/job_dependencies.adoc → ...s/pages/usage-guide/job-dependencies.adoc
@@ -1,4 +1,5 @@
= Job Dependencies
:page-aliases: job_dependencies.adoc

== Overview

File renamed without changes.
@@ -0,0 +1,36 @@
= Resource Requests

include::home:concepts:stackable_resource_requests.adoc[]

If no resources are configured explicitly, the operator uses the following defaults:

[source,yaml]
----
job:
  resources:
    cpu:
      min: '500m'
      max: "1"
    memory:
      limit: '1Gi'
driver:
  resources:
    cpu:
      min: '1'
      max: "2"
    memory:
      limit: '2Gi'
executor:
  resources:
    cpu:
      min: '1'
      max: "4"
    memory:
      limit: '4Gi'
----
WARNING: The default values are _most likely_ not sufficient to run a proper cluster in production. Please adapt according to your requirements.
For more details regarding Kubernetes CPU limits see: https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/[Assign CPU Resources to Containers and Pods].

Spark allocates a default amount of non-heap memory based on the type of job (JVM or non-JVM). This is taken into account when defining memory settings based exclusively on the resource limits, so that the "declared" value is the actual total value (i.e. including memory overhead). This may result in minor deviations from the stated resource value due to rounding differences.

NOTE: It is possible to define Spark resources either directly by setting configuration properties listed under `sparkConf`, or by using resource limits. If both are used, then `sparkConf` properties take precedence. It is recommended for the sake of clarity to use *_either_* one *_or_* the other.
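
As an illustration of the note above, the two ways of influencing resources could look roughly like this. The values are placeholders; `spark.driver.memory`, `spark.executor.memory` and `spark.executor.cores` are standard Spark properties used here only to show the kind of `sparkConf` entries that would take precedence over the declared resource limits.

[source,yaml]
----
# Illustrative sketch with placeholder values. If both forms are present,
# the sparkConf properties win over the values derived from the resource limits.
spec:
  sparkConf:
    spark.driver.memory: "2g"
    spark.executor.memory: "4g"
    spark.executor.cores: "2"
  driver:
    resources:
      cpu:
        min: '1'
        max: "2"
      memory:
        limit: '2Gi'
  executor:
    resources:
      cpu:
        min: '1'
        max: "4"
      memory:
        limit: '4Gi'
----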