
Commit de685fa

Felix Hennig authored and committed
Split up usage page & new index page (#260)
# Description see stackabletech/documentation#282 Co-authored-by: Felix Hennig <[email protected]>
1 parent cf21fa5 commit de685fa

16 files changed: +364 -213 lines

CHANGELOG.md

+6-3

@@ -9,6 +9,7 @@
 - Add the ability to loads DAG via git-sync ([#245]).
 - Cluster status conditions ([#255])
 - Extend cluster resources for status and cluster operation (paused, stopped) ([#257])
+- Added more detailed landing page for the docs ([#260]).

 ### Changed

@@ -20,9 +21,10 @@
 - `operator-rs` `0.31.0` -> `0.34.0` -> `0.39.0` ([#219]) ([#257]).
 - Specified security context settings needed for OpenShift ([#222]).
 - Fixed template parsing for OpenShift tests ([#222]).
-- Revert openshift settings ([#233])
-- Support crate2nix in dev environments ([#234])
-- Fixed LDAP tests on Openshift ([#254])
+- Revert openshift settings ([#233]).
+- Support crate2nix in dev environments ([#234]).
+- Fixed LDAP tests on Openshift ([#254]).
+- Reorganized usage guide docs([#260]).

 ### Removed

@@ -38,6 +40,7 @@
 [#255]: https://github.com/stackabletech/airflow-operator/pull/255
 [#257]: https://github.com/stackabletech/airflow-operator/pull/257
 [#258]: https://github.com/stackabletech/airflow-operator/pull/258
+[#260]: https://github.com/stackabletech/airflow-operator/pull/260

 ## [23.1.0] - 2023-01-23

docs/modules/airflow/images/airflow_overview.drawio.svg

+4

docs/modules/airflow/pages/getting_started/first_steps.adoc

+1-1

@@ -159,4 +159,4 @@ include::example$getting_started/code/getting_started.sh[tag=check-dag]

 == What's next

-Look at the xref:usage.adoc[Usage page] to find out more about configuring your Airflow cluster and loading your own DAG files.
+Look at the xref:usage-guide/index.adoc[] to find out more about configuring your Airflow cluster and loading your own DAG files.

docs/modules/airflow/pages/index.adoc

+42-11

@@ -1,20 +1,51 @@
 = Stackable Operator for Apache Airflow
+:description: The Stackable Operator for Apache Airflow is a Kubernetes operator that can manage Apache Airflow clusters. Learn about its features, resources, dependencies and demos, and see the list of supported Airflow versions.
+:keywords: Stackable Operator, Apache Airflow, Kubernetes, k8s, operator, engineer, big data, metadata, job pipeline, scheduler, workflow, ETL

-This is an operator for Kubernetes that can manage https://airflow.apache.org/[Apache Airflow]
-clusters.
+The Stackable Operator for Apache Airflow manages https://airflow.apache.org/[Apache Airflow] instances on Kubernetes.
+Apache Airflow is an open-source application for creating, scheduling, and monitoring workflows. Workflows are defined as code, with tasks that can be run on a variety of platforms, including Hadoop, Spark, and Kubernetes itself. Airflow is a popular choice to orchestrate ETL workflows and data pipelines.

-WARNING: This operator is part of the Stackable Data Platform and only works with images from the
-https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%2Fairflow[Stackable] repository.
+== Getting started
+
+Get started using Airflow with the Stackable Operator by following the xref:getting_started/index.adoc[] guide. It guides you through installing the Operator alongside a PostgreSQL database and Redis instance, connecting to your Airflow instance and running your first workflow.
+
+== Resources
+
+The Operator manages two https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/[custom resources]: the _AirflowCluster_ and the _AirflowDB_. Based on these custom resources, it creates a number of different Kubernetes resources.
+
+=== Custom resources
+
+The AirflowCluster is the main resource for the configuration of the Airflow instance. The resource defines three xref:concepts:roles-and-role-groups.adoc[roles]: `webserver`, `worker` and `scheduler`. The various configuration options are explained in the xref:usage-guide/index.adoc[]; it helps you tune your cluster to your needs by configuring xref:usage-guide/storage-resources.adoc[resource usage], xref:usage-guide/security.adoc[security], xref:usage-guide/logging.adoc[logging] and more.
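
For orientation, a minimal AirflowCluster might look like the following sketch. The resource name, versions and replica counts are illustrative and the field set is abridged; the xref:getting_started/index.adoc[] guide contains a complete, tested example.

[source,yaml]
----
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow                             # illustrative name
spec:
  image:
    productVersion: 2.4.1
    stackableVersion: 23.4.0-rc2
  loadExamples: false
  exposeConfig: false
  credentialsSecret: airflow-credentials    # Secret holding admin user and connection strings
  webservers:
    roleGroups:
      default:
        replicas: 1
  workers:
    roleGroups:
      default:
        replicas: 2
  schedulers:
    roleGroups:
      default:
        replicas: 1
----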
+
+When an AirflowCluster is first deployed, an AirflowDB resource is created. The AirflowDB resource is a wrapper resource for the metadata SQL database that is used by Airflow to store information on users and permissions as well as workflows, task instances and their execution. The resource contains some configuration but also keeps track of whether the database has been initialized or not. It is not deleted automatically when an AirflowCluster is deleted, and so can be reused.
+
+=== Kubernetes resources
+
+Based on the custom resources you define, the Operator creates ConfigMaps, StatefulSets and Services.
+
+image::airflow_overview.drawio.svg[A diagram depicting the Kubernetes resources created by the operator]
+
+The diagram above depicts all the Kubernetes resources created by the operator, and how they relate to each other. The Job created for the AirflowDB is not shown.
+
+For every xref:concepts:roles-and-role-groups.adoc#_role_groups[role group] you define, the Operator creates a StatefulSet with the number of replicas defined in the role group. Every Pod in the StatefulSet has two containers: the main container running Airflow and a sidecar container gathering metrics for xref:operators:monitoring.adoc[]. The Operator creates a Service per role group as well as a single Service for the whole `webserver` role called `<clustername>-webserver`.
+
+ConfigMaps are also created, one per role group and one for the AirflowDB. Each ConfigMap contains two files, `log_config.py` and `webserver_config.py`, which hold the logging and general Airflow configuration respectively.
+
+== Dependencies
+
+Airflow requires an SQL database in which to store its metadata. The Stackable platform does not have its own Operator for an SQL database, but the xref:getting_started/index.adoc[] guide shows how to install an example database alongside an Airflow instance that you can use to get started.
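
The AirflowCluster points Airflow at this database (and, when the Celery executor is used, at Redis) through the Secret referenced by `credentialsSecret`. A sketch of such a Secret, assuming the key layout used in the getting started guide, with placeholder values:

[source,yaml]
----
apiVersion: v1
kind: Secret
metadata:
  name: airflow-credentials                 # placeholder name
type: Opaque
stringData:
  adminUser.username: airflow
  adminUser.firstname: Airflow
  adminUser.lastname: Admin
  adminUser.email: [email protected]
  adminUser.password: airflow               # placeholder, use a strong password
  connections.secretKey: thisISaSECRET      # placeholder Flask secret key
  connections.sqlalchemyDatabaseUri: postgresql+psycopg2://airflow:airflow@airflow-postgresql/airflow
  connections.celeryResultBackend: db+postgresql://airflow:airflow@airflow-postgresql/airflow
  connections.celeryBrokerUrl: redis://:airflow@airflow-redis-master:6379/0
----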
+
+== Using custom workflows/DAGs
+
+https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html[Directed acyclic graphs (DAGs) of tasks] are the core entities you will use in Airflow. Have a look at the page on xref:usage-guide/mounting-dags.adoc[] to learn about the different ways of loading your custom DAGs into Airflow.
+
+== Demo
+
+You can install the xref:stackablectl::demos/airflow-scheduled-job.adoc[] demo and explore an Airflow installation, as well as how it interacts with xref:spark-k8s:index.adoc[Apache Spark].

 == Supported Versions

 The Stackable Operator for Apache Airflow currently supports the following versions of Airflow:

 include::partial$supported-versions.adoc[]
-
-== Docker
-
-[source]
-----
-docker pull docker.stackable.tech/stackable/airflow:<version>
-----

@@ -0,0 +1,73 @@
= Applying Custom Resources

Airflow can be used to apply custom resources from within a cluster. An example of this could be a SparkApplication job that is to be triggered by Airflow. The steps below describe how this can be done.

== Define an in-cluster Kubernetes connection

An in-cluster connection can be created either from within the Webserver UI (note that the "in cluster configuration" box is ticked):

image::airflow_connection_ui.png[Airflow Connections]

Alternatively, the connection can be https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html[defined] by an environment variable in URI format:

[source]
AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=%7B%22extra__kubernetes__in_cluster%22%3A+true%2C+%22extra__kubernetes__kube_config%22%3A+%22%22%2C+%22extra__kubernetes__kube_config_path%22%3A+%22%22%2C+%22extra__kubernetes__namespace%22%3A+%22%22%7D"

This can be supplied directly in the custom resource for all roles (Airflow expects configuration to be common across components):

[source,yaml]
----
include::example$example-airflow-incluster.yaml[]
----
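
The included example file is not rendered here. One plausible way to set the variable on every role is the environment variable override mechanism described on the Configuration & Environment Overrides page; the connection URI is abbreviated below and the exact placement is an assumption, so treat the included example as authoritative:

[source,yaml]
----
# Sketch only: the same variable is set on all three roles, since Airflow expects common configuration.
webservers:
  envOverrides:
    AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=..."   # full URI as shown above
workers:
  envOverrides:
    AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=..."
schedulers:
  envOverrides:
    AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=..."
----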

== Define a cluster role for Airflow to create SparkApplication resources

Airflow cannot create or access SparkApplication resources by default - a cluster role is required for this:

[source,yaml]
----
include::example$example-airflow-spark-clusterrole.yaml[]
----

and a corresponding cluster role binding:

[source,yaml]
----
include::example$example-airflow-spark-clusterrolebinding.yaml[]
----
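
The included files are not rendered in this view; the following sketch shows what such a ClusterRole and ClusterRoleBinding could contain. The object names and the ServiceAccount are assumptions, not values taken from the example files:

[source,yaml]
----
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: airflow-spark-clusterrole              # assumed name
rules:
  - apiGroups: ["spark.stackable.tech"]
    resources: ["sparkapplications"]
    verbs: ["create", "get", "list", "watch", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: airflow-spark-clusterrole-binding      # assumed name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: airflow-spark-clusterrole
subjects:
  - kind: ServiceAccount
    name: airflow                              # assumed: the ServiceAccount the Airflow Pods run as
    namespace: default
----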

== DAG code

Now for the DAG itself. The job to be started is a simple Spark job that calculates the value of pi:

[source,yaml]
----
include::example$example-pyspark-pi.yaml[]
----

This will be called from within a DAG by using the connection that was defined earlier. It will be wrapped by the `KubernetesHook` that the Airflow Kubernetes provider makes available https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py[here]. There are two classes that are used to:

- start the job
- monitor the status of the job

These are written in-line in the Python code below, though this is just to make it clear how the code is used (the classes `SparkKubernetesOperator` and `SparkKubernetesSensor` will be used for all custom resources and thus are best defined as separate Python files that the DAG would reference).

[source,python]
----
include::example$example-spark-dag.py[]
----
<1> the wrapper class used for calling the job via `KubernetesHook`
<2> the connection that was created for in-cluster usage
<3> the wrapper class used for monitoring the job via `KubernetesHook`
<4> the start of the DAG code
<5> the initial task to invoke the job
<6> the subsequent task to monitor the job
<7> the tasks are chained together in the correct order

Once this DAG is xref:usage-guide/mounting-dags.adoc[mounted] in the DAG folder, it can be called and its progress viewed from within the Webserver UI:

image::airflow_dag_graph.png[Airflow DAG graph view]

Clicking on the "spark_pi_monitor" task and selecting the logs shows that the status of the job has been tracked by Airflow:

image::airflow_dag_log.png[Airflow DAG log view]

@@ -0,0 +1 @@
= Usage guide

@@ -0,0 +1,43 @@
= Log aggregation

The logs can be forwarded to a Vector log aggregator by providing a discovery ConfigMap for the aggregator and by enabling the log agent:

[source,yaml]
----
spec:
  vectorAggregatorConfigMapName: vector-aggregator-discovery
  webservers:
    config:
      logging:
        enableVectorAgent: true
        containers:
          airflow:
            loggers:
              "flask_appbuilder":
                level: WARN
  workers:
    config:
      logging:
        enableVectorAgent: true
        containers:
          airflow:
            loggers:
              "airflow.processor":
                level: INFO
  schedulers:
    config:
      logging:
        enableVectorAgent: true
        containers:
          airflow:
            loggers:
              "airflow.processor_manager":
                level: INFO
  databaseInitialization:
    logging:
      enableVectorAgent: true
----

Further information on how to configure logging can be found in xref:home:concepts:logging.adoc[].

@@ -0,0 +1,4 @@
= Monitoring

The managed Airflow instances are automatically configured to export Prometheus metrics. See xref:home:operators:monitoring.adoc[] for more details.
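
As a rough sketch, a Prometheus scrape configuration could pick up these metrics by keying on the `prometheus.io/scrape` label that Stackable operators set on their Services; this is an assumption for illustration, and the linked monitoring page describes the supported setup:

[source,yaml]
----
scrape_configs:
  - job_name: stackable-services
    kubernetes_sd_configs:
      - role: service                # discover Services in the cluster
    relabel_configs:
      # keep only Services labelled for scraping
      - source_labels: [__meta_kubernetes_service_label_prometheus_io_scrape]
        regex: "true"
        action: keep
----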

@@ -0,0 +1,51 @@
= Mounting DAGs

DAGs can be mounted by using a `ConfigMap` or a `PersistentVolumeClaim`, or loaded from a repository with git-sync. The `ConfigMap` and git-sync approaches are best illustrated with an example of each, shown in the next sections.

== via `ConfigMap`

[source,yaml]
----
include::example$example-configmap.yaml[]
----
[source,yaml]
----
include::example$example-airflow-dags-configmap.yaml[]
----
<1> The name of the configuration map
<2> The name of the DAG (this is a renamed copy of the `example_bash_operator.py` from the Airflow examples)
<3> The volume backed by the configuration map
<4> The name of the configuration map referenced by the Airflow cluster
<5> The name of the mounted volume
<6> The path of the mounted resource. Note that this should map to a single DAG.
<7> The resource has to be defined using `subPath`: this is to prevent the versioning of configuration map elements, which may cause a conflict with how Airflow propagates DAGs between its components.
<8> If the mount path described above is anything other than the standard location (the default is `$AIRFLOW_HOME/dags`), then the location should be defined using the relevant environment variable.

The advantage of this approach is that a DAG can be provided "in-line". This becomes cumbersome when multiple DAGs are to be made available in this way, as each one has to be mapped individually. For multiple DAGs it is probably easier to expose them all via a mounted volume, which is shown below.
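
Since the included example files are not rendered above, the following condensed sketch shows the two pieces side by side: a ConfigMap holding a DAG, and the volume definition referencing it. All names and paths are illustrative, and whether the volumes are declared at `spec` level or per role is an assumption; the callouts above describe the authoritative example.

[source,yaml]
----
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-dag                       # illustrative name of the configuration map
data:
  test_airflow_dag.py: |             # illustrative DAG file name
    # DAG code goes here
----
[source,yaml]
----
# Fragment of the AirflowCluster resource (other fields omitted)
spec:
  volumes:
    - name: cm-dag                   # volume backed by the configuration map
      configMap:
        name: cm-dag
  volumeMounts:
    - name: cm-dag
      mountPath: <AIRFLOW_HOME>/dags/test_airflow_dag.py   # must map to a single DAG
      subPath: test_airflow_dag.py                         # mounted via subPath, see the callouts above
----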

== via `git-sync`

=== Overview

https://github.com/kubernetes/git-sync/tree/release-3.x[git-sync] is a command that pulls a git repository into a local directory and is supplied as a sidecar container for use within Kubernetes. The Stackable implementation is a wrapper around this, such that the binary and image requirements are included in the Stackable Airflow product images and do not need to be specified or handled in the `AirflowCluster` custom resource. Internal details such as image names and volume mounts are handled by the operator, so that only the repository and synchronisation details are required. An example of this usage is given in the next section.

=== Example

[source,yaml]
----
include::example$example-airflow-gitsync.yaml[]
----

<1> A `Secret` used for accessing database and admin user details (included here to illustrate where different credential secrets are defined)
<2> The git-sync configuration block that contains the list of git-sync elements
<3> The repository that will be cloned (required)
<4> The branch name (defaults to `main`)
<5> The location of the DAG folder, relative to the synced repository root (required)
<6> The depth of syncing, i.e. the number of commits to clone (defaults to 1)
<7> The synchronisation interval in seconds (defaults to 20 seconds)
<8> The name of the `Secret` used to access the repository if it is not public. This should include two fields: `user` and `password` (which can be either a password - not recommended - or a GitHub token, as described https://github.com/kubernetes/git-sync/tree/v3.6.4#flags-which-configure-authentication[here])
<9> A map of optional configuration settings that are listed in https://github.com/kubernetes/git-sync/tree/v3.6.4#primary-flags[this] configuration section (and the ones that follow on that link)
<10> An example showing how to specify a target revision (the default is HEAD). The revision can also be a tag or a commit, though this assumes that the target hash is contained within the number of commits specified by `depth`. If a tag or commit hash is specified, git-sync will recognise that and not perform further cloning.

IMPORTANT: The example above shows a _*list*_ of git-sync definitions, with a single element. This is to avoid breaking changes in future releases. Currently, only one such git-sync definition is considered and processed.
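
The git-sync example file is likewise not rendered in this view. The following sketch maps the callouts above onto a resource fragment; all field names are assumptions based on the descriptions, so consult the included example or the CRD reference for the exact schema:

[source,yaml]
----
# Field names below are assumptions mapped from the callout descriptions above.
spec:
  dagsGitSync:                                            # list of git-sync definitions, only the first is processed
    - repo: https://github.com/example-org/airflow-dags    # placeholder repository URL (required)
      branch: main                                        # defaults to main
      gitFolder: dags                                     # DAG folder relative to the repository root (required)
      depth: 1                                            # number of commits to clone
      wait: 20                                            # synchronisation interval in seconds
      credentialsSecret: git-credentials                  # Secret with user and password for private repositories
      gitSyncConf:                                        # additional optional git-sync flags
        --rev: HEAD                                       # target revision; a tag or commit hash also works
----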

@@ -0,0 +1,42 @@
= Configuration & Environment Overrides

The cluster definition also supports overriding configuration properties and environment variables, either per role or per role group, where the more specific override (role group) has precedence over the less specific one (role).

IMPORTANT: Overriding certain properties which are set by the operator (such as the HTTP port) can interfere with the operator and can lead to problems. Additionally, for Airflow it is recommended that each component has the same configuration: not all components use each setting, but some things - such as external endpoints - need to be consistent for things to work as expected.

== Configuration Properties

Airflow exposes an environment variable for every Airflow configuration setting, a list of which can be found in the https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html[Configuration Reference].

Although these settings can be overridden in one of two ways (configuration overrides, or environment variable overrides), the effect is the same, and currently only the latter is implemented. This is described in the following section.

== Environment Variables

These can be set - or overwritten - at either the role level:

[source,yaml]
----
webservers:
  envOverrides:
    AIRFLOW__WEBSERVER__AUTO_REFRESH_INTERVAL: "8"
  roleGroups:
    default:
      replicas: 1
----

Or per role group:

[source,yaml]
----
webservers:
  roleGroups:
    default:
      envOverrides:
        AIRFLOW__WEBSERVER__AUTO_REFRESH_INTERVAL: "8"
      replicas: 1
----

In both examples above we are replacing the default value of the UI DAG refresh interval (3s) with 8s. Note that all override property values must be strings.

docs/modules/airflow/pages/pod_placement.adoc renamed to docs/modules/airflow/pages/usage-guide/pod-placement.adoc

+1-1

@@ -5,4 +5,4 @@ You can configure the Pod placement of the Airflow pods as described in xref:con
 The default affinities created by the operator are:

 1. Co-locate all the Airflow Pods (weight 20)
-2. Distribute all Pods within the same role (worker, webserver, scheduler) (weight 70)
+2. Distribute all Pods within the same role (worker, webserver, scheduler) (weight 70)

@@ -0,0 +1,67 @@
= Security

== Authentication

Every user has to authenticate themselves before using Airflow, and there are several ways of doing this.

=== Webinterface

The default setting is to view and manually set up users via the Webserver UI. Note the blue "+" button where users can be added directly:

image::airflow_security.png[Airflow Security menu]

=== LDAP

Airflow supports xref:nightly@home:concepts:authentication.adoc[authentication] of users against an LDAP server. This requires setting up an xref:nightly@home:concepts:authentication.adoc#authenticationclass[AuthenticationClass] for the LDAP server.
The AuthenticationClass is then referenced in the AirflowCluster resource as follows:

[source,yaml]
----
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow-with-ldap
spec:
  image:
    productVersion: 2.4.1
    stackableVersion: 23.4.0-rc2
  [...]
  authenticationConfig:
    authenticationClass: ldap # <1>
    userRegistrationRole: Admin # <2>
----

<1> The reference to an AuthenticationClass called `ldap`
<2> The default role that all users are assigned to

Users that log in with LDAP are assigned to a default https://airflow.apache.org/docs/apache-airflow/stable/security/access-control.html#access-control[Role] which is specified with the `userRegistrationRole` property.

You can follow the xref:nightly@home:tutorials:authentication_with_openldap.adoc[] tutorial to learn how to set up an AuthenticationClass for an LDAP server, and consult the xref:nightly@home:reference:authenticationclass.adoc[] reference.
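
For orientation, an AuthenticationClass for an LDAP server might look like the following sketch; hostname, port and search base are placeholders, and the linked tutorial and reference contain the full set of options:

[source,yaml]
----
apiVersion: authentication.stackable.tech/v1alpha1
kind: AuthenticationClass
metadata:
  name: ldap                                              # the name referenced by authenticationClass
spec:
  provider:
    ldap:
      hostname: openldap.default.svc.cluster.local        # placeholder
      port: 1389                                          # placeholder
      searchBase: ou=users,dc=example,dc=org              # placeholder
----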

The users and roles can be viewed as before in the Webserver UI, but note that the blue "+" button is not available when authenticating against LDAP:

image::airflow_security_ldap.png[Airflow Security menu]

== Authorization

The Airflow Webserver delegates the https://airflow.apache.org/docs/apache-airflow/stable/security/access-control.html[handling of user access control] to https://flask-appbuilder.readthedocs.io/en/latest/security.html[Flask AppBuilder].

=== Webinterface

You can view, add to, and assign the roles displayed in the Airflow Webserver UI to existing users.

=== LDAP

Airflow supports assigning https://airflow.apache.org/docs/apache-airflow/stable/security/access-control.html#access-control[Roles] to users based on their LDAP group membership, though this is not yet supported by the Stackable operator.
All the users logging in via LDAP get assigned to the same role, which you can configure via the attribute `authenticationConfig.userRegistrationRole` on the `AirflowCluster` object:

[source,yaml]
----
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow-with-ldap
spec:
  [...]
  authenticationConfig:
    authenticationClass: ldap
    userRegistrationRole: Admin # <1>
----

<1> All users are assigned to the `Admin` role