= Stackable Operator for Apache Airflow
:description: The Stackable Operator for Apache Airflow is a Kubernetes operator that can manage Apache Airflow clusters. Learn about its features, resources, dependencies and demos, and see the list of supported Airflow versions.
:keywords: Stackable Operator, Apache Airflow, Kubernetes, k8s, operator, engineer, big data, metadata, job pipeline, scheduler, workflow, ETL

The Stackable Operator for Apache Airflow manages https://airflow.apache.org/[Apache Airflow] instances on Kubernetes.
Apache Airflow is an open-source application for creating, scheduling, and monitoring workflows. Workflows are defined as code, with tasks that can be run on a variety of platforms, including Hadoop, Spark, and Kubernetes itself. Airflow is a popular choice to orchestrate ETL workflows and data pipelines.

== Getting started

Get started using Airflow with the Stackable Operator by following the xref:getting_started/index.adoc[] guide. It leads you through installing the Operator alongside a PostgreSQL database and a Redis instance, connecting to your Airflow instance and running your first workflow.

== Resources

The Operator manages two https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/[custom resources]: the _AirflowCluster_ and the _AirflowDB_. It creates a number of different Kubernetes resources based on the custom resources.

=== Custom resources

The AirflowCluster is the main resource for configuring the Airflow instance. The resource defines three xref:concepts:roles-and-role-groups.adoc[roles]: `webserver`, `worker` and `scheduler`. The various configuration options are explained in the xref:usage-guide/index.adoc[]. It helps you tune your cluster to your needs by configuring xref:usage-guide/storage-resources.adoc[resource usage], xref:usage-guide/security.adoc[security], xref:usage-guide/logging.adoc[logging] and more.

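For illustration, a minimal AirflowCluster manifest might look like the sketch below. The exact spec layout, the `airflow.stackable.tech/v1alpha1` API version and the `credentialsSecret` field are assumptions modelled on a typical operator release; consult the xref:getting_started/index.adoc[] guide for a manifest that matches your version.

[source,yaml]
----
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  version: 2.2.3            # one of the supported Airflow versions
  executor: CeleryExecutor  # the worker role assumes the Celery executor
  loadExamples: false
  exposeConfig: false
  credentialsSecret: airflow-credentials  # Secret holding the admin user and database connections
  webservers:
    roleGroups:
      default:
        replicas: 1
  workers:
    roleGroups:
      default:
        replicas: 2
  schedulers:
    roleGroups:
      default:
        replicas: 1
----
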
When an AirflowCluster is first deployed, an AirflowDB resource is created. The AirflowDB resource is a wrapper resource for the metadata SQL database that Airflow uses to store information on users and permissions as well as workflows, task instances and their execution. The resource contains some configuration but also keeps track of whether the database has been initialized or not. It is not deleted automatically when an AirflowCluster is deleted, so it can be reused.

=== Kubernetes resources

Based on the custom resources you define, the Operator creates ConfigMaps, StatefulSets and Services.

image::airflow_overview.drawio.svg[A diagram depicting the Kubernetes resources created by the operator]

The diagram above depicts all the Kubernetes resources created by the operator, and how they relate to each other. The Job created for the AirflowDB is not shown.

For every xref:concepts:roles-and-role-groups.adoc#_role_groups[role group] you define, the Operator creates a StatefulSet with the number of replicas defined in the role group. Every Pod in the StatefulSet has two containers: the main container running Airflow and a sidecar container gathering metrics for xref:operators:monitoring.adoc[]. The Operator creates a Service per role group, as well as a single Service for the whole `webserver` role called `<clustername>-webserver`.

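As a sketch, two role groups for the `worker` role might be defined as in the fragment below; the group names and the resource fields are illustrative, not prescribed by the Operator. Each group would then be backed by its own StatefulSet and Service:

[source,yaml]
----
workers:
  roleGroups:
    default:        # one StatefulSet with 2 replicas, plus a Service
      replicas: 2
    high-memory:    # a second role group with its own StatefulSet and Service
      replicas: 1
      config:
        resources:
          memory:
            limit: 4Gi
----
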
ConfigMaps are created, one per role group and one for the AirflowDB. Each ConfigMap contains two files: `log_config.py` and `webserver_config.py`, which contain the logging and general Airflow configuration respectively.

== Dependencies

Airflow requires an SQL database in which to store its metadata. The Stackable platform does not have its own operator for an SQL database, but the xref:getting_started/index.adoc[] guide walks you through installing an example database alongside an Airflow instance that you can use to get started.

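For reference, the database connection is typically passed to Airflow through the Secret referenced by `credentialsSecret`. Below is a sketch assuming a PostgreSQL database and a Redis broker; the Secret name, hostnames and key names are assumptions modelled on the getting started setup and may differ in your release:

[source,yaml]
----
apiVersion: v1
kind: Secret
metadata:
  name: airflow-credentials
type: Opaque
stringData:
  adminUser.username: airflow
  adminUser.firstname: Airflow
  adminUser.lastname: Admin
  adminUser.email: airflow@example.com
  adminUser.password: airflow
  # secret key the webserver uses to sign session cookies
  connections.secretKey: thisISaSECRET_1234
  # SQLAlchemy URI of the metadata database
  connections.sqlalchemyDatabaseUri: postgresql+psycopg2://airflow:airflow@airflow-postgresql/airflow
  # Celery result backend and broker, used by the worker role
  connections.celeryResultBackend: db+postgresql://airflow:airflow@airflow-postgresql/airflow
  connections.celeryBrokerUrl: redis://:airflow@airflow-redis-master:6379/0
----
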
== Using custom workflows/DAGs

https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html[Directed acyclic graphs (DAGs) of tasks] are the core entities you will use in Airflow. Have a look at the page on xref:usage-guide/mounting-dags.adoc[] to learn about the different ways of loading your custom DAGs into Airflow.

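One of the approaches covered there is shipping DAGs in a ConfigMap that is mounted into the Airflow Pods. A minimal sketch, assuming a ConfigMap named `airflow-dags` (see the linked page for how the mount is wired up):

[source,yaml]
----
apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-dags
data:
  hello_dag.py: |
    """An illustrative single-task DAG."""
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(dag_id="hello", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
        BashOperator(task_id="say_hello", bash_command="echo hello")
----
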
== Demo

You can install the xref:stackablectl::demos/airflow-scheduled-job.adoc[] demo and explore an Airflow installation, as well as how it interacts with xref:spark-k8s:index.adoc[Apache Spark].

== Supported Versions

The Stackable Operator for Apache Airflow currently supports the following versions of Airflow:

include::partial$supported-versions.adoc[]