= First steps

Once you have followed the steps in the xref:installation.adoc[] section to install the Operator and its dependencies, you can create a Spark job. Afterwards you can <<_verify_that_it_works, verify that it works>> by looking at the logs from the driver pod.


== Airflow

An Airflow cluster is made up of three components (a minimal manifest sketch follows this list):

- `webserver`: provides the main UI for user interaction
- `workers`: the nodes over which the scheduler distributes the job workload
- `scheduler`: responsible for triggering jobs and persisting their metadata to the backend database

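The sketch below shows how these three roles are typically declared in an `AirflowCluster` resource. It is only an orientation aid: the `apiVersion`, field names and values are assumptions based on a typical getting-started setup and may differ between operator versions, so treat the manifest used in your installation as authoritative. The `default` role-group name and the two worker replicas line up with the pod names shown later in this guide.

[source,yaml]
----
apiVersion: airflow.stackable.tech/v1alpha1      # assumed API version
kind: AirflowCluster
metadata:
  name: airflow
spec:
  executor: CeleryExecutor                       # illustrative executor choice
  loadExamples: true                             # load the example DAGs shown later
  credentialsSecret: simple-airflow-credentials  # hypothetical Secret holding the admin credentials
  webservers:
    roleGroups:
      default:
        replicas: 1
  workers:
    roleGroups:
      default:
        replicas: 2
  schedulers:
    roleGroups:
      default:
        replicas: 1
----
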
Create a file named `pyspark-pi.yaml` with the following contents:

[source,yaml]
----
include::example$code/pyspark-pi.yaml[]
----

And apply it:

----
include::example$code/getting_started.sh[tag=install-sparkapp]
----
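
If you are not using the accompanying script, the step above amounts to applying the manifest with `kubectl`:

[source,bash]
----
# Create (or update) the SparkApplication from the manifest written above
kubectl apply -f pyspark-pi.yaml
----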

Where (these fields are shown together in the sketch below):

- `metadata.name`: the name of the SparkApplication
- `spec.version`: the current version is "1.0"
- `spec.sparkImage`: the docker image used by the job, driver and executor pods. This can be provided by the user.
- `spec.mode`: only `cluster` is currently supported
- `spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
- `spec.driver`: driver-specific settings.
- `spec.executor`: executor-specific settings.
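
Putting these fields together, a manifest of this kind looks roughly like the following sketch. The image tag, application path and driver/executor settings are illustrative assumptions and may differ from the included example; consult the operator's CRD reference for the exact fields.

[source,yaml]
----
apiVersion: spark.stackable.tech/v1alpha1   # assumed API version
kind: SparkApplication
metadata:
  name: pyspark-pi
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/pyspark-k8s:3.3.0-stackable0.3.0  # illustrative image tag
  mode: cluster
  mainApplicationFile: local:///stackable/spark/examples/src/main/python/pi.py  # illustrative path inside the image
  driver:      # illustrative driver settings
    cores: 1
    memory: "512m"
  executor:    # illustrative executor settings
    cores: 1
    instances: 3
    memory: "512m"
----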


NOTE: If you are using Stackable image versions, note that the version you specify in `spec.sparkImage` is not just the Spark version you want to roll out; it must also carry a Stackable version suffix, as shown. The Stackable version identifies the underlying container image that is used to execute the processes. For a list of available versions, please check our
https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%2Fspark-k8s%2Ftags[image registry].
It should generally be safe to simply use the latest image version that is available.

This will create the SparkApplication that in turn creates the Spark job.
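
One way to follow the job's progress from the command line is sketched below. The `sparkapplications` resource name and the driver pod naming are assumptions: the driver pod name is derived from `metadata.name`, so look it up with `kubectl get pods` before following its logs.

[source,bash]
----
# Check that the SparkApplication resource was created
kubectl get sparkapplications

# Watch the job, driver and executor pods appear
kubectl get pods

# Follow the driver logs (replace the placeholder with the actual driver pod name,
# e.g. something like pyspark-pi-...-driver)
kubectl logs -f <driver-pod-name>
----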

== Initialization of the Airflow database

When an Airflow cluster is created, a database-initialization job is started first to ensure that the database schema is present and that it is populated with an admin user. A Kubernetes Job is created, which starts a pod to initialize the database. This can take a while.

You can use kubectl to wait on the resource. Note that the cluster itself will not be created until this step is complete:

[source,bash]
----
include::example$code/getting_started.sh[tag=wait-airflowdb]
----
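
The included script waits on the database-initialization resource; an equivalent manual check, assuming the initialization Job is named `airflow` as in the output shown below, is to wait for that Job to complete:

[source,bash]
----
# Block until the database-initialization Job has finished (adjust the name and timeout as needed)
kubectl wait --for=condition=complete job/airflow --timeout=300s
----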

The job status can be inspected and verified like this:

[source,bash]
----
kubectl get jobs
----

which will show something like this:

----
NAME      COMPLETIONS   DURATION   AGE
airflow   1/1           85s        11m
----
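
If the job does not reach `1/1` completions, its pod logs are the first place to look (the job name is taken from the output above):

[source,bash]
----
# Show the logs of the pod created by the initialization Job
kubectl logs job/airflow
----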

Then, make sure that all the Pods in the StatefulSets are ready:

[source,bash]
----
kubectl get statefulset
----

The output should show all pods ready, including the external dependencies:

----
NAME                        READY   AGE
airflow-postgresql          1/1     16m
airflow-redis-master        1/1     16m
airflow-redis-replicas      1/1     16m
airflow-scheduler-default   1/1     11m
airflow-webserver-default   1/1     11m
airflow-worker-default      2/2     11m
----

The completed set of pods for the Airflow cluster will look something like this:

image::airflow_pods.png[Airflow pods]
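
The same overview is available on the command line:

[source,bash]
----
# List all pods belonging to the Airflow cluster and its dependencies
kubectl get pods
----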

When the Airflow cluster has been created and the database is initialized, Airflow can be opened in the
browser: the webserver UI port (which defaults to `8080`) can be forwarded to the local host:

----
include::example$code/getting_started.sh[tag=port-forwarding]
----
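
Done manually, the port-forwarding step typically looks like the following; the service name is an assumption, so check `kubectl get services` for the exact name in your cluster:

[source,bash]
----
# Forward the webserver UI port to localhost:8080 (service name may differ, e.g. airflow-webserver-default)
kubectl port-forward service/airflow-webserver 8080:8080
----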

== Verify that it works

The webserver UI can now be opened in the browser at `http://localhost:8080`. Enter the admin credentials from the Kubernetes secret:
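
If your setup follows the typical getting-started example, the credentials live in a Secret created alongside the cluster. The Secret name `simple-airflow-credentials` and the `adminUser.username`/`adminUser.password` keys below are assumptions; adjust them to whatever your cluster definition references:

[source,bash]
----
# Read the admin username and password from the (assumed) credentials Secret
kubectl get secret simple-airflow-credentials -o jsonpath='{.data.adminUser\.username}' | base64 -d
kubectl get secret simple-airflow-credentials -o jsonpath='{.data.adminUser\.password}' | base64 -d
----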

image::airflow_login.png[Airflow login screen]

Since the examples were loaded in the cluster definition, they will appear under the DAGs tab:

image::airflow_dags.png[Example Airflow DAGs]

Select one of these DAGs by clicking on its name in the left-hand column, e.g. `example_complex`. Click on the arrow at the top right of the screen, select "Trigger DAG", and the DAG nodes will be highlighted automatically as the job works through its phases.

image::airflow_running.png[Airflow DAG in action]

Great! You have set up an Airflow cluster, connected to it and run your first DAG!

== What's next

Look at the xref:ROOT:usage.adoc[Usage page] to find out more about configuring your Airflow cluster and loading your own DAG files.