feat: convert anomaly demo to spark-connect (#209)
* new stack: spark-connect-notebook
* use stackable image
* Update jupyterhub-pyspark-hdfs stack to use JupyterLab and Spark-connect
* Remove stack spark-connect-notebook
* delete templated jupyterhub.yaml
* delete demo Dockerfile
* various tweaks to the lab deployment
* make token a stack parameter
* use 8080
* start in /notebook
* jupyterlab fix service port
* doc updates
* remove trailing whitespace
* remove empty line
* Update docs/modules/demos/pages/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc
Co-authored-by: Nick <[email protected]>
* Update docs/modules/demos/pages/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc
Co-authored-by: Nick <[email protected]>
* Update docs/modules/demos/pages/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc
Co-authored-by: Nick <[email protected]>
* link to spark connect client image
* Update spark-connect-client image references to use the correct repository path
* Update stacks/jupyterhub-pyspark-hdfs/jupyterlab.yaml
Co-authored-by: Nick <[email protected]>
* notebook: update remote connect to match listener service
* stack: install yamls from GH
---------
Co-authored-by: Nick <[email protected]>
-This demo showcases the integration between {jupyter}[Jupyter] and {hadoop}[Apache Hadoop] deployed on the Stackable Data Platform (SDP) Kubernetes cluster.
-{jupyterlab}[JupyterLab] is deployed using the {jupyterhub-k8s}[pyspark-notebook stack] provided by the Jupyter community.
-The SDP makes this integration easy by publishing a discovery ConfigMap for the HDFS cluster.
-This ConfigMap is then mounted in all Pods running {pyspark}[PySpark] notebooks so that these have access to HDFS data.
+This demo showcases the integration between {jupyterlab}[JupyterLab], {spark-connect}[Spark Connect] and {hadoop}[Apache Hadoop] deployed on the Stackable Data Platform (SDP) Kubernetes cluster.
+The SDP makes this integration easy by publishing a discovery ConfigMap for the HDFS cluster and a Spark Connect service.
+This ConfigMap is then mounted in all Pods running {pyspark}[PySpark] so that these have access to HDFS data.
+The Jupyter notebook is a lightweight client that delegates the model training to the Spark Connect service.
For this demo, the HDFS cluster is provisioned with a small sample of the {nyc-taxi}[NYC taxi trip dataset], which is analyzed with a notebook that is provisioned automatically in the JupyterLab interface.

Install this demo on an existing Kubernetes cluster:
@@ -39,12 +40,9 @@ To run this demo, your system needs at least:

== Aim / Context

-This demo does not use the Stackable operator for Spark but rather delegates the creation of executor pods to JupyterHub.
-The intention is to demonstrate how to interact with SDP components when designing and testing Spark jobs:
-the resulting script and Spark job definition can then be transferred with a Stackable SparkApplication resource.
-When logging in to JupyterHub (described below), a pod will be created with the username as a suffix, e.g. `jupyter-admin`.
-Doing so runs a container hosting a Jupyter Notebook with pre-installed Spark, Java and Python.
-When the user creates a SparkSession, temporary spark executors are constructed that are persisted until the notebook kernel is shut down or restarted.
+This demo uses Stackable operators to deploy a Spark Connect server and an HDFS cluster.
+The intention is to demonstrate how clients, in this case a JupyterLab notebook, can interact with SDP components.
+The notebook creates a SparkSession that delegates the data analysis and model training to a Spark Connect service, thus offloading the resource-intensive work to the Kubernetes cluster.
The notebook can thus be used as a sandbox for writing, testing and benchmarking Spark jobs before they are moved into production.
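
As a sketch of what this looks like from the notebook side, the session below connects to the Spark Connect endpoint and reads the taxi data from HDFS; the service name, port and HDFS path are illustrative assumptions, not values taken from the stack definitions.

[source,python]
----
from pyspark.sql import SparkSession

# Connecting only opens a gRPC channel to the Spark Connect server inside the
# Kubernetes cluster; no Spark executors run in the notebook pod itself.
spark = SparkSession.builder.remote("sc://spark-connect-server:15002").getOrCreate()

# The HDFS discovery ConfigMap is mounted into the Spark Connect pods, so the
# notebook itself needs no Hadoop configuration (the path is an example).
trips = spark.read.parquet("hdfs://hdfs/ny-taxi-data")
print(trips.count())
----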

== Overview
@@ -53,7 +51,7 @@ This demo will:

* Install the required Stackable Data Platform operators.
* Spin up the following data products:
-** *JupyterHub*: A multi-user server for Jupyter notebooks
+** *JupyterLab*: A web-based interactive development environment for notebooks.
** *Apache HDFS*: A distributed file system used to store the taxi dataset
* Download a sample of the NY taxi dataset into HDFS.
* Install Jupyter notebook.
@@ -78,61 +76,47 @@ Found 1 items

There should be one parquet file containing taxi trip data from September 2020.

-== JupyterHub
+== JupyterLab

Have a look at the available Pods before logging in:

[source,console]
----
$ kubectl get pods
-NAME                         READY   STATUS      RESTARTS   AGE
-hdfs-datanode-default-0      1/1     Running     0          5m12s
-hdfs-journalnode-default-0   1/1     Running     0          5m12s
-hdfs-namenode-default-0      2/2     Running     0          5m12s
-hdfs-namenode-default-1      2/2     Running     0          3m44s
-hub-567c994c8c-rbdbd         1/1     Running     0          5m36s
-load-test-data-5sp68         0/1     Completed   0          5m11s
-proxy-7bf49bb844-mhx66       1/1     Running     0          5m36s
-zookeeper-server-default-0   1/1     Running     0          5m12s
-----
-
-JupyterHub will create a Pod for each active user.
-In order to reach the JupyterHub web interface, create a port-forward:
-This is created by taking a Spark image, in this case `oci.stackable.tech/sdp/spark-k8s:3.5.0-stackable24.3.0`, installing specific python libraries into it
-, and re-tagging the image:
-
-[source,console]
-----
-FROM oci.stackable.tech/sdp/spark-k8s:3.5.0-stackable24.3.0
The Python notebook uses libraries such as `pandas` and `scikit-learn` to analyze the data.
+In addition, since the model training is delegated to a Spark Connect server, some of these dependencies, most notably `scikit-learn`, must also be made available on the Spark Connect pods.
+For convenience, a custom image is used in this demo that bundles all the required libraries for both the notebook and the Spark Connect server.
+The source of the image is available {spark-connect-client}[here].

-NOTE: Using a custom image requires access to a repository where the image can be made available.
+In practice, clients of Spark Connect do not need a full-blown Spark installation available locally, but only the libraries that are used in the notebook.
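
A quick way to confirm that the notebook client and the custom image agree on library versions is to run a trivial UDF on the server. This is only a sketch; the Spark Connect service name is assumed.

[source,python]
----
import sklearn
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.remote("sc://spark-connect-server:15002").getOrCreate()

@udf(returnType="string")
def server_sklearn_version(dummy):
    # The function body is pickled and executed on the Spark Connect executors,
    # so this import resolves against the libraries bundled into the custom image.
    import sklearn
    return sklearn.__version__

print("client:", sklearn.__version__)
print("server:", spark.range(1).select(server_sklearn_version("id")).first()[0])
----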

== Model details

The job uses an implementation of the Isolation Forest {forest-algo}[algorithm] provided by the scikit-learn {scikit-lib}[library]:
-the model is trained and then invoked by a user-defined function (see {forest-article}[this article] for how to call the sklearn library with a pyspark UDF), all of which is run using the Spark executors spun up in the current SparkSession.
+the model is trained and then invoked by a user-defined function (see {forest-article}[this article] for how to call the sklearn library with a pyspark UDF), all of which is run using the Spark Connect executors.
This type of model attempts to isolate each data point by continually partitioning the data.
Data closely packed together will require more partitions to separate data points.
In contrast, any outliers will require less: the number of partitions needed for a particular data point is thus inversely proportional to the anomaly "score".
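
The snippet below sketches that pattern end to end: a model is fitted on a sample of the data on the client and then applied at scale through a pandas UDF that runs on the Spark Connect executors. The service name, HDFS path, feature columns and model parameters are illustrative assumptions, not the exact code of the demo notebook.

[source,python]
----
import pandas as pd
from sklearn.ensemble import IsolationForest
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.remote("sc://spark-connect-server:15002").getOrCreate()

trips = spark.read.parquet("hdfs://hdfs/ny-taxi-data")   # assumed location
features = ["trip_distance", "total_amount"]             # assumed feature columns

# Fit the model on a small sample pulled back to the notebook client.
sample = trips.select(*features).dropna().limit(10_000).toPandas()
model = IsolationForest(contamination=0.01, random_state=42).fit(sample)

# Score the full dataset on the Spark Connect executors: the fitted model is
# pickled into the UDF, which is why scikit-learn must exist on the server pods.
@pandas_udf("double")
def anomaly_score(distance: pd.Series, amount: pd.Series) -> pd.Series:
    X = pd.DataFrame({"trip_distance": distance, "total_amount": amount}).fillna(0.0)
    return pd.Series(model.score_samples(X))

# Lower scores indicate more anomalous trips.
scored = trips.withColumn("score", anomaly_score("trip_distance", "total_amount"))
scored.orderBy("score").show(10)
----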