
Commit 356f415

doc updates
1 parent 30394db commit 356f415


5 files changed: +37 -117 lines changed


docs/modules/demos/pages/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc

Lines changed: 37 additions & 117 deletions
@@ -7,16 +7,16 @@
 :pyspark: https://spark.apache.org/docs/latest/api/python/getting_started/index.html
 :forest-algo: https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf
 :nyc-taxi: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
-:jupyterhub-k8s: https://github.com/jupyterhub/zero-to-jupyterhub-k8s
 :jupyterlab: https://jupyterlab.readthedocs.io/en/stable/
 :parquet: https://parquet.apache.org/
 :hadoop: https://hadoop.apache.org/
 :jupyter: https://jupyter.org
+:spark-connect: https://spark.apache.org/docs/latest/spark-connect-overview.html
 
-This demo showcases the integration between {jupyter}[Jupyter] and {hadoop}[Apache Hadoop] deployed on the Stackable Data Platform (SDP) Kubernetes cluster.
-{jupyterlab}[JupyterLab] is deployed using the {jupyterhub-k8s}[pyspark-notebook stack] provided by the Jupyter community.
-The SDP makes this integration easy by publishing a discovery ConfigMap for the HDFS cluster.
-This ConfigMap is then mounted in all Pods running {pyspark}[PySpark] notebooks so that these have access to HDFS data.
+This demo showcases the integration between {jupyterlab}[JupyterLab], {spark-connect}[Spark Connect] and {hadoop}[Apache Hadoop] deployed on the Stackable Data Platform (SDP) Kubernetes cluster.
+The SDP makes this integration easy by publishing a discovery ConfigMap for the HDFS cluster and a Spark Connect service.
+This ConfigMap is then mounted in all Pods running {pyspark}[PySpark] so that these have access to HDFS data.
+The Jupyter notebook is a lightweight client that delegates the model training to the Spark Connect service.
 For this demo, the HDFS cluster is provisioned with a small sample of the {nyc-taxi}[NYC taxi trip dataset], which is analyzed with a notebook that is provisioned automatically in the JupyterLab interface.
 
 Install this demo on an existing Kubernetes cluster:
@@ -39,12 +39,9 @@ To run this demo, your system needs at least:
 
 == Aim / Context
 
-This demo does not use the Stackable operator for Spark but rather delegates the creation of executor pods to JupyterHub.
-The intention is to demonstrate how to interact with SDP components when designing and testing Spark jobs:
-the resulting script and Spark job definition can then be transferred with a Stackable SparkApplication resource.
-When logging in to JupyterHub (described below), a pod will be created with the username as a suffix, e.g. `jupyter-admin`.
-Doing so runs a container hosting a Jupyter Notebook with pre-installed Spark, Java and Python.
-When the user creates a SparkSession, temporary spark executors are constructed that are persisted until the notebook kernel is shut down or restarted.
+This demo uses Stackable operators to deploy a Spark Connect server and an HDFS cluster.
+The intention is to demonstrate how clients, in this case a JupyterLab notebook, can interact with SDP components.
+The notebook creates a SparkSession that delegates the data analysis and model training to a Spark Connect service, thus offloading the resource-intensive work to the Kubernetes cluster.
 The notebook can thus be used as a sandbox for writing, testing and benchmarking Spark jobs before they are moved into production.
 
 == Overview
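
For orientation, the delegation described above amounts to the notebook opening a remote session against the Spark Connect endpoint. A minimal sketch, assuming an in-cluster service named `spark-connect-server` listening on the default Spark Connect port 15002 (both are illustrative assumptions, not taken from this commit):

[source,python]
----
from pyspark.sql import SparkSession

# Open a lightweight client session against the Spark Connect server.
# All heavy computation runs on the server's executors inside the cluster,
# not in the notebook pod.
spark = (
    SparkSession.builder
    .remote("sc://spark-connect-server:15002")  # assumed service name and default port
    .getOrCreate()
)
----
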
@@ -53,7 +50,7 @@ This demo will:
 
 * Install the required Stackable Data Platform operators.
 * Spin up the following data products:
-** *JupyterHub*: A multi-user server for Jupyter notebooks
+** *JupyterLab*: A web-based interactive development environment for notebooks.
 ** *Apache HDFS*: A distributed file system used to store the taxi dataset
 * Download a sample of the NY taxi dataset into HDFS.
 * Install Jupyter notebook.
@@ -78,61 +75,47 @@ Found 1 items
 
 There should be one parquet file containing taxi trip data from September 2020.
 
-== JupyterHub
+== JupyterLab
 
 Have a look at the available Pods before logging in:
 
 [source,console]
 ----
 $ kubectl get pods
-NAME                         READY   STATUS      RESTARTS   AGE
-hdfs-datanode-default-0      1/1     Running     0          5m12s
-hdfs-journalnode-default-0   1/1     Running     0          5m12s
-hdfs-namenode-default-0      2/2     Running     0          5m12s
-hdfs-namenode-default-1      2/2     Running     0          3m44s
-hub-567c994c8c-rbdbd         1/1     Running     0          5m36s
-load-test-data-5sp68         0/1     Completed   0          5m11s
-proxy-7bf49bb844-mhx66       1/1     Running     0          5m36s
-zookeeper-server-default-0   1/1     Running     0          5m12s
-----
-
-JupyterHub will create a Pod for each active user.
-In order to reach the JupyterHub web interface, create a port-forward:
+NAME                                           READY   STATUS      RESTARTS   AGE
+hdfs-datanode-default-0                        1/1     Running     0          38m
+hdfs-journalnode-default-0                     1/1     Running     0          38m
+hdfs-namenode-default-0                        2/2     Running     0          38m
+hdfs-namenode-default-1                        2/2     Running     0          36m
+jupyterlab-76d67b9bfb-thmtq                    1/1     Running     0          22m
+load-test-data-hcj92                           0/1     Completed   0          26m
+spark-connect-server-66db874cbb-9nbpf          1/1     Running     0          34m
+spark-connect-server-9c6bfd9690213314-exec-1   1/1     Running     0          34m
+spark-connect-server-9c6bfd9690213314-exec-2   1/1     Running     0          34m
+spark-connect-server-9c6bfd9690213314-exec-3   1/1     Running     0          34m
+spark-connect-server-9c6bfd9690213314-exec-4   1/1     Running     0          34m
+zookeeper-server-default-0                     1/1     Running     0          38m
+----
+
+In order to reach the JupyterLab web interface, create a port-forward:
 
 [source,console]
 ----
-$ kubectl port-forward service/proxy-public 8080:http
+$ kubectl port-forward service/jupyterlab 8080:http
 ----
 
-WARNING: Use the `proxy-public` service and not something else!
+The `jupyterlab` service is created alongside the JupyterLab deployment.
 
 Now access the JupyterHub web interface via http://localhost:8080
 
-You should see the JupyterHub login page.
-
-image::jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/jupyter_hub_login.png[]
-
-Log in with username `admin` and password `adminadmin`.
-There should appear a new pod called `jupyter-admin`:
+You should see the JupyterLab login page.
 
-[source,console]
-----
-$ kubectl get pods
-NAME                         READY   STATUS      RESTARTS   AGE
-hdfs-datanode-default-0      1/1     Running     0          6m12s
-hdfs-journalnode-default-0   1/1     Running     0          6m12s
-hdfs-namenode-default-0      2/2     Running     0          6m12s
-hdfs-namenode-default-1      2/2     Running     0          4m44s
-hub-567c994c8c-rbdbd         1/1     Running     0          6m36s
-jupyter-admin                1/1     Running     0          77s
-load-test-data-5sp68         0/1     Completed   0          6m11s
-proxy-7bf49bb844-mhx66       1/1     Running     0          6m36s
-zookeeper-server-default-0   1/1     Running     0          6m12s
-----
+image::jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/jupyterlab_login.png[]
 
+Log in with token `adminadmin`.
 You should arrive at your workspace:
 
-image::jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/jupyter_hub_workspace.png[]
+image::jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/jupyterlab_workspace.png[]
 
 Now you can double-click on the `notebook` folder on the left, open and run the contained file.
 Click on the double arrow (⏩️) to execute the Python scripts (click on the image below to go to the notebook file).
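
Once the notebook runs against the Spark Connect session, it can load the taxi sample directly from HDFS. A minimal sketch of that read, assuming the `spark` session from the earlier sketch and an illustrative HDFS path (the real path is created by the demo's load job, not by this commit):

[source,python]
----
# Read the Parquet sample that the load job copied into HDFS.
# The NameNode addresses are resolved via the core-site.xml/hdfs-site.xml from the
# HDFS discovery ConfigMap mounted into the Spark Connect pods.
taxi = spark.read.parquet("hdfs://hdfs/ny-taxi-data/")  # path and nameservice are illustrative

taxi.printSchema()
print(taxi.count())
----
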
@@ -141,78 +124,15 @@ image::jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/jupyter_hub_run_noteb
 
 You can also inspect the `hdfs` folder where the `core-site.xml` and `hdfs-site.xml` from the discovery ConfigMap of the HDFS cluster are located.
 
-The image defined for the spark job must contain all dependencies needed for that job to run.
-For PySpark jobs, this will mean that Python libraries either need to be baked into the image or {spark-pkg}[packaged in some other way].
-This demo contains a custom image created from a Dockerfile that is used to generate an image containing scikit-learn, pandas and their dependencies.
-This is described below.
-
-=== Install the libraries into a product image
-
-Libraries can be added to a custom *product* image launched by the notebook. Suppose a Spark job is prepared like this:
-
-// TODO (@NickLarsenNZ): Use stackable0.0.0-dev so that the demo is reproducable for the release
-// and it will be automatically replaced for the release branch.
-// Also update the reference in notebook.ipynb.
-
-[source,python]
-----
-spark = (SparkSession
-    .builder
-    .master(f'k8s://https://{os.environ["KUBERNETES_SERVICE_HOST"]}:{os.environ["KUBERNETES_SERVICE_PORT"]}')
-    .config("spark.kubernetes.container.image", "oci.stackable.tech/stackable/spark-k8s-with-scikit-learn:3.5.0-stackable24.3.0")
-    .config("spark.driver.port", "2222")
-    .config("spark.driver.blockManager.port", "7777")
-    .config("spark.driver.host", "driver-service.default.svc.cluster.local")
-    .config("spark.driver.bindAddress", "0.0.0.0")
-    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
-    .config("spark.kubernetes.authenticate.serviceAccountName", "spark")
-    .config("spark.executor.instances", "4")
-    .config("spark.kubernetes.container.image.pullPolicy", "IfNotPresent")
-    .appName("taxi-data-anomaly-detection")
-    .getOrCreate()
-)
-----
-
-It requires a specific Spark image:
-
-// TODO (@NickLarsenNZ): Use stackable0.0.0-dev so that the demo is reproducable for the release
-// and it will be automatically replaced for the release branch.
-// Also update the reference in notebook.ipynb.
-
-[source,python]
-----
-.config("spark.kubernetes.container.image",
-        "oci.stackable.tech/stackable/spark-k8s-with-scikit-learn:3.5.0-stackable24.3.0"),
-...
-----
-
-This is created by taking a Spark image, in this case `oci.stackable.tech/sdp/spark-k8s:3.5.0-stackable24.3.0`, installing specific python libraries into it
-, and re-tagging the image:
-
-[source,console]
-----
-FROM oci.stackable.tech/sdp/spark-k8s:3.5.0-stackable24.3.0
-
-COPY demos/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/requirements.txt .
-
-RUN pip install --no-cache-dir --upgrade pip && \
-    pip install --no-cache-dir -r ./requirements.txt
-----
-
-Where `requirements.txt` contains:
-
-[source,console]
-----
-scikit-learn==1.3.1
-pandas==2.0.3
-----
-
-NOTE: Using a custom image requires access to a repository where the image can be made available.
+The Python notebook uses libraries such as `pandas` and `scikit-learn` to analyze the data.
+In addition, since the model training is delegated to a Spark Connect server, some of these dependencies, most notably `scikit-learn`, must also be made available on the Spark Connect pods.
+For convenience, this demo uses a custom image that bundles all the required libraries for both the notebook and the Spark Connect server.
+In practice, clients of Spark Connect do not need a full-blown Spark installation available locally, but only the libraries that are used in the notebook.
 
 == Model details
 
 The job uses an implementation of the Isolation Forest {forest-algo}[algorithm] provided by the scikit-learn {scikit-lib}[library]:
-the model is trained and then invoked by a user-defined function (see {forest-article}[this article] for how to call the sklearn library with a pyspark UDF), all of which is run using the Spark executors spun up in the current SparkSession.
+the model is trained and then invoked by a user-defined function (see {forest-article}[this article] for how to call the sklearn library with a pyspark UDF), all of which is run on the Spark Connect executors.
 This type of model attempts to isolate each data point by continually partitioning the data.
 Data closely packed together will require more partitions to separate data points.
 In contrast, any outliers will require less: the number of partitions needed for a particular data point is thus inversely proportional to the anomaly "score".
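
To make that pattern concrete: the model is fit locally in the notebook and then shipped to the Spark Connect executors inside a pandas UDF. A hedged sketch under assumed column names (`trip_distance`, `total_amount`); this illustrates the technique and is not the demo's exact notebook code:

[source,python]
----
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
from sklearn.ensemble import IsolationForest

features = ["trip_distance", "total_amount"]            # assumed column names
sample = taxi.select(*features).sample(0.1).toPandas()  # small local training sample

# Train the Isolation Forest on the sample; contamination is the expected share of anomalies.
model = IsolationForest(contamination=0.01, random_state=42).fit(sample)

@pandas_udf(DoubleType())
def anomaly_score(distance: pd.Series, amount: pd.Series) -> pd.Series:
    # score_samples returns lower values for points that are easier to isolate,
    # i.e. the more negative the score, the more anomalous the trip.
    batch = pd.DataFrame({"trip_distance": distance, "total_amount": amount})
    return pd.Series(model.score_samples(batch))

# The UDF (and the pickled model captured in its closure) runs on the Spark Connect executors.
scored = taxi.withColumn("score", anomaly_score("trip_distance", "total_amount"))
scored.select(*features, "score").orderBy("score").show(5)
----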
