= Stackable Operator for Apache HDFS
:description: The Stackable Operator for Apache HDFS is a Kubernetes operator that can manage Apache HDFS clusters. Learn about its features, resources, dependencies and demos, and see the list of supported HDFS versions.

The Stackable Operator for https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html[Apache HDFS] (Hadoop Distributed File System) is used to set up HDFS in high-availability mode. HDFS is a distributed file system designed to store and manage massive amounts of data across multiple machines in a fault-tolerant manner. The Operator depends on the xref:zookeeper:index.adoc[] to operate a ZooKeeper cluster to coordinate the active and standby NameNodes.

NOTE: This operator only works with images from the https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%2Fhadoop[Stackable] repository

== Getting started

Follow the xref:getting_started/index.adoc[Getting started guide], which guides you through installing the Stackable HDFS and ZooKeeper Operators, setting up ZooKeeper and HDFS, and writing a file to HDFS to verify that everything is set up correctly.

Afterwards you can consult the xref:usage-guide/index.adoc[] to learn more about tailoring your HDFS configuration to your needs, or have a look at the <<demos, demos>> for some example setups.

== Operator model

The Operator manages the _HdfsCluster_ custom resource. The cluster implements three xref:home:concepts:roles-and-role-groups.adoc[roles]:

* DataNode - responsible for storing the actual data.
* JournalNode - responsible for keeping a consistent shared edit log between the NameNodes, used to perform failovers in case the active NameNode fails. For details see: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
* NameNode - responsible for keeping track of HDFS blocks and providing access to the data.
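
For illustration, a minimal HdfsCluster resource defining all three roles might look like the following sketch. The exact schema can differ between operator versions, so treat the field names as assumptions to verify against your installed CRD; `simple-hdfs`, `simple-hdfs-znode` and the product version are placeholders.

[source,yaml]
----
apiVersion: hdfs.stackable.tech/v1alpha1
kind: HdfsCluster
metadata:
  name: simple-hdfs            # placeholder cluster name
spec:
  image:
    productVersion: 3.3.4      # pick one of the supported HDFS versions listed below
  clusterConfig:
    zookeeperConfigMapName: simple-hdfs-znode  # discovery ConfigMap of a ZooKeeper ZNode
  nameNodes:
    roleGroups:
      default:
        replicas: 2            # two NameNodes for high availability
  journalNodes:
    roleGroups:
      default:
        replicas: 1
  dataNodes:
    roleGroups:
      default:
        replicas: 1
----
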
image::hdfs_overview.drawio.svg[A diagram depicting the Kubernetes resources created by the Stackable Operator for Apache HDFS]

The operator creates the following K8S objects per role group defined in the custom resource.

In the custom resource you can specify the number of replicas per role group (NameNodes, DataNodes, JournalNodes).

* 1 JournalNode
* 1 DataNode (should match at least the `clusterConfig.dfsReplication` factor)
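
A partial HdfsCluster spec sketching that relationship between the replication factor and the DataNode count (field paths as in the sketch above; verify them against your operator version):

[source,yaml]
----
spec:
  clusterConfig:
    dfsReplication: 3   # each HDFS block is stored on 3 DataNodes
  dataNodes:
    roleGroups:
      default:
        replicas: 3     # at least as many DataNodes as the replication factor
----
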

The Operator creates a xref:concepts:service_discovery.adoc[service discovery ConfigMap] for the HDFS instance. The discovery ConfigMap contains the `core-site.xml` file and the `hdfs-site.xml` file.
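
For example, a client Pod can mount the discovery ConfigMap to obtain a ready-to-use Hadoop client configuration. The sketch below assumes the HDFS cluster, and therefore its discovery ConfigMap, is named `simple-hdfs`; the container image and mount path are placeholders.

[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: hdfs-client
spec:
  containers:
    - name: client
      image: my-hadoop-client:latest   # placeholder image containing the HDFS CLI
      env:
        - name: HADOOP_CONF_DIR
          value: /hdfs-config          # point Hadoop tools at the mounted config
      volumeMounts:
        - name: hdfs-config
          mountPath: /hdfs-config
  volumes:
    - name: hdfs-config
      configMap:
        name: simple-hdfs              # discovery ConfigMap created by the operator
----
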

== Dependencies

HDFS depends on ZooKeeper for coordination between nodes. You can run a ZooKeeper cluster with the xref:zookeeper:index.adoc[]. Additionally, the xref:commons-operator:index.adoc[] and xref:secret-operator:index.adoc[] are needed.
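
The ZooKeeper cluster is typically consumed through a ZNode, whose discovery ConfigMap is referenced by the HdfsCluster (the `zookeeperConfigMapName` field in the sketch above). A hedged sketch, assuming an existing ZooKeeper cluster named `simple-zk` managed by the ZooKeeper operator:

[source,yaml]
----
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-hdfs-znode   # referenced by the HdfsCluster as zookeeperConfigMapName
spec:
  clusterRef:
    name: simple-zk         # name of an existing ZookeeperCluster
----
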

== [[demos]]Demos

Two demos that use HDFS are available.

**xref:stackablectl::demos/hbase-hdfs-load-cycling-data.adoc[]** loads a dataset of cycling data from S3 into HDFS and then uses HBase to analyze the data.

**xref:stackablectl::demos/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc[]** showcases the integration between HDFS and Jupyter. New York Taxi data is stored in HDFS and analyzed in a Jupyter notebook.

== Supported Versions

The Stackable Operator for Apache HDFS currently supports the following versions of HDFS: