
Install a storage vendor which supports CSI snapshot in preview env #10201


Closed
Tracked by #7901
jenting opened this issue May 23, 2022 · 15 comments · Fixed by #10718
Labels
team: workspace Issue belongs to the Workspace team

Comments

@jenting
Contributor

jenting commented May 23, 2022

Is your feature request related to a problem? Please describe

Install a storage vendor which supports CSI snapshot in preview env.

Describe the behavior you'd like

The workspace team is working on moving backup/restore of user workspace files from S3 to PVC volume snapshot/restore; this addresses epic #7901.

We're making CSI snapshot/restore work in the GCP environment. To ease developers' daily work, we should make CSI snapshot/restore work in the preview environment as well.
However, our testing showed that the local-path-provisioner doesn't support CSI snapshot/restore, so we need to consider another storage vendor that does, and this vendor needs to be installed in the preview environment as well (it could be deployed optionally via a werft annotation).

Criteria for a suitable storage vendor:

  • We could choose one of the storage vendors from https://kubernetes-csi.github.io/docs/drivers.html
  • It should support using an existing directory as storage if possible; if not, an extra block device. (In the current preview environment we don't have enough storage for a raw partition, so it would be good if we could use a directory on the existing file system as storage and avoid finding an extra partition or block device for dedicated storage.)
  • It should support CSI snapshot backup/restore without problems (see the manifest sketch after this list):
    • Create Pod pod-1 with PVC pvc-1, and write some data to the PVC pvc-1.
    • Create VolumeSnapshot vs-1 for the PVC pvc-1; the snapshot backup succeeds and the VolumeSnapshotContent is created.
    • Create another Pod restore-pod-2 and PVC restore-pvc-2 with the VolumeSnapshot vs-1 as data source; verify that the PVC restore-pvc-2 contains the correct data.
    • Delete Pod pod-1 and PVC pvc-1, and delete Pod restore-pod-2 and PVC restore-pvc-2.
    • Create another Pod pod-3 and PVC restore-pvc-3 with the VolumeSnapshot vs-1 as data source; verify that the PVC restore-pvc-3 contains the correct data.
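For illustration, a minimal sketch of the manifests behind these steps (the storage class and snapshot class names are placeholders for whichever vendor we pick; sizes are arbitrary):

# PVC used by pod-1 (storageClassName is a placeholder)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-1
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: vendor-storage-class
  resources:
    requests:
      storage: 1Gi
---
# Snapshot of pvc-1 (volumeSnapshotClassName is a placeholder)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: vs-1
spec:
  volumeSnapshotClassName: vendor-snapshot-class
  source:
    persistentVolumeClaimName: pvc-1
---
# Restore PVC from the snapshot via the CSI dataSource mechanism
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc-2
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: vendor-storage-class
  dataSource:
    name: vs-1
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  resources:
    requests:
      storage: 1Gi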

Describe alternatives you've considered

N/A

Additional context

#7901

We could consider Longhorn as the storage vendor since it's Harvester's default storage and it's easy to deploy within a Kubernetes cluster.

What we need to consider once we enable CSI snapshot/backup support in the preview environment:

  • Is the local disk storage enough? Here is a reference for the existing Harvester disk usage (Longhorn uses the local node's disk as storage; currently each Harvester node has 5T).

[screenshots: Harvester node disk usage]

  • Should we perform garbage collection on orphaned PVC, VolumeSnapshot, and VolumeSnapshotContent objects? Will the replica data still exist on the local disk even if the k3s Pods are deleted? (Probably not, because once the preview environment is cleaned up by the sweeper, the corresponding resources are cleaned up as well.)
@meysholdt
Member

We could consider Longhorn as the storage vendor since it's Harvester's default storage and it's easy to deploy within a Kubernetes cluster.

From my side, I have no objections to Longhorn. It's good that you know it well, @jenting, and that on the Platform side we could collect a bit of experience with it, too.

Is the local disk storage enough
How much storage should be available per preview env? Currently, preview envs request a 200 GB root drive. I understand the newer workspace images are built for smaller root volumes, so that size can probably decrease when we upgrade to a newer workspace image.

With that being said, the Harvester cluster currently has a total storage capacity of slightly above 5TB, and we intend it to run up to 65 preview envs in parallel. Based on these numbers, a preview env may use up to 76 GB on average.

Because disks are currently so much larger than the space actually used on them, we've set Longhorn's "Storage Over Provisioning Percentage" to 1000.
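
For reference, that knob is Longhorn's storage-over-provisioning-percentage setting; a minimal sketch of inspecting and changing it via the Setting CR (this assumes Longhorn's usual longhorn-system namespace, and the same can be done through the Longhorn UI):

# Inspect and update the over-provisioning setting (value mirrors the 1000 above)
kubectl -n longhorn-system get settings.longhorn.io storage-over-provisioning-percentage
kubectl -n longhorn-system patch settings.longhorn.io storage-over-provisioning-percentage \
  --type merge -p '{"value": "1000"}'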

Should we perform garbage collection on orphaned PVC, VolumeSnapshot, VolumeSnapshotContent? Will the replica data still exist on the local disk even if the k3s Pods are deleted? (Probably not, because once the preview environment is cleaned up by the sweeper, the corresponding resources are cleaned up as well.)

I assume the proposal is to install Longhorn inside every preview env VM, and that Gitpod (running inside a preview env) will not interact with the Longhorn instance that's part of Harvester.
Based on that assumption, you're right: there is nothing more to do here for garbage collection, because the preview environment VM already gets garbage collected and everything running inside it (including Longhorn) goes with it. Nit: a Werft job does the GC nowadays; we're retiring sweeper.

@meysholdt
Member

meysholdt commented May 24, 2022

Proposed code changes:
Install Longhorn here via Cloud-init, just like Certmanager:

kubectl apply -f /var/lib/gitpod/manifests/calico2.yaml
kubectl apply -f /var/lib/gitpod/manifests/cert-manager.yaml
kubectl apply -f /var/lib/gitpod/manifests/metrics-server.yaml

Disable the local-path-provisioner here by adding --disable local-storage

/usr/local/bin/install-k3s.sh \
  --token "1234" \
  --node-ip "$(hostname -I | cut -d ' ' -f1)" \
  --node-label "cloud.google.com/gke-nodepool=control-plane-pool" \
  --container-runtime-endpoint=/var/run/containerd/containerd.sock \
  --write-kubeconfig-mode 444 \
  --disable traefik \
  --disable metrics-server \
  --flannel-backend=none \
  --kubelet-arg config=/etc/kubernetes/kubelet-config.json \
  --kubelet-arg feature-gates=LocalStorageCapacityIsolation=true \
  --kubelet-arg feature-gates=LocalStorageCapacityIsolationFSQuotaMonitoring=true \
  --kube-apiserver-arg feature-gates=LocalStorageCapacityIsolation=true \
  --kube-apiserver-arg feature-gates=LocalStorageCapacityIsolationFSQuotaMonitoring=true \
  --cluster-init
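
A minimal sketch of the combined change, assuming the Longhorn manifest gets baked into the VM image next to the existing ones (the longhorn.yaml path is illustrative); the only additions relative to the blocks above are the extra kubectl apply line and the --disable local-storage flag:

# Cloud-init: install Longhorn alongside the existing manifest applies
kubectl apply -f /var/lib/gitpod/manifests/longhorn.yaml   # new (assumed path)

# k3s install: same invocation as above, plus --disable local-storage to turn
# off the built-in local-path-provisioner
/usr/local/bin/install-k3s.sh \
  --token "1234" \
  --node-ip "$(hostname -I | cut -d ' ' -f1)" \
  --node-label "cloud.google.com/gke-nodepool=control-plane-pool" \
  --container-runtime-endpoint=/var/run/containerd/containerd.sock \
  --write-kubeconfig-mode 444 \
  --disable traefik \
  --disable metrics-server \
  --disable local-storage \
  --flannel-backend=none \
  --kubelet-arg config=/etc/kubernetes/kubelet-config.json \
  --kubelet-arg feature-gates=LocalStorageCapacityIsolation=true \
  --kubelet-arg feature-gates=LocalStorageCapacityIsolationFSQuotaMonitoring=true \
  --kube-apiserver-arg feature-gates=LocalStorageCapacityIsolation=true \
  --kube-apiserver-arg feature-gates=LocalStorageCapacityIsolationFSQuotaMonitoring=true \
  --cluster-init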

@jenting jenting moved this to In Progress in 🌌 Workspace Team May 25, 2022
@jenting
Contributor Author

jenting commented May 26, 2022

On further thought: with Longhorn v1.2.4, the current CSI snapshot/backup behavior backs up/restores PVC content to/from a remote S3 bucket, so it still relies on S3.

Since we want PVC content backup/restore to live on the local disk only, we need to wait for the Longhorn v1.3.0 release (approximately June 09, 2022). Therefore, I'll either check other storage vendors such as Rook Ceph, OpenEBS, etc., or wait for the Longhorn v1.3.0 release.

@jenting jenting added the team: workspace Issue belongs to the Workspace team label May 30, 2022
@jenting
Contributor Author

jenting commented Jun 1, 2022

Installed the Longhorn v1.3.0-rc2 pre-release and applied the following VolumeSnapshotClass to the cluster in the workspace preview env.

kind: VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1beta1
metadata:
  name: longhorn-snapshot-vsc
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: snap

Using the latest main branch, the backup through the volume snapshot controller works as expected.
But restoring from the backup does not work. Will check tomorrow.
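
For completeness, the VolumeSnapshotClass above pairs with a Longhorn StorageClass for the PVCs themselves; a minimal sketch (the parameter values are illustrative, not necessarily what was deployed):

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn
provisioner: driver.longhorn.io
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "1"       # single-node preview env, so one replica
  staleReplicaTimeout: "30"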

@jenting
Contributor Author

jenting commented Jun 7, 2022

The problem is that if the original PV/PVC is gone, restoring a PVC from the VolumeSnapshot fails even though the VolumeSnapshot/VolumeSnapshotContent still exist, because Longhorn is unable to recreate the PV. The PVC's events show:

  Warning  ProvisioningFailed    72s                driver.longhorn.io_csi-provisioner-869bdc4b79-rnp2r_52143314-b554-4361-818f-7267231ecee3  failed to provision volume with StorageClass "longhorn": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [message=unable to create volume: unable to create volume pvc-c718f270-3672-488b-948d-7253611f4fad: failed to verify data source: cannot get client for volume pvc-750f59f7-81d4-4716-a55c-6eacc4bec5d6: engine is not running, code=Server Error, detail=] from [http://longhorn-backend:9500/v1/volumes]
  Warning  ProvisioningFailed    56s (x2 over 85s)  driver.longhorn.io_csi-provisioner-869bdc4b79-rnp2r_52143314-b554-4361-818f-7267231ecee3  failed to provision volume with StorageClass "longhorn": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [detail=, message=unable to create volume: unable to create volume pvc-c718f270-3672-488b-948d-7253611f4fad: failed to verify data source: cannot get client for volume pvc-750f59f7-81d4-4716-a55c-6eacc4bec5d6: engine is not running, code=Server Error] from [http://longhorn-backend:9500/v1/volumes]
  Normal   Provisioning          24s (x7 over 85s)  driver.longhorn.io_csi-provisioner-869bdc4b79-rnp2r_52143314-b554-4361-818f-7267231ecee3  External provisioner is provisioning volume for claim "longhorn-system/test-restore-pvc"
  Warning  ProvisioningFailed    24s (x4 over 85s)  driver.longhorn.io_csi-provisioner-869bdc4b79-rnp2r_52143314-b554-4361-818f-7267231ecee3  failed to provision volume with StorageClass "longhorn": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=Server Error, detail=, message=unable to create volume: unable to create volume pvc-c718f270-3672-488b-948d-7253611f4fad: failed to verify data source: cannot get client for volume pvc-750f59f7-81d4-4716-a55c-6eacc4bec5d6: engine is not running] from [http://longhorn-backend:9500/v1/volumes]
  Normal   ExternalProvisioning  9s (x8 over 85s)   persistentvolume-controller                                                               waiting for a volume to be created, either by external provisioner "driver.longhorn.io" or manually created by system administrator

longhorn/longhorn#4083

@kylos101
Contributor

kylos101 commented Jun 7, 2022

Thank you for creating the bug with Longhorn, @jenting . 🙏

@jenting jenting changed the title from "Install a storage vendor which support CSI snapshot in preview env" to "Install a storage vendor which supports CSI snapshot in preview env" Jun 8, 2022
@jenting jenting moved this from In Progress to Scheduled in 🌌 Workspace Team Jun 8, 2022
@jenting
Contributor Author

jenting commented Jun 8, 2022

Moving this back to Scheduled; working on other, more important issues that would benefit our customers.

@kylos101
Contributor

kylos101 commented Jun 8, 2022

@meysholdt does Platform have bandwidth to help own this issue? 🤔 🙏 As you can see, we're having trouble with Longhorn, and are considering other options.

@meysholdt
Member

@jenting @kylos101 if we try a storage vendor that implements the CSI, how can we test whether it fulfills all your requirements?

@kylos101
Contributor

kylos101 commented Jun 10, 2022

@jenting can you share a detailed plan with @meysholdt for how you were doing Gitpod setup and testing of PVC in preview environments (after having installed the CSI driver)?

For Gitpod setup, after having installed the CSI driver (not trivial, and maturity varies) and prepared a storage class, what else is needed aside from enabling the PVC feature flag in Gitpod for a user to test? I assume you must also configure the storage class that we would like Gitpod to use.

For testing, I assume you were trying workspace start and stop, and when it does not work, looking at the related PVC and snapshotter objects/events/logs.

@jenting
Contributor Author

jenting commented Jun 11, 2022

@jenting @kylos101 if we try a storage vendor that implements the CSI, how can we test whether it fulfills all your requirements?

I've updated this issue's description with the criteria we require.

Personally, I'd prefer a storage vendor that supports using a directory on the existing file system as storage, because then we don't have to create another partition or block device.

@kylos101
Contributor

Thank you, @jenting for the detailed description 🙏 ! I agree, using the existing file system as storage would be ideal 💡 .

@kylos101
Contributor

Should be all set, @meysholdt , @jenting updated this issue's description with supporting detail.

@jenting
Contributor Author

jenting commented Jun 17, 2022

Installed Rook/Ceph; it works well and the CSI behavior is what we want. Here are the Rook/Ceph installation steps.

  1. In the Harvester GUI, create a new 30Gi volume (it matches Gitpod's default storage size for regular workspaces) within the same namespace.

  2. Attach the volume to the existing VM.

  3. Clone the Rook repository (we deploy it in single-node test mode).

    git clone --single-branch --branch v1.9.5 https://github.com/rook/rook.git
    cd rook/deploy/examples
  4. Edit operator.yaml, set ROOK_CSI_ENABLE_CEPHFS to false.

  5. Install the Rook/Ceph.

    kubectl create -f crds.yaml -f common.yaml -f operator.yaml
    kubectl create -f cluster-test.yaml
  6. Install the StorageClass

    cd csi/rbd
    sed -i 's/allowVolumeExpansion: true/allowVolumeExpansion: false/g' storageclass-test.yaml
    echo "volumeBindingMode: WaitForFirstConsumer" >> storageclass-test.yaml
    kubectl create -f storageclass-test.yaml
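
Before running the test below, a quick sanity check that the Rook pods are up and the StorageClass exists (a sketch; rook-ceph is the namespace used by Rook's example manifests):

# Verify the Rook operator and cluster pods are running, and the StorageClass exists
kubectl -n rook-ceph get pods
kubectl get storageclass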

Test

  1. Create Pod + PVC.
    cd csi/rbd
    kubectl create -f pod.yaml -f pvc.yaml
  2. Write some data to PVC.
    kubectl exec -it pod/csirbd-demo-pod -- dd if=/dev/urandom of=/var/lib/www/html/16M count=1 bs=16M
  3. Calculate the sha256sum.
    kubectl exec -it pod/csirbd-demo-pod -- sha256sum /var/lib/www/html/16M
  4. Create VolumeSnapshotClass.
    kubectl create -f snapshotclass.yaml
  5. Create a VolumeSnapshot for the PVC and wait for the VolumeSnapshot to become Ready.
    kubectl create -f snapshot.yaml
    kubectl get vs/rbd-pvc-snapshot -w
  6. Delete Pod + PVC.
    kubectl delete -f pod.yaml -f pvc.yaml
  7. Restore the PVC from the VolumeSnapshot (a sketch of the restore manifest follows this list).
    kubectl create -f pvc-restore.yaml
  8. Create a Pod to use the restored PVC.
    cp pod.yaml pod-restore.yaml
    sed -i 's/rbd-pvc/rbd-pvc-restore/g' pod-restore.yaml
    sed -i 's/csirbd-demo-pod/csirbd-demo-pod-restore/g' pod-restore.yaml
    kubectl create -f pod-restore.yaml
  9. Check the content is there and correct.
    kubectl exec -it pod/csirbd-demo-pod-restore -- sha256sum /var/lib/www/html/16M
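
For reference, the restore in step 7 uses the standard CSI dataSource mechanism; a PVC of roughly this shape is what pvc-restore.yaml creates (a sketch: the storageClassName is an assumption based on Rook's examples, and the snapshot name matches step 5):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc-restore
spec:
  storageClassName: rook-ceph-block   # assumed name from Rook's example StorageClass
  accessModes:
    - ReadWriteOnce
  dataSource:
    name: rbd-pvc-snapshot            # the VolumeSnapshot created in step 5
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  resources:
    requests:
      storage: 1Gi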

@jenting jenting moved this from Scheduled to In Progress in 🌌 Workspace Team Jun 17, 2022
Repository owner moved this from In Progress to Done in 🌌 Workspace Team Jun 17, 2022
Repository owner moved this from Scheduled to Done in ☁️ DevX by 🚚 Delivery and Operations Experience Team Jun 17, 2022
@kylos101
Contributor

Thank you, @jenting ! 🙏
