
VolumeAttachment is not able to detach from removed node after gcloud compute instances delete or gcloud compute instances simulate-maintenance-event command is run #987

Closed
dllegru opened this issue May 13, 2022 · 22 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@dllegru

dllegru commented May 13, 2022

What happened:
After running any of the following actions:

  • gcloud --project <project-id> compute instances delete <node-id-to-delete> --zone=<zone-id>
  • gcloud --project <project-id> compute instances simulate-maintenance-event <node-id-to-delete> --zone=<zone-id>
  • GCP preempts a node running on a spot instance.

The following happens:

  • The selected <node-id-to-delete> gets removed from the GKE cluster as expected.
  • Pods that were running with a PVC attached on the removed node get evicted and scheduled onto a new available node from the pool.
  • Pods get stuck initializing on the assigned node:
    • Status: Pending
    • State: Waiting
      • Reason: PodInitializing
    • Events:
      Warning  FailedMount  96s (x6 over 71m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[infrastructure-prometheus], unattached volumes=[config config-out prometheus-infrastructure-rulefiles-0 kube-api-access-cvf76 tls-assets infrastructure-prometheus web-config]: timed out waiting for the condition
  • The VolumeAttachment still shows attached: true for the previously deleted node, with a detachError:
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  annotations:
    csi.alpha.kubernetes.io/node-id: projects/<project-id>/zones/europe-west1-d/instances/<node-id-to-delete>
  creationTimestamp: "2022-05-13T01:42:43Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-05-13T07:22:53Z"
  finalizers:
  - external-attacher/pd-csi-storage-gke-io
  name: csi-51371274f186e0e259e907a06cfe6d4d5ff27c2079a097caf29c883424efe9ee
  resourceVersion: "320247861"
  uid: 14d1f4c7-f05e-4eb0-9a6b-4a8e1463e8df
spec:
  attacher: pd.csi.storage.gke.io
  nodeName: <node-id-to-delete>
  source:
    persistentVolumeName: pvc-65e91226-24d6-4308-b35b-d29b2026ffff
status:
  attached: true
  detachError:
    message: 'rpc error: code = Unavailable desc = Request queued due to error condition
      on node'
    time: "2022-05-13T10:42:38Z"
  • The Pod stays stuck in the init state permanently. The only way to fix it is to manually edit the VolumeAttachment and delete the finalizers entry external-attacher/pd-csi-storage-gke-io (see the kubectl sketch below).
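An equivalent way to do that cleanup from the CLI (the attachment name is the one from the example above; substitute whichever VolumeAttachment is stuck in your cluster, and kubectl edit works just as well):

  # find VolumeAttachments still bound to the deleted node
  kubectl get volumeattachments
  # clear the finalizer so the stale attachment object can be removed
  kubectl patch volumeattachment csi-51371274f186e0e259e907a06cfe6d4d5ff27c2079a097caf29c883424efe9ee \
    --type=merge -p '{"metadata":{"finalizers":null}}'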

What you expected to happen:

The VolumeAttachment should be detached from the removed node after the node is deleted with either of the gcloud CLI commands compute instances delete or compute instances simulate-maintenance-event.

Environment:

  • GKE Rev: v1.22.8-gke.200
  • csi-node-driver-registrar: v2.5.0-gke.1
  • gcp-compute-persistent-disk-csi-driver: v1.5.1-gke.0
@dllegru
Author

dllegru commented May 13, 2022

/kind bug

@k8s-ci-robot added the kind/bug label May 13, 2022
@mattcary
Contributor

/cc @saikat-royc
/cc @amacaskill

This looks like the same issue as #960 (and it is fixed by the same change).

@mattcary
Contributor

And #988

@saikat-royc
Member

Yes it is, and it should be fixed by #988.

@himadrisingh

This even happened when the CA (cluster autoscaler) killed a few of our nodes and the new nodes failed.
Even with spot instances terminated by GCP, the new nodes were not able to attach the same volume.

@mattcary
Contributor

This even happened when the CA (cluster autoscaler) killed a few of our nodes and the new nodes failed. Even with spot instances terminated by GCP, the new nodes were not able to attach the same volume.

Yes, that's consistent with what we've seen too. The fix mentioned above should deal with that.

@saikat-royc
Member

Fixed by #988

@keramblock

keramblock commented Jun 6, 2022

Hi @saikat-royc, could you provide an ETA for the GA of this fix? Right now I have to manually fix the Prometheus PVC every time the spot instance running it is preempted by GCP, and I believe I'm hitting the same issue.

@kekscode

kekscode commented Jun 6, 2022

@keramblock thank you for asking. I am also waiting, and it's a major pain (labor- and cost-wise) to mitigate the issue on a daily basis.

@dllegru
Author

dllegru commented Jun 6, 2022

@keramblock @kekscode I have all my GKE clusters running on v1.22.8-gke.200, and GCP rolled back gcp-compute-persistent-disk-csi-driver to v1.3 and v1.4 a couple of weeks ago, so I'm no longer having these issues that were induced by v1.5.
Also, what I understood from GCP is that they rolled it back everywhere, so theoretically you should not have v1.5 of the pd-driver running on your GKE clusters; in case you do, I would suggest opening a support ticket with them.

@mattcary
Contributor

mattcary commented Jun 6, 2022

+1 to @dllegru's comments. We have rolled back the PD CSI component to 1.3.4 on 1.22 clusters; look for the 0.11.4 component version stamped on the pdcsi-node pods (not the DaemonSet, just the pods).
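One way to check that stamp (the pod name is a placeholder; any of the pdcsi-node pods will do):

  kubectl -n kube-system get pods | grep pdcsi-node
  # the component version shows up in the pod's annotations
  kubectl -n kube-system describe pod <pdcsi-node-pod-name> | grep components.gke.io/component-version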

The 0.11.7 component that has the 1.7.0 driver is enabled for newly created 1.22 clusters, and we're rolling out automatic upgrades over the next week.

@keramblock

@mattcary thanks for the answer, will 1.23.5 also be updated?

@mattcary
Contributor

mattcary commented Jun 6, 2022

We did not roll back the driver in 1.23 as there are new features in it that some customers are testing.

1.23 uses the 0.12 pdcsi component (yes, I know you were worried we wouldn't have enough different version numbers floating around... :-P). The 0.12.2 component is the bad one, the 0.12.4 component is the good one with the 1.7.0 driver. New 1.23 clusters should already be getting the new component; the auto-upgrades are being rolled out at the same time as for 1.22.

@kekscode

kekscode commented Jun 6, 2022

@mattcary and @dllegru thanks for the clarification. I switched channels and am currently on 1.23.6-gke.1500, because my idea was that the rapid channel would get the fix earlier than the regular channels. So I guess this turned out to be a dead end for me at the moment, right?

The pdcsi-node pods indeed have the bad version:

annotations:                                                                                                                                                                         
  components.gke.io/component-name: pdcsi                                                                                                                                            
  components.gke.io/component-version: 0.12.2   

@mattcary
Contributor

mattcary commented Jun 6, 2022

Yup :-/

@pdfrod

pdfrod commented Jun 8, 2022

I'm in the same boat as @kekscode. Any estimate of when 1.23 will get the fix? Or is there any way to force the correct driver version? I tried to edit the pdcsi-node DaemonSet to set the good driver version, but my changes get immediately reverted.

@dllegru
Author

dllegru commented Jun 8, 2022

@kekscode @pdfrod in case it helps: a possible workaround I had in mind (but in the end didn't need to apply) is to disable automatic management of the Compute Engine persistent disk CSI driver in GKE and then deploy the driver into the cluster manually. I never got to test it, but theoretically it should work, and it would let you pin the pd-csi driver to the desired version; a rough sketch follows below.
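A rough, untested sketch of that route (project, cluster name, and zone are placeholders):

  # disable the GKE-managed PD CSI driver add-on
  gcloud --project <project-id> container clusters update <cluster-name> \
    --zone=<zone-id> --update-addons=GcePersistentDiskCsiDriver=DISABLED
  # then deploy the driver manually at a pinned version, e.g. from this repo's
  # deploy/kubernetes scripts checked out at the desired release tag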

@kekscode

kekscode commented Jun 8, 2022

I played with the idea too, @dllegru. I'm just a bit afraid I'll run into other issues and then run out of time for debugging them.

Something else: In the web console in the cluster settings in „Details“ the CSI driver can be disabled and this should then use the „gcePersistentDisk in-tree volume plugin“. I don’t know enough about GCE/GKE to understand the implications, but could this be a workaround?

@mattcary
Contributor

mattcary commented Jun 8, 2022

Something else: In the web console in the cluster settings in „Details“ the CSI driver can be disabled and this should then use the „gcePersistentDisk in-tree volume plugin“. I don’t know enough about GCE/GKE to understand the implications, but could this be a workaround?

No. In GKE 1.22+, CSI Migration is enabled, which means that the gce-pd provisioner uses the PD CSI driver as a back end. You either need to use the managed PD CSI driver or manually deploy one as @dllegru mentioned.

The upgrade config is rolling out and is about 15% done (it rolls out zone by zone over the course of a week). Unfortunately, there's no good way to tell when your zone has the upgrade until your master is able to upgrade.

@michaelniemand

michaelniemand commented Jun 10, 2022

Something else: In the web console in the cluster settings in „Details“ the CSI driver can be disabled and this should then use the „gcePersistentDisk in-tree volume plugin“. I don’t know enough about GCE/GKE to understand the implications, but could this be a workaround?

@kekscode: I tried deactivating it, which didn't work, as @mattcary said. I then activated the feature again, which appears to have fixed it for now.

Would be great if someone could verify these steps as a temporary workaround:

  1. deactivate Compute Engine persistent disk CSI Driver, wait for changes to propagate
  2. activate Compute Engine persistent disk CSI Driver, wait for changes to propagate
  3. wait for Pods to come back

It worked for me and all Pods came back. I deleted one of the VolumeAttachments manually (had to delete the finalizer) between steps 1 and 2, but I'm not sure if that is related; it still showed as being attached to a now non-existing node.
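For anyone who prefers the CLI over the web console, steps 1 and 2 would look roughly like this (cluster name and zone are placeholders):

  gcloud container clusters update <cluster-name> --zone=<zone-id> \
    --update-addons=GcePersistentDiskCsiDriver=DISABLED
  # wait for the pdcsi-node pods to disappear, then re-enable
  gcloud container clusters update <cluster-name> --zone=<zone-id> \
    --update-addons=GcePersistentDiskCsiDriver=ENABLED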

@kekscode

I am on v1.24.0-gke.1801 now and the component version for CSI is 0.12.4. So far, no more issues with ephemeral nodes and volume attachments. I hope this works now.

@mattcary
Contributor

Thanks for the data point @kekscode. We hope it works now too :-) It's useful to hear your confirmation.
