
VolumeAttachment is not able to detach from removed node after gcloud compute instances delete or gcloud compute instances simulate-maintenance-event command is run #987

Closed
dllegru opened this issue May 13, 2022 · 22 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@dllegru

dllegru commented May 13, 2022

What happened:
After running any of the following actions:

  • gcloud --project <project-id> compute instances delete <node-id-to-delete> --zone=<zone-id>
  • gcloud --project <project-id> compute instances simulate-maintenance-event <node-id-to-delete> --zone=<zone-id>
  • GCP preempts a node running on a spot instance.

The following happens:

  • The selected <node-id-to-delete> gets removed from the GKE cluster as expected.
  • Pods that were running with a PVC attached on the removed node get evicted and scheduled onto a new available node from the pool.
  • Pods get stuck initializing on the assigned node:
    • Status: Pending
    • State: Waiting
      • Reason: PodInitializing
    • Events:
      Warning  FailedMount  96s (x6 over 71m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[infrastructure-prometheus], unattached volumes=[config config-out prometheus-infrastructure-rulefiles-0 kube-api-access-cvf76 tls-assets infrastructure-prometheus web-config]: timed out waiting for the condition
  • The VolumeAttachment still shows attached: true for the previously deleted node, with a detachError:
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  annotations:
    csi.alpha.kubernetes.io/node-id: projects/<project-id>/zones/europe-west1-d/instances/<node-id-to-delete>
  creationTimestamp: "2022-05-13T01:42:43Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-05-13T07:22:53Z"
  finalizers:
  - external-attacher/pd-csi-storage-gke-io
  name: csi-51371274f186e0e259e907a06cfe6d4d5ff27c2079a097caf29c883424efe9ee
  resourceVersion: "320247861"
  uid: 14d1f4c7-f05e-4eb0-9a6b-4a8e1463e8df
spec:
  attacher: pd.csi.storage.gke.io
  nodeName: <node-id-to-delete>
  source:
    persistentVolumeName: pvc-65e91226-24d6-4308-b35b-d29b2026ffff
status:
  attached: true
  detachError:
    message: 'rpc error: code = Unavailable desc = Request queued due to error condition
      on node'
    time: "2022-05-13T10:42:38Z"
  • The Pod stays stuck in the init state permanently. The only way to fix it is to manually edit the VolumeAttachment and delete the finalizers entry external-attacher/pd-csi-storage-gke-io (see the kubectl sketch below).
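An equivalent way to do that cleanup from the CLI (the attachment name is the one from the example above; substitute whichever VolumeAttachment is stuck in your cluster, and kubectl edit works just as well):

  # find VolumeAttachments still bound to the deleted node
  kubectl get volumeattachments
  # clear the finalizer so the stale attachment object can be removed
  kubectl patch volumeattachment csi-51371274f186e0e259e907a06cfe6d4d5ff27c2079a097caf29c883424efe9ee \
    --type=merge -p '{"metadata":{"finalizers":null}}'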

What you expected to happen:

The VolumeAttachment should be detached from the removed node after the node is deleted with either of the gcloud CLI commands compute instances delete or compute instances simulate-maintenance-event.

Environment:

  • GKE Rev: v1.22.8-gke.200
  • csi-node-driver-registrar: v2.5.0-gke.1
  • gcp-compute-persistent-disk-csi-driver: v1.5.1-gke.0
@dllegru
Author

dllegru commented May 13, 2022

/kind bug

@k8s-ci-robot added the kind/bug label May 13, 2022
@mattcary
Contributor

/cc @saikat-royc
/cc @amacaskill

This looks like the same issue as #960 (and it is fixed by the same change).

@mattcary
Contributor

And #988

@saikat-royc
Member

Yes it is, and it should be fixed by #988.

@himadrisingh

This even happened when the CA (cluster autoscaler) killed a few of our nodes and the new nodes failed.
Even with spot instances terminated by GCP, the new nodes were not able to attach the same volume.

@mattcary
Contributor

This even happened when the CA (cluster autoscaler) killed a few of our nodes and the new nodes failed. Even with spot instances terminated by GCP, the new nodes were not able to attach the same volume.

Yes, that's consistent with what we've seen too. The fix mentioned above should deal with that.

@saikat-royc
Member

Fixed by #988

@keramblock

keramblock commented Jun 6, 2022

Hi @saikat-royc, could you provide an ETA for the GA of this fix? Right now I have to manually fix the Prometheus PVC every time the spot instance running it is preempted by GCP, and I believe I'm hitting the same issue.

@kekscode

kekscode commented Jun 6, 2022

@keramblock thank you for asking. I am also waiting, and it's a major pain (labor- and cost-wise) to mitigate the issue on a daily basis.

@dllegru
Author

dllegru commented Jun 6, 2022

@keramblock @kekscode I have all my GKE clusters running on v1.22.8-gke.200, and GCP rolled back gcp-compute-persistent-disk-csi-driver to v1.3 and v1.4 a couple of weeks ago, so I'm no longer having these issues that were induced by v1.5.
Also, what I understood from GCP is that they rolled it back everywhere, so theoretically you should not have v1.5 of the pd-driver running on your GKE clusters; in case you do, I would suggest opening a support ticket with them.

@mattcary
Contributor

mattcary commented Jun 6, 2022

+1 to @dllegru's comments. We have rolled back the PD CSI component to 1.3.4 on 1.22 clusters; look for the 0.11.4 component version stamped on the pdcsi-node pods (not the DaemonSet, just the pods).
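One way to check that stamp (the pod name is a placeholder; any of the pdcsi-node pods will do):

  kubectl -n kube-system get pods | grep pdcsi-node
  # the component version shows up in the pod's annotations
  kubectl -n kube-system describe pod <pdcsi-node-pod-name> | grep components.gke.io/component-version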

The 0.11.7 component that has the 1.7.0 driver is enabled for newly created 1.22 clusters, and we're rolling out automatic upgrades over the next week.

@keramblock

@mattcary thanks for the answer, will 1.23.5 also be updated?

@mattcary
Contributor

mattcary commented Jun 6, 2022

We did not roll back the driver in 1.23 as there are new features in it that some customers are testing.

1.23 uses the 0.12 pdcsi component (yes, I know you were worried we wouldn't have enough different version numbers floating around... :-P). The 0.12.2 component is the bad one, the 0.12.4 component is the good one with the 1.7.0 driver. New 1.23 clusters should already be getting the new component; the auto-upgrades are being rolled out at the same time as for 1.22.

@kekscode

kekscode commented Jun 6, 2022

@mattcary and @dllegru thanks for the clarification. I switched channels and am currently on 1.23.6-gke.1500, because my idea was that the rapid channel would get the fix earlier than the regular channels. So I guess this turned out to be a dead end for me at the moment, right?

The pdcsi-node pods indeed have the bad version:

annotations:                                                                                                                                                                         
  components.gke.io/component-name: pdcsi                                                                                                                                            
  components.gke.io/component-version: 0.12.2   

@mattcary
Contributor

mattcary commented Jun 6, 2022

Yup :-/

@pdfrod

pdfrod commented Jun 8, 2022

I'm in the same boat as @kekscode. Any estimate of when 1.23 will get the fix? Or is there any way to force the correct driver version? I tried to edit the pdcsi-node DaemonSet to set the good driver version, but my changes get immediately reverted.

@dllegru
Author

dllegru commented Jun 8, 2022

@kekscode @pdfrod in case it helps: a possible workaround I had in mind (but in the end didn't need to apply) is to disable automatic management of the Compute Engine persistent disk CSI driver in GKE and then deploy the driver into the cluster manually. I never got to test it, but theoretically it should work, and it would let you pin the pd-csi driver to the desired version; a rough sketch follows below.
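A rough, untested sketch of that route (project, cluster name, and zone are placeholders):

  # disable the GKE-managed PD CSI driver add-on
  gcloud --project <project-id> container clusters update <cluster-name> \
    --zone=<zone-id> --update-addons=GcePersistentDiskCsiDriver=DISABLED
  # then deploy the driver manually at a pinned version, e.g. from this repo's
  # deploy/kubernetes scripts checked out at the desired release tag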

@kekscode

kekscode commented Jun 8, 2022

I played with the idea too, @dllegru. I'm just a bit afraid I'll run into other issues and then run out of time for debugging them.

Something else: In the web console in the cluster settings in „Details“ the CSI driver can be disabled and this should then use the „gcePersistentDisk in-tree volume plugin“. I don’t know enough about GCE/GKE to understand the implications, but could this be a workaround?

@mattcary
Contributor

mattcary commented Jun 8, 2022

Something else: In the web console in the cluster settings in „Details“ the CSI driver can be disabled and this should then use the „gcePersistentDisk in-tree volume plugin“. I don’t know enough about GCE/GKE to understand the implications, but could this be a workaround?

No. In GKE 1.22+, CSI Migration is enabled, which means that the gce-pd provisioner uses the PD CSI driver as a back end. You either need to use the managed PD CSI driver or manually deploy one as @dllegru mentioned.

The upgrade config is rolling out and is about 15% done (it rolls out zone by zone over the course of a week). Unfortunately, there's no good way to tell when your zone has the upgrade until your master is able to upgrade.

@michaelniemand

michaelniemand commented Jun 10, 2022

Something else: In the web console in the cluster settings in „Details“ the CSI driver can be disabled and this should then use the „gcePersistentDisk in-tree volume plugin“. I don’t know enough about GCE/GKE to understand the implications, but could this be a workaround?

@kekscode: I tried deactivating it, which didn't work, as @mattcary said. I then activated the feature again, which appears to have fixed it for now.

Would be great if someone could verify these steps as a temporary workaround:

  1. deactivate Compute Engine persistent disk CSI Driver, wait for changes to propagate
  2. activate Compute Engine persistent disk CSI Driver, wait for changes to propagate
  3. wait for Pods to come back

It worked for me and all Pods came back. I deleted one of the VolumeAttachments manually (had to delete the finalizer) between steps 1 and 2, but I'm not sure if that is related; it still showed as being attached to a now non-existing node.
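For anyone who prefers the CLI over the web console, steps 1 and 2 would look roughly like this (cluster name and zone are placeholders):

  gcloud container clusters update <cluster-name> --zone=<zone-id> \
    --update-addons=GcePersistentDiskCsiDriver=DISABLED
  # wait for the pdcsi-node pods to disappear, then re-enable
  gcloud container clusters update <cluster-name> --zone=<zone-id> \
    --update-addons=GcePersistentDiskCsiDriver=ENABLED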

@kekscode

I am on v1.24.0-gke.1801 now and the component version for CSI is 0.12.4. So far, no more issues with ephemeral nodes and volume attachments. I hope this works now.

@mattcary
Contributor

Thanks for the data point @kekscode. We hope it works now too :-) It's useful to hear your confirmation.
