VolumeAttachment is not able to detach from removed node after gcloud compute instances delete or gcloud compute instances simulate-maintenance-event command is run (#987)
Comments
/kind bug
/cc @saikat-royc This looks the same as the issue in #960 (and is fixed by the same change).
And #988.
Yes, it is, and it should be fixed by #988.
It even happened when the CA killed a few of our nodes; the new nodes failed.
Yes, that's consistent with what we've seen too. The fix mentioned above should deal with that.
Fixed by #988.
Hi @saikat-royc, could you provide an ETA for GA of this fix? Right now I need to manually fix the Prometheus PVC every time the spot instance running it is revoked by GCP, and I believe I have the same issue.
@keramblock thank you for asking. I am also waiting, and it's a major pain (labor- and cost-wise) to mitigate the issue on a daily basis.
@keramblock @kekscode I have all my GKE clusters running on
+1 to @dllegru's comments. We have rolled back the PD CSI component to 1.3.4 on 1.22 clusters; look for the 0.11.4 component version stamped on the pdcsi-node pods (not the DaemonSet, just the pods). The 0.11.7 component that has the 1.7.0 driver is enabled for newly created 1.22 clusters, and we're rolling out automatic upgrades over the next week.
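A quick way to check which component version a cluster's pdcsi-node pods carry is to read the components.gke.io/component-version annotation that appears later in this thread. A minimal sketch, assuming kubectl access and that the node pods keep the pdcsi-node name prefix used here:

# List each pdcsi-node pod together with its GKE component-version annotation
# (pod name prefix and annotation key are taken from this thread; adjust if yours differ).
kubectl get pods -n kube-system -o name | grep pdcsi-node | while read -r pod; do
  kubectl -n kube-system get "$pod" \
    -o jsonpath='{.metadata.name}{"\t"}{.metadata.annotations.components\.gke\.io/component-version}{"\n"}'
done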
@mattcary thanks for the answer. Will 1.23.5 also be updated?
We did not roll back the driver in 1.23, as there are new features in it that some customers are testing. 1.23 uses the 0.12 pdcsi component (yes, I know you were worried we wouldn't have enough different version numbers floating around... :-P). The 0.12.2 component is the bad one; the 0.12.4 component is the good one with the 1.7.0 driver. New 1.23 clusters should already be getting the new component; the auto-upgrades are being rolled out at the same time as for 1.22.
@mattcary and @dllegru thanks for the clarification. I switched channels, and my pdcsi-node pods indeed have the bad version:
annotations:
  components.gke.io/component-name: pdcsi
  components.gke.io/component-version: 0.12.2
Yup :-/
I'm in the same boat as @kekscode. Any estimate of when 1.23 will get the fix? Or is there any way to force the correct driver version? I tried to edit the pdcsi-node DaemonSet to set the good driver version, but my changes get immediately reverted.
@kekscode @pdfrod in case it helps: a possible workaround I had in mind (but in the end didn't need to apply) is to disable the managed PD CSI driver and deploy the driver manually.
I played with the idea too, @dllegru. I'm just a bit afraid I will run into other issues and then run out of time for debugging them. Something else: in the web console, under the cluster's "Details" settings, the CSI driver can be disabled, and this should then use the "gcePersistentDisk in-tree volume plugin". I don't know enough about GCE/GKE to understand the implications, but could this be a workaround?
No. In GKE 1.22+, CSI Migration is enabled, which means that the gce-pd provisioner uses the PD CSI driver as a back end. You either need to use the managed PD CSI driver or manually deploy one as @dllegru mentioned. The upgrade config is rolling out and is about 15% done (it rolls out zone by zone over the course of a week). Unfortunately there's no good way to tell when your zone has the upgrade, until your master is able to upgrade.
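To see this in practice, you can check which provisioner each StorageClass references; with CSI Migration enabled, volumes created through the legacy in-tree provisioner are still served by the PD CSI driver under the hood. A minimal sketch (the provisioner names are the standard in-tree and CSI identifiers, not something specific to this thread):

# Show the provisioner behind each StorageClass; kubernetes.io/gce-pd is the
# legacy in-tree name, pd.csi.storage.gke.io is the PD CSI driver.
kubectl get storageclass -o custom-columns='NAME:.metadata.name,PROVISIONER:.provisioner'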
@kekscode: I tried deactivating it, which didn't work, as @mattcary said. I then activated the feature again, which appears to have fixed it for now. It would be great if someone could verify these steps as a temporary workaround: 1. deactivate the managed CSI driver; 2. reactivate it.
It worked for me and all Pods came back. I deleted one of the VolumeAttachments manually (had to delete the finalizer) between steps 1 and 2, but I'm not sure if this is related. You could see it still showed as attached to a now non-existent node.
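For reference, toggling the managed driver from the CLI instead of the web console would look roughly like the sketch below. The GcePersistentDiskCsiDriver add-on name and flags should be verified against the gcloud documentation for your version, and <cluster-name> and <zone-id> are placeholders:

# Disable, then re-enable, the managed PD CSI driver add-on on an existing cluster.
gcloud container clusters update <cluster-name> --zone=<zone-id> --update-addons=GcePersistentDiskCsiDriver=DISABLED
gcloud container clusters update <cluster-name> --zone=<zone-id> --update-addons=GcePersistentDiskCsiDriver=ENABLED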
I am on v1.24.0-gke.1801 now and the component version for CSI is 0.12.4. So far, no more issues with ephemeral nodes and volume attachments. I hope this works now.
Thanks for the data point, @kekscode. We hope it works now too :-) It's useful to hear your confirmation.
What happened:
After running any of the following actions:
- gcloud --project <project-id> compute instances delete <node-id-to-delete> --zone=<zone-id>
- gcloud --project <project-id> compute instances simulate-maintenance-event <node-id-to-delete> --zone=<zone-id>
- preemption of a node running on a spot instance
the following happens:
- <node-id-to-delete> gets removed from the GKE cluster as expected.
- Pods get stuck initializing on the assigned node: the VolumeAttachment still shows as attached: true to the previous node that was deleted/removed, with a detachError.
- The Pod gets stuck permanently in the init state. The only way to fix it is to manually edit the VolumeAttachment and delete the entry finalizers: external-attacher/pd-csi-storage-gke-io (see the sketch after this list).
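A minimal sketch of that manual fix, assuming kubectl access; <volumeattachment-name> is a placeholder for the stuck attachment found with the first command, and patching is just one way to strip the finalizer instead of editing the object by hand:

# Find the VolumeAttachment still pointing at the deleted node.
kubectl get volumeattachments
# Remove the external-attacher finalizer so the stuck object can be cleaned up, then delete it.
kubectl patch volumeattachment <volumeattachment-name> --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete volumeattachment <volumeattachment-name>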
What you expected to happen:
The VolumeAttachment is detached from the non-existing node after the node is deleted with the gcloud CLI commands, either compute instances delete or compute instances simulate-maintenance-event.
Environment:
- v1.22.8-gke.200
- v2.5.0-gke.1
- v1.5.1-gke.0