Race condition between csi-driver and GCP #1290
Comments
Hmm, arcus LIFO queuing is the ultimate problem here. We'd fixed a bunch of these races with the error backoff (I think it was), but it seems there are still a few out there. I'm not sure what the right fix is, TBH. Since the volume is never marked as attached on the old node, the attacher won't know that it needs to be detached.
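For context, the error backoff mentioned above is essentially a throttle on repeated failing cloud operations. Below is a minimal sketch of that idea in Go; the types and names are hypothetical illustrations, not the driver's actual implementation:

```go
package main

import (
	"fmt"
	"time"
)

// backoffEntry tracks retry state for one (volume, node) pair.
type backoffEntry struct {
	delay   time.Duration // current wait before the next retry
	lastTry time.Time     // when the last attempt was made
}

// errorBackoff defers repeated attach/detach attempts that keep
// failing, doubling the wait after each failure up to a cap.
type errorBackoff struct {
	entries map[string]*backoffEntry
	initial time.Duration
	max     time.Duration
}

func newErrorBackoff() *errorBackoff {
	return &errorBackoff{
		entries: make(map[string]*backoffEntry),
		initial: 5 * time.Second,
		max:     5 * time.Minute,
	}
}

// blocked reports whether a new attempt for key should be deferred.
func (b *errorBackoff) blocked(key string) bool {
	e, ok := b.entries[key]
	if !ok {
		return false
	}
	return time.Since(e.lastTry) < e.delay
}

// next records a failed attempt and doubles the delay.
func (b *errorBackoff) next(key string) {
	e, ok := b.entries[key]
	if !ok {
		e = &backoffEntry{delay: b.initial}
		b.entries[key] = e
	} else {
		e.delay *= 2
		if e.delay > b.max {
			e.delay = b.max
		}
	}
	e.lastTry = time.Now()
}

// reset clears the state after a success.
func (b *errorBackoff) reset(key string) { delete(b.entries, key) }

func main() {
	b := newErrorBackoff()
	key := "disk-1/node-x"
	b.next(key)                              // record a failure
	fmt.Println("blocked:", b.blocked(key)) // true within the 5s window
}
```

A real implementation would also need locking for concurrent reconcile loops; this sketch stays single-threaded for clarity.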
A fix in arcus (the GCE/PD control plane) is actually in the process of rolling out. This fix will enforce FIFO ordering of operations and will merge things like op 1 and op 2 in your example. The rollout should be complete in about a month. The workarounds we've discussed for this at the CSI layer all have various levels of hackery and danger around them, so I think it's best to just wait for the arcus fix.
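For illustration, FIFO enforcement with merging can be pictured as a per-disk queue in which an attach immediately followed by a detach of the same disk/node pair cancels out. This is a hypothetical Go sketch of the concept only; the actual arcus implementation is not public:

```go
package main

import "fmt"

// op represents a queued control-plane operation on a disk.
type op struct {
	kind string // "attach" or "detach"
	disk string
	node string
}

// fifoQueue processes operations for one disk strictly in order and
// merges an attach that is immediately followed by a detach for the
// same disk/node pair (they cancel out, so neither needs to run).
type fifoQueue struct {
	ops []op
}

func (q *fifoQueue) enqueue(o op) {
	n := len(q.ops)
	if n > 0 {
		last := q.ops[n-1]
		if last.disk == o.disk && last.node == o.node &&
			last.kind == "attach" && o.kind == "detach" {
			// Merge: the pending attach never needs to happen.
			q.ops = q.ops[:n-1]
			return
		}
	}
	q.ops = append(q.ops, o)
}

func main() {
	q := &fifoQueue{}
	q.enqueue(op{"attach", "disk-1", "node-x"}) // op 1
	q.enqueue(op{"detach", "disk-1", "node-x"}) // op 2: merges with op 1
	q.enqueue(op{"attach", "disk-1", "node-y"}) // op 3
	fmt.Println(q.ops)                          // only the attach to node-y remains
}
```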
Thank you for this follow-up! Glad to hear about arcus enforcing FIFO soon 👍
qq: has the above fix been rolled out in the meantime?
The fix is currently rolling out. Should be complete within the next few weeks.
Thanks @msau42! Please let us know in this issue when the fix rollout is complete. We continue to see the above-described issue in our GCP clusters.
Is there a way for us to track this fix? Any issue or commit that we can look forward to? Thank you.
There's no public tracker for the arcus rollout, unfortunately. There were some problems detected late last year that had to be fixed, and the final rollout is in progress now.
Any update on the rollout of the fix?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/reopen
@ialidzhikov: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Sorry for dropping this. The fix was rolled out by the end of January. Any race conditions seen recently are due to something else and may be worth looking into fixing in this driver. /close
Thank you for the update!
We frequently run into situations where a pod's volume cannot be attached to some node Y because, on GCP, it is still attached to a node X where the pod was previously located. In K8s, however, there are no traces of the volume being attached to node X: there is no `volumeattachment` resource mapping the volume to node X, and node X's `.status.volumesAttached`/`.status.volumesInUse` shows no sign of that volume, which indicates that it was (at some point in time) successfully detached from X.

After a lot of digging (in gcp-csi-driver and GCP audit logs), I found the following race condition, which presumably happens because sequential operations are not ordered and ongoing operations are not locked. This is the ordered sequence of events (a sketch of a possible serialization workaround follows the list):

1. The disk is attached to node X (`gcp-operation-ID: 1`); the operation hangs in the GCP control plane for a long time.
2. The disk is detached from node X (`gcp-operation-ID: 2`); the operation succeeds.
3. The disk is attached to node Y (`gcp-operation-ID: 3`); the operation succeeds.
4. `gcp-operation-ID: 1` (resurrected from the dead) finally succeeds; the disk is attached to node X again.
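For completeness, the CSI-layer workarounds alluded to earlier in the thread would amount to serializing attach/detach operations per volume, so a detach cannot overtake an in-flight attach. Here is a hypothetical Go sketch of that idea (names and structure are illustrative, not the driver's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// volumeLocks serializes attach/detach calls per volume ID, so that a
// detach issued while an attach is still in flight must wait for it.
type volumeLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newVolumeLocks() *volumeLocks {
	return &volumeLocks{locks: make(map[string]*sync.Mutex)}
}

// lock returns the mutex guarding volumeID, creating it on first use.
func (v *volumeLocks) lock(volumeID string) *sync.Mutex {
	v.mu.Lock()
	defer v.mu.Unlock()
	l, ok := v.locks[volumeID]
	if !ok {
		l = &sync.Mutex{}
		v.locks[volumeID] = l
	}
	return l
}

func main() {
	locks := newVolumeLocks()
	var wg sync.WaitGroup

	attach := func(vol, node string) {
		defer wg.Done()
		l := locks.lock(vol)
		l.Lock()
		defer l.Unlock()
		fmt.Printf("attach %s -> %s\n", vol, node)
	}
	detach := func(vol, node string) {
		defer wg.Done()
		l := locks.lock(vol)
		l.Lock()
		defer l.Unlock()
		fmt.Printf("detach %s <- %s\n", vol, node)
	}

	wg.Add(2)
	go attach("disk-1", "node-x")
	go detach("disk-1", "node-x")
	wg.Wait()
}
```

Note that such in-process serialization cannot fully close this race: `gcp-operation-ID: 1` was still pending inside the GCE control plane, outside the driver's view, which is why the thread concludes the proper fix belongs in arcus.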