Race condition between csi-driver and GCP #1290

dguendisch · 2023-07-06T20:43:36Z

We frequently run into situations where a pod's volume cannot be attached to some node Y because, on GCP, it is still attached to a node X where the pod was previously located. In K8s, however, there are no traces of the volume being attached to node X: there is no VolumeAttachment resource mapping the volume to node X, and node X's .status.volumesAttached/volumesInUse shows no sign of that volume. This indicates that it was (at some point in time) successfully detached from X.
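The mismatch described above can be expressed as a small consistency check. This is only a sketch: the function name and the plain dict/set inputs are hypothetical, and in a real cluster they would have to be filled from GCP's view of the disk and from the VolumeAttachment objects in Kubernetes.

```python
def find_stale_attachments(gcp_disk_users, k8s_attachments):
    """Return (disk, node) pairs that GCP reports but Kubernetes does not know.

    gcp_disk_users: dict mapping disk name -> set of node names (GCP's view)
    k8s_attachments: set of (disk, node) pairs (Kubernetes' view)
    Both inputs are hypothetical, pre-collected snapshots.
    """
    stale = []
    for disk, nodes in sorted(gcp_disk_users.items()):
        for node in sorted(nodes):
            if (disk, node) not in k8s_attachments:
                stale.append((disk, node))
    return stale


# The situation from this issue: GCP still lists node X, Kubernetes lists nothing.
gcp_view = {"pv-disk-1": {"node-x"}}
k8s_view = set()
stale = find_stale_attachments(gcp_view, k8s_view)
print(stale)
```

Any pair returned here is an attachment that the attacher will never clean up on its own, because nothing in Kubernetes records it.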

After a lot of digging (in the gcp-csi-driver and GCP audit logs), I found the following race condition, which presumably happens because there is no ordering of sequential operations and no locking of ongoing operations. This is the ordered sequence of events:

1. csi-driver attaches disk to node X; gcp-csi-driver times out, but GCP tracked the request (gcp-operation-ID: 1)
2. csi-driver attaches disk to node X again; this time it succeeds (gcp-operation-ID: 2)
3. pod gets rescheduled to another node Y about 2 mins later, so the volume must move from node X to node Y
4. csi-driver detaches disk from node X and succeeds (gcp-operation-ID: 3)
5. now gcp-operation-ID: 1 (resurrected from the dead) finally succeeds; disk is attached to node X again
6. csi-driver tries to attach disk to node Y (because of the pod reschedule) and never succeeds
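The sequence above can be replayed with a toy model to show why the completion order matters. This is a minimal sketch, not driver code: `replay` and `can_attach` are hypothetical helpers, and the rule that the disk can be attached to only one node at a time is an assumption of the model.

```python
def replay(completion_order):
    """Apply attach/detach operations in the order they complete."""
    attached = set()
    for kind, node in completion_order:
        if kind == "attach":
            attached.add(node)
        else:
            attached.discard(node)
    return attached


def can_attach(attached, node):
    # Model assumption: the disk may be attached to one node at a time.
    return not attached or attached == {node}


# Operation IDs as in the sequence above.
ops = {1: ("attach", "X"), 2: ("attach", "X"), 3: ("detach", "X")}

issued_order = [ops[i] for i in (1, 2, 3)]    # order the driver sent them
observed_order = [ops[i] for i in (2, 3, 1)]  # op 1 completes last

expected_state = replay(issued_order)
actual_state = replay(observed_order)

print(expected_state, can_attach(expected_state, "Y"))  # set() True
print(actual_state, can_attach(actual_state, "Y"))      # {'X'} False
```

The driver only saw ops 2 and 3 succeed, so its view matches `expected_state`, while GCP ends up in `actual_state`.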

mattcary · 2023-08-03T19:58:33Z

Hmm, arcus lifo queuing is the ultimate problem here. We'd fixed a bunch of these races with the error backoff (I think it was), but it seems there are still a few out there.

I'm not sure what the right fix is TBH. Since the volume is never marked on the old node, the attacher won't know that it needs to be detached.

mattcary · 2023-08-03T20:49:49Z

A fix in arcus (the GCE/PD control plane) is actually in the process of rolling out. This fix will enforce fifo ordering of operations and will merge things like op 1 and op 2 in your example. The rollout should be complete in about a month.

The workarounds we've discussed for this at the CSI layer all have various levels of hackery and danger around them, so I think it's best to just wait for the arcus fix.
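To illustrate what "enforce fifo and merge duplicate operations" could mean for the sequence in this issue, here is a minimal sketch. It is not how arcus is implemented (that is internal to GCP); `Op` and `FifoOpQueue` are hypothetical names, and merging only with the most recent pending operation on the same disk is an assumption of this model.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Op:
    op_id: int
    kind: str  # "attach" or "detach"
    disk: str
    node: str


class FifoOpQueue:
    def __init__(self):
        self._pending = deque()
        self._next_id = 1

    def submit(self, kind, disk, node):
        # Merge only with the most recent pending op for this disk, so an
        # intervening detach is never skipped over.
        for op in reversed(self._pending):
            if op.disk == disk:
                if (op.kind, op.node) == (kind, node):
                    return op.op_id  # duplicate request, reuse existing op
                break
        op = Op(self._next_id, kind, disk, node)
        self._next_id += 1
        self._pending.append(op)
        return op.op_id

    def drain(self):
        """Execute pending ops strictly in submission order."""
        attachments = {}
        while self._pending:
            op = self._pending.popleft()
            nodes = attachments.setdefault(op.disk, set())
            if op.kind == "attach":
                nodes.add(op.node)
            else:
                nodes.discard(op.node)
        return attachments


queue = FifoOpQueue()
first = queue.submit("attach", "disk-1", "X")  # times out on the client side
retry = queue.submit("attach", "disk-1", "X")  # merged into the first op
queue.submit("detach", "disk-1", "X")
queue.submit("attach", "disk-1", "Y")
final_state = queue.drain()
print(first == retry, final_state)
```

With ordering enforced, the timed-out attach cannot complete after the detach, so the disk ends up only on node Y.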

dguendisch · 2023-08-04T10:47:00Z

Thank you for this follow up! Glad to hear about arcus enforcing fifo soon 👍

dguendisch · 2023-10-04T10:11:44Z

qq: is the above fix meanwhile rolled out?

ialidzhikov · 2023-12-05T13:13:53Z

> qq: is the above fix meanwhile rolled out?

@mattcary @msau42 any news about the above question?

ialidzhikov · 2023-12-09T15:28:31Z

> qq: is the above fix meanwhile rolled out?
>
> @mattcary @msau42 any news about the above question?

ping

msau42 · 2023-12-28T21:46:52Z

The fix is currently rolling out. Should be complete within the next few weeks.

ialidzhikov · 2024-01-03T14:03:58Z

Thanks @msau42 ! Please let us know in this issue when the fix rollout is complete. We continue to see the above-described issue in our GCP clusters.

jahantech · 2024-01-17T14:46:18Z

Is there a way for us to track this fix? Any issue or commit that we can look forward to?

Thank you.

mattcary · 2024-01-17T15:55:53Z

There's no public tracker for the arcus rollout, unfortunately. There were some problems detected late last year that had to be fixed, and the final rollout is in progress now.

adenitiu · 2024-04-06T08:14:12Z

Any update on the rollout of the fix ?

k8s-triage-robot · 2024-07-05T08:45:41Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-08-04T08:52:54Z

/lifecycle rotten

k8s-triage-robot · 2024-09-03T09:42:29Z

/close not-planned

k8s-ci-robot · 2024-09-03T09:42:34Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

ialidzhikov · 2024-09-03T10:41:20Z

/reopen
/remove-lifecycle rotten

k8s-ci-robot · 2024-09-03T10:41:24Z

@ialidzhikov: You can't reopen an issue/PR unless you authored it or you are a collaborator.

ialidzhikov · 2024-09-03T10:42:19Z

@msau42 @mattcary can you confirm that the fix rollout in GCP is complete?

mattcary · 2024-09-03T20:14:47Z

Sorry for dropping this. The fix was rolled out by the end of January. Any race conditions seen recently are due to something else and may be worth looking into fixing in this driver.

/close

ialidzhikov · 2024-09-04T06:11:45Z

Thank you for the update!

dguendisch mentioned this issue Jul 6, 2023

DisableDevice not working as expected #1146

Closed

k8s-ci-robot added the lifecycle/stale label Jul 5, 2024

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Aug 4, 2024

k8s-ci-robot closed this as not planned Sep 3, 2024

k8s-ci-robot removed the lifecycle/rotten label Sep 3, 2024
