Force timeout nodeunstagevolume #1918
Conversation
Welcome @davis-haba!
Hi @davis-haba. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/cc @pwschuurman
Force-pushed from d6e7c09 to 10719ad
/test all
@davis-haba: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
@davis-haba: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
I'll let @pwschuurman comment on the PR.
/ok-to-test
/retest
1 similar comment
/retest
// Wait 35s (30s timeout + 5s buffer) and try again
time.Sleep(35 * time.Second)
err = client.NodeUnstageVolume(volID, stageDir)
Expect(err).To(BeNil(), "Failed to unpublish after 30s in-use timeout for volume: %s, stageDir: %s", volID, stageDir)
I think this should succeed, right?
Yea, unless I'm missing something, it does succeed. Expects err to be nil.
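For reference, the intended sequence in this e2e test reads roughly like the sketch below; the client helper and the first-attempt expectation are assumptions about the surrounding test, not the exact code in this PR.

```go
// Sketch (assumed surrounding test code): the first unstage attempt while the
// device is still in use is expected to fail with a device-in-use error.
err := client.NodeUnstageVolume(volID, stageDir)
Expect(err).To(HaveOccurred(), "expected an in-use error on the first unstage attempt")

// Wait 35s (30s timeout + 5s buffer) and try again; after the timeout the
// driver ignores the in-use condition and lets the unstage succeed.
time.Sleep(35 * time.Second)
err = client.NodeUnstageVolume(volID, stageDir)
Expect(err).To(BeNil(), "failed to unstage after the 30s in-use timeout")
```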
Force-pushed from c7d97eb to 6d78a36
/retest
Force-pushed from 3c10de7 to c7e8c05
cmd/gce-pd-csi-driver/main.go
Outdated
@@ -64,6 +64,9 @@ var (
	waitForOpBackoffSteps = flag.Int("wait-op-backoff-steps", 100, "Steps for wait for operation backoff")
	waitForOpBackoffCap = flag.Duration("wait-op-backoff-cap", 0, "Cap for wait for operation backoff")

	enableDeviceInUseTimeout = flag.Bool("enable-device-in-use-timeout", true, "If set to true, ignores device in use errors when attempting to unstage a device if it has been stuck for longer than 'device-in-use-timeout'")
I think the boolean flag should bypass the "device in use" check entirely, and be renamed enable-device-in-use-check-on-node-unstage. I don't see much value in having a boolean for using the timeout or not (e.g. if a user really wants to disable the timeout, they can set device-in-use-timeout to infinity). I think the value of the switch is more for users where the device in use check is blocking NodeUnstage due to the way /sys/fs/ is reacting, and they want to skip it entirely, for all disks.
In retrospect, this is something that I should have added in #1658, to allow the feature to be turned off.
Ah, I see what you mean. Fixed: confirmDeviceUnused in node.go will now always return nil if the flag is true.
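A minimal sketch of that gating, interpreting the renamed flag as "check enabled"; the type, field name, and flag polarity below are assumptions for illustration, not the merged code.

```go
// nodeServerStub is a pared-down stand-in carrying only the field relevant here.
type nodeServerStub struct {
	deviceInUseCheckEnabled bool // assumed wiring of --enable-device-in-use-check-on-node-unstage
}

// confirmDeviceUnused short-circuits when the check is disabled, so NodeUnstage
// never blocks on the /sys/fs in-use signal in that configuration.
func (ns *nodeServerStub) confirmDeviceUnused(volumeID string) error {
	if !ns.deviceInUseCheckEnabled {
		// Check disabled by flag: always treat the device as unused.
		return nil
	}
	// ... the existing /sys/fs based in-use verification would run here ...
	return nil
}
```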
pkg/gce-pd-csi-driver/node.go
Outdated
klog.Warningf("Unabled to check if device for %s is unused. Device has been unmounted successfully. Ignoring and continuing with unstaging. (%v)", volumeID, err) | ||
} else if ns.deviceInUseErrors.checkDeviceErrorTimeout(volumeID) { |
There are some edge cases when a NodeUnstage gets skipped due to force detach or manual intervention. I don't know how often these could occur, and I think generally they're just theoretical problems:
- A pre-existing old deviceInUse, resulting in the device in use signal being ignored. I'm less concerned about this one, as worst case it wouldn't block a user's workload, and this is a best effort deviceInUse check. We could solve this with a TTL cache, but I don't see an existing reference implementation we could easily use.
- Device entries growing unbounded, if there are new unique volume IDs being added every now and then, and NodeUnstage gets skipped. In GCE the maximum number of concurrent disks is 128, so we could set a maximum cache size to something larger, say 256 or 512, to bound the growth. This would allow for a fresh set of ~128 concurrent disks with room for some overlap. We could solve this with an LRU cache (github.com/hashicorp/golang-lru); see the sketch after this list.
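A minimal, self-contained sketch of the bounded-LRU idea using github.com/hashicorp/golang-lru/v2; the cache size and volume ID below are illustrative, not values decided in this PR.

```go
package main

import (
	"fmt"
	"time"

	lru "github.com/hashicorp/golang-lru/v2"
)

func main() {
	// Bound the tracked devices well above GCE's 128-disk-per-VM limit so old
	// entries are evicted instead of growing without bound.
	const maxDeviceCacheSize = 256

	cache, err := lru.New[string, time.Time](maxDeviceCacheSize)
	if err != nil {
		panic(err)
	}

	// Record when an in-use error was first seen for a volume.
	cache.Add("vol-example-1", time.Now())

	if firstSeen, ok := cache.Get("vol-example-1"); ok {
		fmt.Println("first in-use error observed at", firstSeen)
	}
}
```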
I implemented the LRU cache you suggested to address point 2.
Regarding point 1, it looks like the library you mentioned also has a cache that will auto-expire keys (e.g. TTL): https://github.com/hashicorp/golang-lru?tab=readme-ov-file#expirable-lru-cache-example
Do we want to use this? What would a reasonable TTL be? Maybe deviceInUseTimeout * 2?
Do we want to use this? What would a reasonable TTL be? maybe deviceInUseTimeout * 2?
Yeah, that seems reasonable.
Thanks. I added the expirable cache.
c := expirable.NewLRU[string, time.Time](maxDeviceCacheSize, nil, timeout*2)
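Expanded into context, the structure might look roughly like the sketch below; the deviceErrMap name and fields follow the snippets quoted in this thread, but the rest is an assumption rather than the merged code.

```go
package deviceinuse

import (
	"sync"
	"time"

	"github.com/hashicorp/golang-lru/v2/expirable"
)

const maxDeviceCacheSize = 256 // assumed bound, per the review discussion

// deviceErrMap remembers when an in-use error was first seen for each volume.
// Entries expire on their own after timeout*2, so stale volumes do not linger
// if NodeUnstage is never retried for them.
type deviceErrMap struct {
	mux     sync.Mutex
	cache   *expirable.LRU[string, time.Time]
	timeout time.Duration
}

func newDeviceErrMap(timeout time.Duration) *deviceErrMap {
	c := expirable.NewLRU[string, time.Time](maxDeviceCacheSize, nil, timeout*2)
	return &deviceErrMap{cache: c, timeout: timeout}
}

// markDeviceError records the first time an in-use error is seen for deviceName;
// later errors for the same device keep the original timestamp.
func (devErrMap *deviceErrMap) markDeviceError(deviceName string) {
	devErrMap.mux.Lock()
	defer devErrMap.mux.Unlock()
	if _, exists := devErrMap.cache.Get(deviceName); !exists {
		devErrMap.cache.Add(deviceName, time.Now())
	}
}
```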
Force-pushed from 391038a to 834b272
/retest
devErrMap.mux.Lock()
defer devErrMap.mux.Unlock()

lastErrTime, exists := devErrMap.cache.Get(deviceName)
Wouldn't the cache hold the first error time, rather than the last? Maybe rename this to indicate something like "first time error encountered".
Yes, good call. Renamed to firstEncounteredErrTime.
defer devErrMap.mux.Unlock()

lastErrTime, exists := devErrMap.cache.Get(deviceName)
return exists && currentTime().Sub(lastErrTime).Seconds() >= devErrMap.timeout.Seconds()
Might be more readable to compare Time objects with After():

expirationTime := lastErrTime.Add(devErrMap.timeout)
return exists && currentTime().After(expirationTime)
Done
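Combining the firstEncounteredErrTime rename with the After() comparison, the check might end up roughly as below; a sketch building on the deviceErrMap type above, with time.Now() standing in for the test-injectable currentTime() helper.

```go
// checkDeviceErrorTimeout reports whether deviceName has been failing with an
// in-use error for at least devErrMap.timeout.
func (devErrMap *deviceErrMap) checkDeviceErrorTimeout(deviceName string) bool {
	devErrMap.mux.Lock()
	defer devErrMap.mux.Unlock()

	firstEncounteredErrTime, exists := devErrMap.cache.Get(deviceName)
	if !exists {
		return false
	}
	expirationTime := firstEncounteredErrTime.Add(devErrMap.timeout)
	return time.Now().After(expirationTime)
}
```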
Force-pushed from cea18ad to e3fc317
/retest
1 similar comment
/retest
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: davis-haba, pwschuurman. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 1a4056b to 33c2d9e
…been released due to an error for more than 30 seconds
… specified to do so
Force-pushed from 33c2d9e to f15a491
@davis-haba: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/retest
/lgtm
What type of PR is this?
/kind bug
What this PR does / why we need it:
Adds a 30 second timeout to NodeUnstageVolume. If a device has been stuck in use for longer than 30 seconds, the next NodeUnstageVolume attempt ignores the device-in-use error and succeeds.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
2 CMEK E2E tests are failing locally, but I believe this is due to test setup issues, as this change is entirely unrelated to CMEK.
Does this PR introduce a user-facing change?: