Missing disks causes nodes to stick due to backoff #1029

mattcary · 2022-07-23T01:09:31Z

Around 07/13, the issue started when a Pod attempted to use a non-existent volume, which we'll denote X (a statically created PD, not one dynamically provisioned by the driver). Presumably the disk existed at some point, because it was correctly attached to a node, and the PV, PVC, and volume attachment (VA) still exist.

The VA csi-966b5b879f8fb9137a305680db1aff4e38231718d4dbc7883410d7a58530f161 was created pointing to that volume and a node Y. The attachment failed with a missing-disk error, which put a backoff condition on the node.

The pod was then deleted, which added a deletion timestamp to the VA object and triggered the detach workflow in csi-attacher. Detach keeps failing indefinitely because of the same missing-disk error, which also keeps the node in the backoff condition.

At the same time, another pod and PVC were scheduled to the node, but the repeated detach attempts happened too quickly for the attach to ever succeed; instead, it kept hitting the backoff.

We've decided on two actions. One, to be tracked with this issue, is to key the backoff by both disk and node. See the PR about to mention this issue.
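To illustrate the idea of keying the backoff by both disk and node, here is a minimal sketch in Python. It is not the driver's code: the class `KeyedBackoff`, its methods, and the delay values are hypothetical names chosen for illustration. The point is only that a failure recorded for (node Y, disk X) should not block an operation on (node Y, some other disk).

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class KeyedBackoff:
    """Exponential backoff tracked per key (hypothetical helper, not the driver's API)."""

    def __init__(self, initial: float = 0.2, maximum: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._initial = initial
        self._maximum = maximum
        self._clock = clock
        # key -> (current delay in seconds, earliest time a retry is allowed)
        self._entries: Dict[Hashable, Tuple[float, float]] = {}

    def record_failure(self, key: Hashable) -> None:
        """Start a backoff for key, or double its delay up to the maximum."""
        previous = self._entries.get(key)
        delay = self._initial if previous is None else min(previous[0] * 2, self._maximum)
        self._entries[key] = (delay, self._clock() + delay)

    def record_success(self, key: Hashable) -> None:
        """Clear any backoff state for key."""
        self._entries.pop(key, None)

    def blocked(self, key: Hashable) -> bool:
        """Return True while key is still inside its backoff window."""
        entry = self._entries.get(key)
        return entry is not None and self._clock() < entry[1]


if __name__ == "__main__":
    backoff = KeyedBackoff()
    # The detach of missing disk X on node Y fails and backs off that pair only.
    backoff.record_failure(("node-y", "disk-x"))
    print(backoff.blocked(("node-y", "disk-x")))      # the failing pair is blocked
    print(backoff.blocked(("node-y", "other-disk")))  # another disk on the same node is not
```

With a node-only key, both lookups above would share one entry, which is the situation described in this issue: the failing detach of X keeps renewing the backoff that the unrelated attach also has to wait on.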

The other, which will be tracked in a different issue, will be to assume a missing disk has been detached if it remains not found for a long enough time. Watch for the new issue.
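As a rough sketch of that second idea, the Python below tracks how long a disk has been continuously reported as not found and only then treats it as detached. Everything here is an assumption for illustration: the class `MissingDiskTracker`, its method names, and the 10-minute grace period are hypothetical, and the actual design is left to the other issue.

```python
import time
from typing import Callable, Dict


class MissingDiskTracker:
    """Remembers when each disk was first seen as not found (hypothetical helper)."""

    def __init__(self, grace_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._grace = grace_seconds
        self._clock = clock
        # disk name -> time of the first consecutive "not found" observation
        self._first_missing: Dict[str, float] = {}

    def observe_not_found(self, disk: str) -> bool:
        """Record a not-found result; return True once the grace period has elapsed."""
        now = self._clock()
        first = self._first_missing.setdefault(disk, now)
        return now - first >= self._grace

    def observe_found(self, disk: str) -> None:
        """The disk exists again, so forget any earlier not-found observations."""
        self._first_missing.pop(disk, None)
```

A detach handler could call `observe_not_found` on each missing-disk error and report success once it returns True, so the VA can be cleaned up instead of retrying forever. The grace period guards against treating a transient lookup failure as a permanent deletion.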

mattcary mentioned this issue Jul 23, 2022

backoff per {node,disk} pair instead of just node #1028

Merged

k8s-ci-robot closed this as completed in #1028 Jul 25, 2022
