Filter multiattach errors #1559
@@ -374,6 +377,17 @@ func isContextError(err error) (codes.Code, error) {
	return codes.Unknown, fmt.Errorf("Not a context error: %w", err)
}

// isUserMultiAttachError returns an InvalidArgument if the error is
// multi-attach detected from the API server. If we get this error from the API
// server, it means that the kubelet doesn't know about the multiattch so it is
I think it is possible there could be a race condition in K8s that also triggers this.
For example, with StatefulSet, the replacement Pod is created with the same name when the old Pod is deleted. Pod deletion is blocked on pod-volume unmounting, but not node-level unmount or detach. So a replacement Pod can be created before we have successfully detached.
Yes, but in that case the kubelet knows the volume is still attached, so the controller will figure out not to attach? https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L341
(I think the only time I've seen this error from GCP is when the user has made two static PVs that refer to the same disk; at least that's the case in the SLOs that are currently firing.)
Ah I see, in the race condition I am thinking of, ADC prevents the attach call from getting down to the CSI driver. So filtering the error at the CSI driver level is fine.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mattcary, msau42
/cherry-pick release-1.12
@mattcary: new pull request created: #1560
/kind bug
What this PR does / why we need it:
User misconfiguration causing multiattach errors clouds up our SLO.
/assign @msau42