-
Notifications
You must be signed in to change notification settings - Fork 159
Filter out GCE operation error codes caused by user errors #1261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Filter out GCE operation error codes caused by user errors #1261
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: amacaskill The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/assign @saikat-royc |
2f66350
to
eeb01ef
Compare
eeb01ef
to
c87159c
Compare
c87159c
to
dbc1c43
Compare
dbc1c43
to
c0c3af1
Compare
"RESOURCE_IN_USE_BY_ANOTHER_RESOURCE": codes.InvalidArgument, | ||
"OPERATION_CANCELED_BY_USER": codes.Aborted, | ||
"QUOTA_EXCEEDED": codes.ResourceExhausted, | ||
"ZONE_RESOURCE_POOL_EXHAUSTED": codes.Unavailable, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use codes.ResourceExhausted for ZONE_RESOURCE_POOL_EXHAUSTED, ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these codes, it looks like GCE maps it to the Unavailable grpc error code, so that is why I chose Unavailable. That being said, we don't filter out Unavailable fromm our SLO, so if we don't want these, then I should use resource exhausted. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We looked at this for the CreateVolume SLO as well and ultimately decided we would want to get notified if users were hitting an irregular amount of ZONE_RESOURCE_POOL_EXHAUSTED errors since they're not under their control. So +1 to Unavailable since it allows us to track it (or not) separately from other user-caused ResourceExhausted errors.
Also it looks like this PR accomplishes what I was doing in #1263 (classifying QUOTA_EXCEEDED as ResourceExhausted) so I'll close that - thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. Please capture the reasoning in a comment in the code
@@ -856,14 +856,40 @@ func (cloud *CloudProvider) WaitForAttach(ctx context.Context, project string, v | |||
} | |||
|
|||
func wrapOpErr(name string, opErr *computev1.OperationErrorErrors) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also what about error handling for cases where the operation is not even initiated, and we encounter error in say insertDisk, attachDisk where the WaitForOp is not called. e.g code
Also we may need to handle uncastable error (i.e errors.As cannot case to gce.UnsupportedDiskError) like the ones we saw in filestore. May need some manual testing here (1. insert disk when quota is exhausted, 2. attachdisk when disk already attached) or we can address them lazily if we encounter them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have looked into quiet a few clusters, and I haven't seen any errors that are incorrectly returned as an Internal error at this point ( where the operation is not even initiated). I will look into some more clusters, but I have probably looked into at least 20 at this point, and I haven't seen errors like that.
Also we may need to handle uncastable error (i.e errors.As cannot cast to gce.UnsupportedDiskError) like the ones we saw in filestore.
I'm a little confused by this comment. Are you saying we could be missing some cases where we need to return an UnsupportedDiskError (which we do when the operation error has the UNSUPPORTED_OPERATION code). This won't happen like in filestore because wrapOpErr passes in an operation error of type computev1.OperationErrorErrors directly. This error type doesn't actually implement the error interface, so this will never happen. (errors.As cannot even be run on the operation error of type computev1.OperationErrorErrors since it doesn't implement error interface).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. I missed the OperationErrorErrors type.
And yes, for unsupported operations, I think we would return kIternal for the UnsupportedDiskError. Until we repro this scenario, it would be difficult to handle this error. So that would be a followup for future to understand the sceario better.
As discussed, lets capture the known issues in some task:
- AttachDisk errors not handled by errors.As() code
- InsertDisk, DeleteDisk, DetachDisk errors are not handled.
/lgtm |
/cherry-pick release-1.10 |
@amacaskill: new pull request created: #1292 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cherry-pick release-1.9 |
@judemars: new pull request created: #1322 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cherry-pick release-1.8 |
@amacaskill: new pull request created: #1475 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cherry-pick release-1.7 |
@amacaskill: new pull request created: #1476 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Filter out GCE operation error codes caused by user errors. We are getting lots of errors counting against our SLO for GCE operation errors that are user errors. We have seen Internal errors returned from the following GCE operation error codes:
OPERATION_CANCELED_BY_USER
,RESOURCE_NOT_FOUND
,RESOURCE_IN_USE_BY_ANOTHER_RESOURCE
,INVALID_USAGE
. I added a filtered out other error codes (quota error codes) which we haven't observed yet, but those are clearly user error codes. The errors that we have seen so far look like this:Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: