-
Notifications
You must be signed in to change notification settings - Fork 159
Filter out GCE operation error codes caused by user errors #1261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -856,14 +856,40 @@ func (cloud *CloudProvider) WaitForAttach(ctx context.Context, project string, v | |
} | ||
|
||
func wrapOpErr(name string, opErr *computev1.OperationErrorErrors) error { | ||
if opErr == nil { | ||
return nil | ||
} | ||
|
||
if opErr.Code == "UNSUPPORTED_OPERATION" { | ||
if diskType := pdDiskTypeUnsupportedRegex.FindStringSubmatch(opErr.Message); diskType != nil { | ||
return &UnsupportedDiskError{ | ||
DiskType: diskType[1], | ||
} | ||
} | ||
} | ||
return fmt.Errorf("operation %v failed (%v): %v", name, opErr.Code, opErr.Message) | ||
grpcErrCode := codeForGCEOpError(*opErr) | ||
return status.Errorf(grpcErrCode, "operation %v failed (%v): %v", name, opErr.Code, opErr.Message) | ||
} | ||
|
||
// codeForGCEOpError return the grpc error code for the passed in | ||
// gce operation error. | ||
func codeForGCEOpError(err computev1.OperationErrorErrors) codes.Code { | ||
userErrors := map[string]codes.Code{ | ||
"RESOURCE_NOT_FOUND": codes.NotFound, | ||
"RESOURCE_ALREADY_EXISTS": codes.AlreadyExists, | ||
"RESOURCE_IN_USE_BY_ANOTHER_RESOURCE": codes.InvalidArgument, | ||
"OPERATION_CANCELED_BY_USER": codes.Aborted, | ||
"QUOTA_EXCEEDED": codes.ResourceExhausted, | ||
"ZONE_RESOURCE_POOL_EXHAUSTED": codes.Unavailable, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should we use codes.ResourceExhausted for ZONE_RESOURCE_POOL_EXHAUSTED, ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS as well There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For these codes, it looks like GCE maps it to the Unavailable grpc error code, so that is why I chose Unavailable. That being said, we don't filter out Unavailable fromm our SLO, so if we don't want these, then I should use resource exhausted. What do you think? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We looked at this for the CreateVolume SLO as well and ultimately decided we would want to get notified if users were hitting an irregular amount of ZONE_RESOURCE_POOL_EXHAUSTED errors since they're not under their control. So +1 to Unavailable since it allows us to track it (or not) separately from other user-caused ResourceExhausted errors. Also it looks like this PR accomplishes what I was doing in #1263 (classifying QUOTA_EXCEEDED as ResourceExhausted) so I'll close that - thanks! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ack. Please capture the reasoning in a comment in the code |
||
"ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS": codes.Unavailable, | ||
"REGION_QUOTA_EXCEEDED": codes.ResourceExhausted, | ||
"RATE_LIMIT_EXCEEDED": codes.ResourceExhausted, | ||
"INVALID_USAGE": codes.InvalidArgument, | ||
} | ||
if code, ok := userErrors[err.Code]; ok { | ||
return code | ||
} | ||
return codes.Internal | ||
} | ||
|
||
func opIsDone(op *computev1.Operation) (bool, error) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also what about error handling for cases where the operation is not even initiated, and we encounter error in say insertDisk, attachDisk where the WaitForOp is not called. e.g code
Also we may need to handle uncastable error (i.e errors.As cannot case to gce.UnsupportedDiskError) like the ones we saw in filestore. May need some manual testing here (1. insert disk when quota is exhausted, 2. attachdisk when disk already attached) or we can address them lazily if we encounter them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have looked into quiet a few clusters, and I haven't seen any errors that are incorrectly returned as an Internal error at this point ( where the operation is not even initiated). I will look into some more clusters, but I have probably looked into at least 20 at this point, and I haven't seen errors like that.
I'm a little confused by this comment. Are you saying we could be missing some cases where we need to return an UnsupportedDiskError (which we do when the operation error has the UNSUPPORTED_OPERATION code). This won't happen like in filestore because wrapOpErr passes in an operation error of type computev1.OperationErrorErrors directly. This error type doesn't actually implement the error interface, so this will never happen. (errors.As cannot even be run on the operation error of type computev1.OperationErrorErrors since it doesn't implement error interface).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. I missed the OperationErrorErrors type.
And yes, for unsupported operations, I think we would return kIternal for the UnsupportedDiskError. Until we repro this scenario, it would be difficult to handle this error. So that would be a followup for future to understand the sceario better.
As discussed, lets capture the known issues in some task: