Filter out GCE operation error codes caused by user errors #1261

amacaskill · 2023-06-14T17:03:32Z

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug

/kind cleanup

/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:
Filter out GCE operation error codes caused by user errors. Many of the errors counting against our SLO for GCE operation errors are actually user errors. We have seen Internal errors returned for the following GCE operation error codes: OPERATION_CANCELED_BY_USER, RESOURCE_NOT_FOUND, RESOURCE_IN_USE_BY_ANOTHER_RESOURCE, INVALID_USAGE. I also filtered out other error codes (quota error codes) that we haven't observed yet but that are clearly user errors. The errors that we have seen so far look like this:

ControllerPublishVolume returned with error: rpc error: code = Internal desc = unknown Attach error: failed when waiting for zonal op: operation operation-xxx failed (OPERATION_CANCELED_BY_USER): Operation was canceled by user ''.

ControllerPublishVolume returned with error: rpc error: code = Internal desc = unknown Attach error: failed when waiting for zonal op: operation operation-xxx failed (RESOURCE_NOT_FOUND): The resource 'projects/xxx/zones/xxx/instances/xxx' was not found

ControllerPublishVolume returned with error: rpc error: code = Internal desc = unknown Attach error: failed when waiting for zonal op: operation operation-xxx failed (RESOURCE_IN_USE_BY_ANOTHER_RESOURCE): The disk resource 'projects/xxx/zones/xxx/disks/xxx' is already being used by 'projects/xxx/zones/xxx/instances/xxx'

ControllerUnpublishVolume returned with error: rpc error: code = Internal desc = unknown detach error: operation operation-xxx failed (INVALID_USAGE): No attached disk found with device name 'pvc-xxx'
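A minimal sketch of the classification idea, not the driver's actual code. The `grpcCode` type is a local stand-in for `codes.Code` from google.golang.org/grpc/codes so the example builds with the standard library only, and `opError`, `classifyOpErr`, and `describeOpErr` are hypothetical names. The map holds only the four entries visible in the diff excerpt quoted in the review on this PR; how the other codes are mapped is not shown here, so they are left out.

```go
package main

import "fmt"

// grpcCode is a local stand-in for google.golang.org/grpc/codes.Code
// (an assumption made to keep this sketch dependency-free).
type grpcCode string

const (
	Internal          grpcCode = "Internal"
	InvalidArgument   grpcCode = "InvalidArgument"
	Aborted           grpcCode = "Aborted"
	ResourceExhausted grpcCode = "ResourceExhausted"
	Unavailable       grpcCode = "Unavailable"
)

// userErrorCodes contains only the entries shown in this PR's diff excerpt.
var userErrorCodes = map[string]grpcCode{
	"RESOURCE_IN_USE_BY_ANOTHER_RESOURCE": InvalidArgument,
	"OPERATION_CANCELED_BY_USER":          Aborted,
	"QUOTA_EXCEEDED":                      ResourceExhausted,
	"ZONE_RESOURCE_POOL_EXHAUSTED":        Unavailable,
}

// opError is a hypothetical stand-in for computev1.OperationErrorErrors.
type opError struct {
	Code    string
	Message string
}

// classifyOpErr returns the mapped code for a known GCE operation error code
// and falls back to Internal for anything it does not recognize.
func classifyOpErr(e opError) grpcCode {
	if c, ok := userErrorCodes[e.Code]; ok {
		return c
	}
	return Internal
}

// describeOpErr renders a message in the style of the log lines above.
func describeOpErr(opName string, e opError) string {
	return fmt.Sprintf("rpc error: code = %s desc = operation %s failed (%s): %s",
		classifyOpErr(e), opName, e.Code, e.Message)
}
```

With this shape, an operation failing with QUOTA_EXCEEDED would surface as ResourceExhausted instead of Internal, while any code missing from the map keeps today's Internal behavior.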

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

None

k8s-ci-robot · 2023-06-14T17:03:44Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amacaskill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [amacaskill]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

amacaskill · 2023-06-14T17:03:46Z

/assign @saikat-royc

saikat-royc · 2023-06-14T23:14:05Z

pkg/gce-cloud-provider/compute/gce-compute.go

+		"RESOURCE_IN_USE_BY_ANOTHER_RESOURCE":       codes.InvalidArgument,
+		"OPERATION_CANCELED_BY_USER":                codes.Aborted,
+		"QUOTA_EXCEEDED":                            codes.ResourceExhausted,
+		"ZONE_RESOURCE_POOL_EXHAUSTED":              codes.Unavailable,


Should we use codes.ResourceExhausted for ZONE_RESOURCE_POOL_EXHAUSTED and ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS as well?

For these codes, it looks like GCE maps them to the Unavailable gRPC error code, which is why I chose Unavailable. That said, we don't filter Unavailable out of our SLO, so if we don't want these counted, I should use ResourceExhausted. What do you think?

We looked at this for the CreateVolume SLO as well and ultimately decided we would want to get notified if users were hitting an irregular amount of ZONE_RESOURCE_POOL_EXHAUSTED errors since they're not under their control. So +1 to Unavailable since it allows us to track it (or not) separately from other user-caused ResourceExhausted errors.

Also it looks like this PR accomplishes what I was doing in #1263 (classifying QUOTA_EXCEEDED as ResourceExhausted) so I'll close that - thanks!

Ack. Please capture the reasoning in a comment in the code
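The trade-off in this thread can be illustrated with a small sketch. Everything here is an assumption for illustration: the bucket names, `bucketFor`, and `tally` are hypothetical, and the set of codes excluded from the SLO is a guess based on this discussion (only the fact that Unavailable is not filtered out is stated above).

```go
package main

// sloBucket is a hypothetical label for how an error is treated by the SLO.
type sloBucket string

const (
	bucketUserError sloBucket = "user-error" // assumed to be excluded from the SLO
	bucketTracked   sloBucket = "tracked"    // counted, so unusual volumes are noticed
)

// bucketFor takes a gRPC code name and decides which bucket it belongs to.
// The excluded set is an assumption; the thread only confirms that
// Unavailable is NOT filtered out.
func bucketFor(grpcCodeName string) sloBucket {
	switch grpcCodeName {
	case "InvalidArgument", "Aborted", "ResourceExhausted":
		return bucketUserError
	default:
		return bucketTracked
	}
}

// tally counts how many of the given gRPC code names land in each bucket.
func tally(grpcCodeNames []string) map[sloBucket]int {
	counts := make(map[sloBucket]int)
	for _, name := range grpcCodeNames {
		counts[bucketFor(name)]++
	}
	return counts
}
```

Under these assumptions, mapping ZONE_RESOURCE_POOL_EXHAUSTED to Unavailable keeps it in the tracked bucket, whereas mapping it to ResourceExhausted would fold it in with user-caused quota errors.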

saikat-royc · 2023-06-14T23:21:50Z

pkg/gce-cloud-provider/compute/gce-compute.go

@@ -856,14 +856,40 @@ func (cloud *CloudProvider) WaitForAttach(ctx context.Context, project string, v
 }

 func wrapOpErr(name string, opErr *computev1.OperationErrorErrors) error {


Also, what about error handling for cases where the operation is not even initiated and we encounter an error in, say, insertDisk or attachDisk, where WaitForOp is not called? e.g. code

Also, we may need to handle uncastable errors (i.e. errors.As cannot cast to gce.UnsupportedDiskError) like the ones we saw in filestore. This may need some manual testing (1. insert a disk when quota is exhausted, 2. attach a disk that is already attached), or we can address them lazily if we encounter them.

I have looked into quite a few clusters, and I haven't seen any errors that are incorrectly returned as Internal at this point (where the operation is not even initiated). I will look into some more clusters, but I have probably looked at 20 or more so far, and I haven't seen errors like that.

Also we may need to handle uncastable error (i.e errors.As cannot cast to gce.UnsupportedDiskError) like the ones we saw in filestore.

I'm a little confused by this comment. Are you saying we could be missing some cases where we need to return an UnsupportedDiskError (which we do when the operation error has the UNSUPPORTED_OPERATION code)? This won't happen like in filestore because wrapOpErr receives an operation error of type computev1.OperationErrorErrors directly. That type doesn't implement the error interface, so errors.As cannot even be run on it.
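To make the point about the error interface concrete, here is a rough sketch using local stand-in types. `opErrorDetail` mimics a plain struct like computev1.OperationErrorErrors (no `Error()` method), `unsupportedDiskError` mimics the driver's UnsupportedDiskError, and `wrapDetail`, `withContext`, and `isUnsupportedDisk` are hypothetical helpers; none of this is the actual wrapOpErr implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// opErrorDetail is a stand-in for a plain struct with no Error() method.
// Because it does not implement error, it cannot be passed to errors.As;
// it has to be inspected by field and converted explicitly.
type opErrorDetail struct {
	Code    string
	Message string
}

// unsupportedDiskError is a stand-in typed error that does implement error.
type unsupportedDiskError struct {
	Message string
}

func (e *unsupportedDiskError) Error() string { return e.Message }

// wrapDetail converts the plain struct into an error by checking its Code
// field directly, so no type assertion on the input is involved.
func wrapDetail(opName string, d *opErrorDetail) error {
	if d == nil {
		return nil
	}
	if d.Code == "UNSUPPORTED_OPERATION" {
		return &unsupportedDiskError{Message: d.Message}
	}
	return fmt.Errorf("operation %s failed (%s): %s", opName, d.Code, d.Message)
}

// withContext adds caller context while keeping the original error reachable
// through the %w wrapping chain.
func withContext(err error) error {
	return fmt.Errorf("attach failed: %w", err)
}

// isUnsupportedDisk reports whether err, or anything it wraps, is the typed error.
func isUnsupportedDisk(err error) bool {
	var target *unsupportedDiskError
	return errors.As(err, &target)
}
```

In this sketch errors.As only becomes relevant downstream, on the error value that the wrapper returns; the decision about which error to produce is made from the `Code` string alone.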

Ack. I missed the OperationErrorErrors type.

And yes, for unsupported operations, I think we would return Internal for the UnsupportedDiskError. Until we repro this scenario, it would be difficult to handle this error, so that would be a follow-up to understand the scenario better.

As discussed, let's capture the known issues in some task:

AttachDisk errors not handled by errors.As() code

InsertDisk, DeleteDisk, DetachDisk errors are not handled.

saikat-royc · 2023-07-11T20:19:11Z

/lgtm

amacaskill · 2023-07-11T22:53:41Z

/cherry-pick release-1.10

k8s-infra-cherrypick-robot · 2023-07-11T22:54:22Z

@amacaskill: new pull request created: #1292

In response to this:

/cherry-pick release-1.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

judemars · 2023-08-02T20:51:30Z

/cherry-pick release-1.9

k8s-infra-cherrypick-robot · 2023-08-02T20:52:08Z

@judemars: new pull request created: #1322

In response to this:

/cherry-pick release-1.9


amacaskill · 2023-10-31T01:16:55Z

/cherry-pick release-1.8

k8s-infra-cherrypick-robot · 2023-10-31T01:17:32Z

@amacaskill: new pull request created: #1475

In response to this:

/cherry-pick release-1.8


amacaskill · 2023-10-31T01:19:09Z

/cherry-pick release-1.7

k8s-infra-cherrypick-robot · 2023-10-31T01:19:47Z

@amacaskill: new pull request created: #1476

In response to this:

/cherry-pick release-1.7


k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 14, 2023

k8s-ci-robot requested review from leiyiz and mattcary June 14, 2023 17:03

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 14, 2023

k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jun 14, 2023

k8s-ci-robot assigned saikat-royc Jun 14, 2023

amacaskill changed the title ~~Filter out GCE operation error codes caused by user errors~~ [WIP] Filter out GCE operation error codes caused by user errors Jun 14, 2023

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 14, 2023

amacaskill force-pushed the op-error-codes branch from 2f66350 to eeb01ef Compare June 14, 2023 17:24

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 14, 2023

amacaskill changed the title ~~[WIP] Filter out GCE operation error codes caused by user errors~~ Filter out GCE operation error codes caused by user errors Jun 14, 2023

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 14, 2023

amacaskill force-pushed the op-error-codes branch from eeb01ef to c87159c Compare June 14, 2023 17:38

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 14, 2023

amacaskill force-pushed the op-error-codes branch from c87159c to dbc1c43 Compare June 14, 2023 17:53

filter out gce operation error codes caused by user errors

c0c3af1

amacaskill force-pushed the op-error-codes branch from dbc1c43 to c0c3af1 Compare June 14, 2023 18:03

saikat-royc reviewed Jun 14, 2023


amacaskill requested a review from saikat-royc June 20, 2023 16:09

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 11, 2023

k8s-ci-robot merged commit d14b457 into kubernetes-sigs:master Jul 11, 2023

k8s-infra-cherrypick-robot mentioned this pull request Jul 11, 2023

[release-1.10] Filter out GCE operation error codes caused by user errors #1292

Merged

k8s-infra-cherrypick-robot mentioned this pull request Aug 2, 2023

[release-1.9] Filter out GCE operation error codes caused by user errors #1322

Merged

amacaskill mentioned this pull request Aug 10, 2023

Always call LoggedError for errors returned from CloudProvider methods #1338

Merged

k8s-infra-cherrypick-robot mentioned this pull request Oct 31, 2023

[release-1.8] Filter out GCE operation error codes caused by user errors #1475

Merged

k8s-infra-cherrypick-robot mentioned this pull request Oct 31, 2023

[release-1.7] Filter out GCE operation error codes caused by user errors #1476

Merged
