Skip to content

Filter out GCE operation error codes caused by user errors #1261

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 11, 2023

Conversation

amacaskill
Copy link
Member

@amacaskill amacaskill commented Jun 14, 2023

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug

/kind cleanup

/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:
Filter out GCE operation error codes caused by user errors. We are getting lots of errors counting against our SLO for GCE operation errors that are user errors. We have seen Internal errors returned from the following GCE operation error codes: OPERATION_CANCELED_BY_USER, RESOURCE_NOT_FOUND, RESOURCE_IN_USE_BY_ANOTHER_RESOURCE, INVALID_USAGE. I added a filtered out other error codes (quota error codes) which we haven't observed yet, but those are clearly user error codes. The errors that we have seen so far look like this:

ControllerPublishVolume returned with error: rpc error: code = Internal desc = unknown Attach error: failed when waiting for zonal op: operation operation-xxx failed (OPERATION_CANCELED_BY_USER): Operation was canceled by user ''.

ControllerPublishVolume returned with error: rpc error: code = Internal desc = unknown Attach error: failed when waiting for zonal op: operation operation-xxx failed (RESOURCE_NOT_FOUND): The resource 'projects/xxx/zones/xxx/instances/xxx' was not found

ControllerPublishVolume returned with error: rpc error: code = Internal desc = unknown Attach error: failed when waiting for zonal op: operation operation-xxx failed (RESOURCE_IN_USE_BY_ANOTHER_RESOURCE): The disk resource 'projects/xxx/zones/xxx/disks/xxx' is already being used by 'projects/xxx/zones/xxx/instances/xxx'

ControllerUnpublishVolume returned with error: rpc error: code = Internal desc = unknown detach error: operation operation-xxx failed (INVALID_USAGE): No attached disk found with device name 'pvc-xxx'

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

None

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 14, 2023
@k8s-ci-robot k8s-ci-robot requested review from leiyiz and mattcary June 14, 2023 17:03
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amacaskill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 14, 2023
@amacaskill
Copy link
Member Author

/assign @saikat-royc

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jun 14, 2023
@amacaskill amacaskill changed the title Filter out GCE operation error codes caused by user errors [WIP] Filter out GCE operation error codes caused by user errors Jun 14, 2023
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 14, 2023
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 14, 2023
@amacaskill amacaskill changed the title [WIP] Filter out GCE operation error codes caused by user errors Filter out GCE operation error codes caused by user errors Jun 14, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 14, 2023
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 14, 2023
"RESOURCE_IN_USE_BY_ANOTHER_RESOURCE": codes.InvalidArgument,
"OPERATION_CANCELED_BY_USER": codes.Aborted,
"QUOTA_EXCEEDED": codes.ResourceExhausted,
"ZONE_RESOURCE_POOL_EXHAUSTED": codes.Unavailable,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use codes.ResourceExhausted for ZONE_RESOURCE_POOL_EXHAUSTED, ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these codes, it looks like GCE maps it to the Unavailable grpc error code, so that is why I chose Unavailable. That being said, we don't filter out Unavailable fromm our SLO, so if we don't want these, then I should use resource exhausted. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We looked at this for the CreateVolume SLO as well and ultimately decided we would want to get notified if users were hitting an irregular amount of ZONE_RESOURCE_POOL_EXHAUSTED errors since they're not under their control. So +1 to Unavailable since it allows us to track it (or not) separately from other user-caused ResourceExhausted errors.

Also it looks like this PR accomplishes what I was doing in #1263 (classifying QUOTA_EXCEEDED as ResourceExhausted) so I'll close that - thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. Please capture the reasoning in a comment in the code

@@ -856,14 +856,40 @@ func (cloud *CloudProvider) WaitForAttach(ctx context.Context, project string, v
}

func wrapOpErr(name string, opErr *computev1.OperationErrorErrors) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also what about error handling for cases where the operation is not even initiated, and we encounter error in say insertDisk, attachDisk where the WaitForOp is not called. e.g code

Also we may need to handle uncastable error (i.e errors.As cannot case to gce.UnsupportedDiskError) like the ones we saw in filestore. May need some manual testing here (1. insert disk when quota is exhausted, 2. attachdisk when disk already attached) or we can address them lazily if we encounter them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have looked into quiet a few clusters, and I haven't seen any errors that are incorrectly returned as an Internal error at this point ( where the operation is not even initiated). I will look into some more clusters, but I have probably looked into at least 20 at this point, and I haven't seen errors like that.

Also we may need to handle uncastable error (i.e errors.As cannot cast to gce.UnsupportedDiskError) like the ones we saw in filestore.

I'm a little confused by this comment. Are you saying we could be missing some cases where we need to return an UnsupportedDiskError (which we do when the operation error has the UNSUPPORTED_OPERATION code). This won't happen like in filestore because wrapOpErr passes in an operation error of type computev1.OperationErrorErrors directly. This error type doesn't actually implement the error interface, so this will never happen. (errors.As cannot even be run on the operation error of type computev1.OperationErrorErrors since it doesn't implement error interface).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. I missed the OperationErrorErrors type.

And yes, for unsupported operations, I think we would return kIternal for the UnsupportedDiskError. Until we repro this scenario, it would be difficult to handle this error. So that would be a followup for future to understand the sceario better.

As discussed, lets capture the known issues in some task:

  1. AttachDisk errors not handled by errors.As() code
  2. InsertDisk, DeleteDisk, DetachDisk errors are not handled.

@amacaskill amacaskill requested a review from saikat-royc June 20, 2023 16:09
@saikat-royc
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 11, 2023
@k8s-ci-robot k8s-ci-robot merged commit d14b457 into kubernetes-sigs:master Jul 11, 2023
@amacaskill
Copy link
Member Author

/cherry-pick release-1.10

@k8s-infra-cherrypick-robot

@amacaskill: new pull request created: #1292

In response to this:

/cherry-pick release-1.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@judemars
Copy link
Contributor

judemars commented Aug 2, 2023

/cherry-pick release-1.9

@k8s-infra-cherrypick-robot

@judemars: new pull request created: #1322

In response to this:

/cherry-pick release-1.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@amacaskill
Copy link
Member Author

/cherry-pick release-1.8

@k8s-infra-cherrypick-robot

@amacaskill: new pull request created: #1475

In response to this:

/cherry-pick release-1.8

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@amacaskill
Copy link
Member Author

/cherry-pick release-1.7

@k8s-infra-cherrypick-robot

@amacaskill: new pull request created: #1476

In response to this:

/cherry-pick release-1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants