Skip to content

Separate user errors from internal errors #1092

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Dec 20, 2022

Conversation

sunnylovestiramisu
Copy link
Contributor

@sunnylovestiramisu sunnylovestiramisu commented Nov 22, 2022

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change

/kind bug

/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:
Exclude user errors for googleapis from codes.internal errors for better error categorization
Conversation from #1084

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
You are reviewing three different things:

  1. fmt.Errorf should use %w for errors
  2. logging should use %v for err, or err.Error() with err nil check
  3. simplifying status.Error to status.Errorf to avoid the Sprintf call

Does this PR introduce a user-facing change?:
NONE


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 22, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @sunnylovestiramisu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 22, 2022
@mattcary
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 22, 2022
@@ -212,7 +212,11 @@ func (gceCS *GCEControllerServer) CreateVolume(ctx context.Context, req *csi.Cre

volumeID, err := common.KeyToVolumeID(volKey, gceCS.CloudProvider.GetDefaultProject())
if err != nil {
return nil, status.Errorf(codes.Internal, "Failed to convert volume key to volume ID: %v", err)
klog.Errorf("Failed to convert volume key to volume ID: %w", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so much boilerplate for every error. It will be hard to be consistent about all of this in the future. Instead of IsUserError could we have something like CodeForError so the whole thing looks like the following?

if err != nil {
return nil, status.Errorf(CodeForError(err), "Failed to convert volume key to volume ID: %v", err)
}

That does mean that we don't get the klog---but we weren't getting that before anyway. So maybe we could have
if err != nil {
return nil, LoggedError("Failed to convert volume key to volume ID: %w", err)
}

and then LoggedError will klog as well as produce the status.Errorf.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything in go that is similar to java's global exception handler style at all? https://stackoverflow.com/questions/48785878/spring-boot-intercept-all-exception-handlers

I looked at the codebase most of the error handling seems very boilerplate compared to the example in java.
Here we want to avoid printing error in %v so that the error object does not change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

golang error handling is crap as far as I can tell---in addition to all the boilerplate, go's aversion to stack traces mean you never really know where an error originally came from.

In particular I like doing a klog in addition to an error return, because the klog gives you generally better information about where the error came from---but the error return is necessary as well because that will get plumbed into an RPC reply which will give clues in the sidecars or kubelet logs.

Maybe if LoggedError(msg, err) looks like { klog(msg, err); return status.Errorf(code, err)}, and we use %v in msg, then it will be enough? We don't care about the %v conversion in the klog, as we only human-read that, it's the status.Errorf call that's important?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, if we are passing the error as a return value, I prefer to leave it up to the caller to log and only log if we are going to swallow an error and not pass it up. That helps avoid duplicate logging of errors at multiple levels.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I've had more problems with seeing an error up high in the stack, and not being able to trace back to where it was emitted, than any ambiguities around duplicated messages in the logs. That may be the style in go but if it's possible to push back against it, I'd like to....

@judemars
Copy link
Contributor

Hey there! Just letting you know that I have just fixed and re-enabled the sanity tests for PD in go/pdcsi-oss-driver/issues/990 so if you experience failures in your sanity test PR gate build because you're on an old branch before the tests were fixed, merging the latest commits from master into this branch it should make them pass. Let me know if you have any questions!

@sunnylovestiramisu sunnylovestiramisu force-pushed the statusError branch 4 times, most recently from 716b1da to 1809ff1 Compare December 1, 2022 16:50
@sunnylovestiramisu
Copy link
Contributor Author

/retest

@judemars
Copy link
Contributor

judemars commented Dec 1, 2022

/retest pull-gcp-compute-persistent-disk-csi-driver-sanity

I see the fix is in your history of the branch so this is concerning. The error is coming from the backoff condition threshold so wondering if 1ms could cause flakes. Retest should let us know.

@k8s-ci-robot
Copy link
Contributor

@judemars: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test pull-gcp-compute-persistent-disk-csi-driver-e2e
  • /test pull-gcp-compute-persistent-disk-csi-driver-kubernetes-integration
  • /test pull-gcp-compute-persistent-disk-csi-driver-sanity
  • /test pull-gcp-compute-persistent-disk-csi-driver-unit
  • /test pull-gcp-compute-persistent-disk-csi-driver-verify

The following commands are available to trigger optional jobs:

  • /test pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019

Use /test all to run the following jobs that were automatically triggered:

  • pull-gcp-compute-persistent-disk-csi-driver-e2e
  • pull-gcp-compute-persistent-disk-csi-driver-kubernetes-integration
  • pull-gcp-compute-persistent-disk-csi-driver-sanity
  • pull-gcp-compute-persistent-disk-csi-driver-unit
  • pull-gcp-compute-persistent-disk-csi-driver-verify

In response to this:

/retest pull-gcp-compute-persistent-disk-csi-driver-sanity

I see the fix is in your history of the branch so this is concerning. The error is coming from the backoff condition threshold so wondering if 1ms could cause flakes. Retest should let us know.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@judemars
Copy link
Contributor

judemars commented Dec 1, 2022

Woops:

/test pull-gcp-compute-persistent-disk-csi-driver-sanity

@judemars
Copy link
Contributor

judemars commented Dec 1, 2022

Failed again. Will try again:

/test pull-gcp-compute-persistent-disk-csi-driver-sanity

@judemars
Copy link
Contributor

judemars commented Dec 1, 2022

kubernetes/test-infra#28158 should disable the sanity PR gate, sorry for the hassle!

@sunnylovestiramisu
Copy link
Contributor Author

/assign @mattcary

@sunnylovestiramisu sunnylovestiramisu marked this pull request as ready for review December 1, 2022 19:02
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 1, 2022
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 19, 2022
*/
return true
if err != nil {
klog.V(4).Infof("NodePublishVolume check volume path %s is mounted %t: error %v", path, !notMnt, err.Error())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do

errInfo = "nil"
if err != nil {
    errInfo = err.Error()
}
klog.V(4).Infof("NodePublishVolume check volume path %s is mounted %t: error %v", path, !notMnt, errInfo)

to log everything?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do a %v on err does that print out the same thing as err.Error()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we print out %v the error cannot get unwrap properly anymore. That is what we are trying to solve?

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 20, 2022
}
default:
return nil, status.Error(codes.InvalidArgument, fmt.Sprintf("CreateVolume replication type '%s' is not supported", params.ReplicationType))
}

ready, err := isDiskReady(disk)
if err != nil {
return nil, status.Error(codes.Internal, fmt.Sprintf("CreateVolume disk %v had error checking ready status: %v", volKey, err))
return nil, status.Error(codes.Internal, fmt.Sprintf("CreateVolume disk %v had error checking ready status: %v", volKey, err.Error()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we LoggedError here too?

Copy link
Contributor Author

@sunnylovestiramisu sunnylovestiramisu Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isDiskReady does not throw user error we defined in this pr:

func isDiskReady(disk *gce.CloudDisk) (bool, error) {
	status := disk.GetStatus()
	switch status {
	case "READY":
		return true, nil
	case "FAILED":
		return false, fmt.Errorf("Disk %s status is FAILED", disk.GetName())
	case "CREATING":
		klog.V(4).Infof("Disk %s status is CREATING", disk.GetName())
		return false, nil
	case "DELETING":
		klog.V(4).Infof("Disk %s status is DELETING", disk.GetName())
		return false, nil
	case "RESTORING":
		klog.V(4).Infof("Disk %s status is RESTORING", disk.GetName())
		return false, nil
	default:
		klog.V(4).Infof("Disk %s status is: %s", disk.GetName(), status)
	}

	return false, nil
}

Do we still want to use the LoggedError(which is logging error for user error case)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, then you're right it's not necessary.

It would be nice to have a more uniform location to log errors so its clear what is available for debugging. But that's out of scope here.

}
return nil, status.Errorf(codes.Internal, "ControllerUnpublishVolume error repairing underspecified volume key: %v", err)
return nil, LoggedError("ControllerUnpublishVolume error repairing underspecified volume key: ", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we logging only here, and not when the volume is not found? Probably we should log everything (?)

Copy link
Contributor Author

@sunnylovestiramisu sunnylovestiramisu Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add logging for notFound. I was trying to keep the pr focused. Initially this pr is just fix unwrap error. But at this point it is a blend of adding logs + fixing error unwrap + fixing status.Error to status.Errorf(?).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, fair enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay i added the change for fixing status.Error to status.Errorf as well in this pr but committed it in a second commit. It looks much simpler :D

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 20, 2022
@mattcary
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 20, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattcary, sunnylovestiramisu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 20, 2022
@sunnylovestiramisu
Copy link
Contributor Author

/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Dec 20, 2022
@k8s-ci-robot k8s-ci-robot merged commit 436d145 into kubernetes-sigs:master Dec 20, 2022
@sunnylovestiramisu sunnylovestiramisu deleted the statusError branch December 28, 2022 18:54
k8s-ci-robot added a commit that referenced this pull request Jan 23, 2023
…k-of-#1073-#1084-#1092-upstream-release-1.8

Automated cherry pick of #1073: Add debugging log for the mapping of a PD name to /dev/* path
#1084: Upgrade klog v1 to v2 and fix error wrapping
#1092: Separate user errors from internal errors
k8s-ci-robot added a commit that referenced this pull request May 10, 2023
…k-of-#1092-upstream-release-1.7

Automated cherry pick of #1092: Separate user errors from internal errors
@judemars
Copy link
Contributor

/cherry-pick release-1.10

@k8s-infra-cherrypick-robot

@judemars: #1092 failed to apply on top of branch "release-1.10":

Applying: Separate user errors from internal errors
Using index info to reconstruct a base tree...
M	cmd/gce-pd-csi-driver/main.go
M	pkg/deviceutils/device-utils.go
M	pkg/gce-cloud-provider/compute/fake-gce.go
M	pkg/gce-cloud-provider/compute/gce-compute.go
M	pkg/gce-cloud-provider/compute/gce.go
M	pkg/gce-cloud-provider/metadata/metadata.go
M	pkg/gce-pd-csi-driver/controller.go
M	pkg/gce-pd-csi-driver/node.go
M	pkg/gce-pd-csi-driver/server.go
M	pkg/gce-pd-csi-driver/utils.go
M	pkg/gce-pd-csi-driver/utils_linux.go
M	pkg/gce-pd-csi-driver/utils_windows.go
M	pkg/metrics/metrics.go
M	pkg/mount-manager/safe-mounter_windows.go
M	pkg/mount-manager/statter_linux.go
M	test/k8s-integration/cluster.go
M	test/remote/ssh.go
M	test/sanity/sanity_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/remote/ssh.go
Auto-merging test/k8s-integration/cluster.go
Auto-merging pkg/mount-manager/safe-mounter_windows.go
Auto-merging pkg/metrics/metrics.go
Auto-merging pkg/gce-pd-csi-driver/utils.go
Auto-merging pkg/gce-pd-csi-driver/controller.go
CONFLICT (content): Merge conflict in pkg/gce-pd-csi-driver/controller.go
Auto-merging pkg/gce-cloud-provider/compute/gce.go
Auto-merging pkg/gce-cloud-provider/compute/gce-compute.go
Auto-merging pkg/gce-cloud-provider/compute/fake-gce.go
Auto-merging pkg/deviceutils/device-utils.go
CONFLICT (content): Merge conflict in pkg/deviceutils/device-utils.go
Auto-merging cmd/gce-pd-csi-driver/main.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Separate user errors from internal errors
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants