Sanity tests do not work with the PDCSI driver attach/detach behavior change #990

Closed
saikat-royc opened this issue May 18, 2022 · 33 comments · Fixed by #1087
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@saikat-royc
Member

saikat-royc commented May 18, 2022

#988 introduces a behavior change in the PDCSI driver's controller publish/unpublish calls. If any error is encountered in the controller publish or unpublish path, the driver marks the target node with a backoff condition. The implication is that, until the backoff window has expired, all ControllerPublish/Unpublish calls for the target node are denied outright by the driver.
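As a rough illustration of the gating this introduces (the names, keying, and durations below are simplified placeholders, not the driver's actual implementation):

```go
// Simplified sketch of the backoff gating described above.
package main

import (
	"fmt"
	"sync"
	"time"
)

type nodeBackoff struct {
	mu    sync.Mutex
	until map[string]time.Time // node ID -> time the backoff condition expires
}

// recordFailure marks the node after a failed ControllerPublish/Unpublish.
func (b *nodeBackoff) recordFailure(nodeID string, window time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.until[nodeID] = time.Now().Add(window)
}

// blocked reports whether new ControllerPublish/Unpublish calls for the node
// should be rejected immediately instead of reaching the GCE API.
func (b *nodeBackoff) blocked(nodeID string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().Before(b.until[nodeID])
}

func main() {
	b := &nodeBackoff{until: map[string]time.Time{}}
	node := "projects/test-project/zones/country-region-zone/instances/test-name"
	b.recordFailure(node, 5*time.Minute)
	fmt.Println(b.blocked(node)) // true until the window expires
}
```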

The sanity tests run a series of negative/positive test cases against the same underlying driver. This means that if a negative controller publish or unpublish test case runs first (e.g. ControllerPublishVolume should fail when the volume is already published but is incompatible, code), that failure causes the driver to mark the target node with a backoff condition. This in turn causes the next positive controller publish test (Controller Service [Controller Server] should work, code) to fail with the error:

rpc error: code = Unavailable desc = ControllerPublish not permitted on node "projects/test-project/zones/country-region-zone/instances/test-name" due to backoff condition

The test cases can therefore fail randomly depending on the order in which they run.

See a failed run here

@saikat-royc
Member Author

fyi @mattcary

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 16, 2022
@mattcary
Contributor

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 16, 2022
@judemars
Contributor

I noticed there are a few more failures now, so I am going to pick this up and try to fix them:

[Fail] Controller Service [Controller Server] [AfterEach] ControllerPublishVolume should fail when the volume is already published but is incompatible
/usr/local/google/home/judemars/go/src/sigs.k8s.io/gcp-compute-persistent-disk-csi-driver/vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113

[Fail] Node Service NodeUnpublishVolume [It] should remove target path
/usr/local/google/home/judemars/go/src/sigs.k8s.io/gcp-compute-persistent-disk-csi-driver/vendor/github.com/kubernetes-csi/csi-test/v4/pkg/sanity/node.go:250

[Fail] Node Service NodeGetVolumeStats [It] should fail when volume does not exist on the specified path
/usr/local/google/home/judemars/go/src/sigs.k8s.io/gcp-compute-persistent-disk-csi-driver/vendor/github.com/kubernetes-csi/csi-test/v4/pkg/sanity/node.go:250

[Fail] Node Service [It] should work
/usr/local/google/home/judemars/go/src/sigs.k8s.io/gcp-compute-persistent-disk-csi-driver/vendor/github.com/kubernetes-csi/csi-test/v4/pkg/sanity/node.go:140

@judemars
Contributor

judemars commented Nov 17, 2022

Okay, so #1083 fixes the last three of those failures, but the first one (which this bug was originally about) still exists.

Note that (as @saikat-royc pointed out to me) the backoff behavior changed from per node to per disk in #1028, so the order of tests should no longer be a problem.

So presumably the problem is coming from within the test case 👻 . This is my understanding of the scenario of the failing test:

  1. We create a volume -> succeeds
  2. We publish that volume -> succeeds
  3. We re-publish that volume, expecting an error because it was already published -> fails (and triggers backoff)
  4. Cleanup function tries to unpublish the volume, and this fails the test case with the error ControllerUnpublish not permitted on node "projects/test-project/zones/country-region-zone/instances/test-name" due to backoff condition, because of step 3.

@saikat-royc / @mattcary: Let me know if it sounds correct that step 3 would cause step 4 to fail, and if so, what y'all think we should do. Maybe the answer is to add an exception to backoff for unit tests?
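For reference, the four steps above roughly correspond to the following calls against the driver's controller service (a sketch using the CSI Go bindings; the IDs and request fields are placeholders, not what kubernetes-csi/csi-test actually sends):

```go
// Rough sketch of the failing test's four steps; not the csi-test code itself.
package sketch

import (
	"context"
	"fmt"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

const nodeID = "projects/test-project/zones/country-region-zone/instances/test-name"

func runScenario(ctx context.Context, c csi.ControllerClient) error {
	// Step 1: create a volume (the real request also carries volume capabilities).
	vol, err := c.CreateVolume(ctx, &csi.CreateVolumeRequest{Name: "sanity-vol"})
	if err != nil {
		return err
	}
	volID := vol.GetVolume().GetVolumeId()

	// Step 2: the first publish succeeds.
	if _, err := c.ControllerPublishVolume(ctx, &csi.ControllerPublishVolumeRequest{
		VolumeId: volID, NodeId: nodeID,
	}); err != nil {
		return err
	}

	// Step 3: re-publish with incompatible settings (readonly here). The error is
	// expected, but it also starts a backoff window for this node/disk.
	if _, err := c.ControllerPublishVolume(ctx, &csi.ControllerPublishVolumeRequest{
		VolumeId: volID, NodeId: nodeID, Readonly: true,
	}); err == nil {
		return fmt.Errorf("expected the incompatible publish to fail")
	}

	// Step 4: the cleanup's unpublish now hits the backoff condition and fails.
	_, err = c.ControllerUnpublishVolume(ctx, &csi.ControllerUnpublishVolumeRequest{
		VolumeId: volID, NodeId: nodeID,
	})
	return err
}
```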

@mattcary
Contributor

We have another issue open on the sanity tests from a little while ago---I think the basic problem is that the sanity tests are starting to test our mock more than our actual code. At the time I was thinking we would want to change the sanity tests to use an actual gcp client rather than a mock. Then all these problems go away, I think, even this third one? But I'm not sure.

@judemars
Contributor

Oh alrighty, I see #991. Do we want to abandon #1083 and work on that instead? Or go ahead with it, fix the other issue (the one this bug is about), and then add them as a PR gate so the tests don't get further broken while we work on #991?

Separately, I'm still curious about the best way forward to resolve the backoff problem in this test. I'm not convinced that the mock version of the device utils is to blame for that problem, since (if my theory is right above) the prior failure in the same test case is causing the backoff error later in the test.

@mattcary
Contributor

Ah, right you are about the backoff problem. Hmm.

Maybe the best way forward is this: could you summarize what the sanity tests do? Then maybe it will be clear what the assumptions are around mocks and around stateful behavior of the driver.

@judemars
Contributor

judemars commented Nov 17, 2022

I summarized the failing test in question above (see steps 1-4), but an explanation of all the tests is at https://github.com/kubernetes-csi/csi-test/blob/master/pkg/sanity/README.md. They are supplied by CSI itself (not in our repo) and verify our driver's basic adherence to the CSI spec.

@mattcary
Contributor

That page doesn't actually describe what's going on very well. What exactly is configured? Is a driver process spun up? Is there any organization to the sanity tests?

@judemars
Contributor

Ah sorry - ok here is where the tests are configured.

Looks like we mock out:

But these are real (using those faked dependencies):

  • identityService
  • nodeService
  • controllerService
  • gceDriver itself

Then here we set up a real gceDriver to run at the endpoint we pass into the test functions.
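In other words, the wiring looks roughly like this (a sketch assuming csi-test v4's sanity.NewTestConfig/sanity.Test entry points; the driver-startup helper is hypothetical):

```go
package sanitytest

import (
	"testing"

	"github.com/kubernetes-csi/csi-test/v4/pkg/sanity"
)

// startFakeBackedDriver stands in for the repo's actual setup code: it builds the
// real gceDriver (identity/node/controller services) on top of the faked
// dependencies and serves the CSI gRPC endpoint.
func startFakeBackedDriver(t *testing.T, endpoint string) {
	t.Helper()
	// ... construct gceDriver with the fakes and run it at endpoint here ...
}

func TestSanity(t *testing.T) {
	endpoint := "unix:/tmp/csi-sanity.sock"
	go startFakeBackedDriver(t, endpoint)

	config := sanity.NewTestConfig()
	config.Address = endpoint
	sanity.Test(t, config) // runs the shared CSI sanity suite against our driver
}
```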

For this test in particular, in terms of how much is mocked out, we are exercising this part of the real controller code that checks for incompatibility with existing disks, even though the specifics of the disks themselves are supplied by the mock.

The history of why these sanity tests were introduced isn't well documented, but from the name at least they seem intended as a lightweight check of the driver's various contracts w/ CSI. When you say replacing the mocks with "an actual gcp client", what exactly are you referring to - talking to the real GCE instance? It feels to me that if that's what we want to do, it may be better to just fold these into e2e tests rather than having a separate "sanity" test.

Lastly, I'm not sure what you were thinking re: organization, but there are different batches of tests defined by the files in https://github.com/kubernetes-csi/csi-test/tree/master/pkg/sanity.

@mattcary
Contributor

When you say replacing the mocks with "an actual gcp client", what exactly are you referring to - talking to the real GCE instance? It feels to me that if that's what we want to do, it may be better to just fold these into e2e tests rather than having a separate "sanity" test.

Yeah, that's what I was thinking. It may be that our e2e tests could be made more focused if they overlap with sanity.

I wonder if we could use BeforeEach/AfterEach stanzas to create a new driver before each test (or recreate it if it's in an error state, if creating one before every test is too expensive), to solve the backoff problem.
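Something along these lines, for instance (a sketch against the ginkgo v1 vendored here; the start/stop helpers are hypothetical):

```go
package sanitytest

import (
	. "github.com/onsi/ginkgo"
)

// Hypothetical helpers standing in for whatever starts and stops a test driver.
func startFreshDriver() (endpoint string) { return "unix:/tmp/csi-sanity.sock" }
func stopDriver(endpoint string)          {}

var _ = Describe("sanity", func() {
	var endpoint string

	// Recreate the driver before every spec so error/backoff state from one
	// spec cannot leak into the next.
	BeforeEach(func() {
		endpoint = startFreshDriver()
	})

	AfterEach(func() {
		stopDriver(endpoint)
	})
})
```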

@judemars
Contributor

I'm still not sure that would help this specific test failure since the backoff occurs (as expected) within the test, which affects the cleanup in a later part of the same test.

@mattcary
Contributor

Ah, darn.

@mattcary
Contributor

Hmm, they are run in different It()s, so I'd think there'd be some flavor of BeforeEach we could use? Is there a BeforeDescribe? (There doesn't seem to be, but maybe BeforeEach is sufficient?)

@judemars
Contributor

judemars commented Nov 18, 2022

They all run in the same It() for this case - this one. The problem of separate test cases affecting each other's backoff scenarios no longer occurs after your change in #1028. We only have one test (in one It()) that runs the 4 steps above and fails. And if I'm understanding backoff correctly, I think it's working as expected, since step 3 triggers backoff and causes step 4 to error.

@pwschuurman
Contributor

@saikat-royc Can we change the controller to not put the driver in the backoff condition if the error isn't a "rate limit exceeded" from the GCE API? From what I recall, we added node backoff logic in #847 due to VM queueing. I think that would fix the sanity tests, since we're not encountering that specific error here (it's a volume-already-published error).
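Something like this, roughly (a sketch of the classification only, not the driver's current error handling; the error reasons are the standard googleapi ones):

```go
// Sketch of the idea: only arm the backoff when the GCE API error looks like
// rate limiting / quota exhaustion, not for ordinary validation failures.
package sketch

import (
	"errors"
	"net/http"
	"strings"

	"google.golang.org/api/googleapi"
)

func isRateLimitError(err error) bool {
	if err == nil {
		return false
	}
	var gerr *googleapi.Error
	if errors.As(err, &gerr) {
		if gerr.Code == http.StatusTooManyRequests {
			return true
		}
		for _, e := range gerr.Errors {
			if e.Reason == "rateLimitExceeded" || e.Reason == "userRateLimitExceeded" {
				return true
			}
		}
	}
	// Fallback string match, since errors are sometimes wrapped as plain text.
	return strings.Contains(err.Error(), "rateLimitExceeded")
}
```

The controller would then only arm the backoff when isRateLimitError(err) returns true.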

@mattcary
Contributor

We also want to backoff if the detach fails due to the node not existing or something like that, right?

I think the assumption the sanity tests make that the tests are all independent is just wrong, and we should fix that---if need be by recreating the driver for each It() (which is what Ginkgo assumes, IIUC).

@judemars
Contributor

Re: my comment above, the problem is happening within one test, one It(). For the one failing test, within one It() we do:

  1. We create a volume -> succeeds
  2. We publish that volume -> succeeds
  3. We re-publish that volume, expecting an error because it was already published -> fails (and triggers backoff)
  4. Cleanup function tries to unpublish the volume, and this fails the test case with the error ControllerUnpublish not permitted on node "projects/test-project/zones/country-region-zone/instances/test-name" due to backoff condition, because of step 3.

@judemars
Contributor

Another possible solution here might be to make the backoff conditions RPC-specific, so that a failed publish request would not block a subsequent unpublish request, which is what seems to be happening here.
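Concretely, that could mean keying the backoff on the RPC as well (sketch only; today's keying is per node/disk):

```go
package sketch

// backoffKey keys the backoff on (node, volume, RPC) instead of just
// (node, volume), so a failed ControllerPublish would not block the
// ControllerUnpublish issued by the test's cleanup.
type backoffKey struct {
	nodeID   string
	volumeID string
	rpc      string // e.g. "ControllerPublishVolume" or "ControllerUnpublishVolume"
}

func publishKey(node, vol string) backoffKey {
	return backoffKey{nodeID: node, volumeID: vol, rpc: "ControllerPublishVolume"}
}

func unpublishKey(node, vol string) backoffKey {
	return backoffKey{nodeID: node, volumeID: vol, rpc: "ControllerUnpublishVolume"}
}
```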

@pwschuurman
Contributor

We also want to backoff if the detach fails due to the node not existing or something like that, right?

@mattcary Is this just to preemptively get ahead of quota issues before they happen? (e.g. if a node doesn't exist, presumably we wouldn't want to keep checking and hit the 60s user quota limits for the GCE API)

@mattcary
Contributor

@judemars in that test you reference, it looks like it only fails at the end of the It()?

@mattcary
Contributor

@pwschuurman There's only one quota for all API calls, so if we wait until the detach calls hit quota, we'll have other operations unnecessarily fail.

Also, given the problems we have with arcus queuing generally, I think that making an effort not to make the problem worse will improve stability.

I don't think we should change the driver behavior due to limitations in our testing infrastructure.

@judemars
Contributor

judemars commented Nov 21, 2022

Correct, it fails with a backoff error in the cleanup function at the end of the It(), since a previous call to publish a volume within the same test case (still within the same It()) failed. And that previous failure was expected, since it is exactly what the test case is testing: "should fail when the volume is already published but is incompatible".

@mattcary
Contributor

@judemars Making them RPC-specific might be a solution, but since quotas are counted over all RPCs, I think we'd get complicated quickly trying to track that.

We've already had churn going from per-node backoff to per-node/per-volume backoff. As far as we know, the current system works well in practice, and as mentioned above I'm hesitant to complicate the driver behavior in order to make the tests run better. It seems like a much better tradeoff to complicate the tests, since if we get that wrong it's much easier to fix.

@mattcary
Contributor

@judemars Oh, interesting, I didn't realize it was the cleanup function that was failing.

In many of the other tests---the GKE guitar ones in particular---we make errors in the cleanup functions informational only, and they don't fail the test. Maybe that would be a better way to do it?

@judemars
Contributor

Wouldn't that lead to leaking GCE resources, though, if we moved to a real GCE instance backing these tests? Making cleanup errors informational might be fine if these tests were specific to our driver, but they are actually shared by all CSI drivers, so I would guess there's a higher bar for cleanliness.

@mattcary
Contributor

Fair point.

If we retried on the cleanup, would that fix it?

@judemars
Contributor

judemars commented Nov 21, 2022

Ah yes, it looks like if we retried after 200ms it would work. Maybe another option, which would let us avoid inserting a retry-with-wait into the shared tests, is to set the initial backoff period much lower for the sanity tests, so that by the time the cleanup happens the backoff window has already expired?

That's slightly non-ideal, though, since it potentially leaves all backoff functionality untested in these tests. I can also see if there is a way to set it low for just this one test.
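For what it's worth, making the duration injectable could look something like this (a sketch assuming the backoff is built on client-go's flowcontrol.Backoff; names are illustrative):

```go
package sketch

import (
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

type errorBackoff struct {
	backoff *flowcontrol.Backoff
}

// newErrorBackoff lets the caller pick the initial/max backoff windows: the
// sanity test could pass millisecond values so cleanup isn't blocked, while
// production keeps the normal durations.
func newErrorBackoff(initial, max time.Duration) *errorBackoff {
	return &errorBackoff{backoff: flowcontrol.NewBackOff(initial, max)}
}

// blocked reports whether the key is still inside its backoff window.
func (e *errorBackoff) blocked(key string) bool {
	return e.backoff.IsInBackOffSinceUpdate(key, time.Now())
}

// next records a failure and extends the backoff window for the key.
func (e *errorBackoff) next(key string) {
	e.backoff.Next(key, time.Now())
}
```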

@mattcary
Contributor

mattcary commented Nov 21, 2022

That could be a good idea.

Since the backoff appears to break the sanity tests' assumptions about how the driver works, maybe it's okay to set the threshold very low. It seems like the only thing backoff can do here is produce a false-positive failure.

(Assuming we can set the backoff low enough that the cleanup isn't flaky.)

@judemars
Contributor

judemars commented Nov 21, 2022

Alrighty I added #1087 for lowering the backoff duration for sanity tests.

Note I also mocked up what it would look like to have per-RPC backoff conditions in #1089, to see how big of a change it would be, and it doesn't look like too much. I personally prefer that approach over #1087, since it doesn't make the test and real behaviors of the driver diverge at all, but it's definitely true that I don't know the history behind backoff, so I'll defer to your decision.

@saikat-royc
Member Author

+1 to #1087 (I put my comments in the PR). I feel #1089 is a large hammer that isn't needed at this point in time.

@judemars
Contributor

Alrighty, abandoned #1089 in favor of #1087.
