Skip to content

Flaky unit test #1137

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
sunnylovestiramisu opened this issue Feb 15, 2023 · 4 comments · Fixed by #1150
Closed

Flaky unit test #1137

sunnylovestiramisu opened this issue Feb 15, 2023 · 4 comments · Fixed by #1150
Assignees

Comments

@sunnylovestiramisu
Copy link
Contributor

Unit tests have been flaky: https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-gcp-compute-persistent-disk-csi-driver-unit

Here is an example log:

I0215 01:13:58.388] --- FAIL: TestCreateVolumeWithVolumeSourceFromVolume (0.00s)
I0215 01:13:58.388]     controller_test.go:1266: test case: success zonal disk clone of zonal source disk
I0215 01:13:58.389]     controller_test.go:1305: response: <nil> err: rpc error: code = InvalidArgument desc = CreateVolume disk zone country-region-zone does not match source volume zone country-region-fakesecondzone
I0215 01:13:58.389]     controller_test.go:1312: Expected error code: OK, got: InvalidArgument. err : rpc error: code = InvalidArgument desc = CreateVolume disk zone country-region-zone does not match source volume zone country-region-fakesecondzone
@saikat-royc
Copy link
Member

/cc @amacaskill

@amacaskill
Copy link
Member

For zonal->zonal cloning, the zone needs to match, so I think pickZones](

zones, err = pickZonesFromTopology(top, numZones)
should be based a refined topology that only includes the required zone if we are doing volume cloning zonally or required region.

I think the zonal test could also be flakey if we had different test cases, so this change is needed for regional and zonal.

The problem is that the volKey is calculated before we determine we are creating the volume with cloning. I think we should check if we are using volume cloning, and what the source volume replication type / zone is before we calculate the volKey. That way, we can do a different calculation of volKey that will follow [volume cloning restrictions](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/volume-cloning#limitations restrictions if we are using cloning.

@amacaskill
Copy link
Member

When I initially looked at this issue, I thought the volume cloning implementation was wrong which is what made the test flakey, but I think the test is actually working as expected and the parameters of the test are wrong. The request.AccessibilityRequirements is created in the external provisioner in GenerateAccessibilityRequirements. GenerateAccessibilityRequirements will fill out the Requisite and Preferred Topology Requirements according to this table.

If the selectedNode param passed to GenerateAccessibilityRequirements is set, then this is Immediate binding mode. If the selectedNode is set, then this is WaitForFirstConsumer binding mode. This selectedNode param is not passed to the CreateVolume call, only the Topology requirements with the requisite and preferred filled out.

For volume cloning in immediate binding mode, users need to set allowedTopologies in their StorageClass to make sure that the correct requisite/preferred topologies get passes to the CSIDriver. If there is some signal within the CreateVolumeRequest that can tell us if the CreateVolume request is for a PVC with Immediate binding mode (and not WaitForFirstConsumer), then we could use the volume cloning fix that I described in my previous commentI. However, I don't think such a signal exists, so I don't think we can do anything here.

The fix I made will definitely not work for WaitForFirstConsumer because we need to respect the zone the pod is scheduled to. Therefore, users either need to set node/pod affinity or set allowedTopologies in their StorageClass to ensure they will not get this error.

So in summary:

  1. If there is a signal within the CreateVolumeRequest that says this Volume should be created with Immediate binding mode, then I can make this change for those cases only.
  2. I will need to change the topology requirements in the CreateVolumeRequest for the volume cloning unit test for the WaitForFirstConsumer (and maybe Immediate) binding mode CreateVolume calls to make it less flakey.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants