Flaky unit test #1137

sunnylovestiramisu · 2023-02-15T01:19:15Z

Unit tests have been flaky: https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-gcp-compute-persistent-disk-csi-driver-unit

Here is an example log:

I0215 01:13:58.388] --- FAIL: TestCreateVolumeWithVolumeSourceFromVolume (0.00s)
I0215 01:13:58.388]     controller_test.go:1266: test case: success zonal disk clone of zonal source disk
I0215 01:13:58.389]     controller_test.go:1305: response: <nil> err: rpc error: code = InvalidArgument desc = CreateVolume disk zone country-region-zone does not match source volume zone country-region-fakesecondzone
I0215 01:13:58.389]     controller_test.go:1312: Expected error code: OK, got: InvalidArgument. err : rpc error: code = InvalidArgument desc = CreateVolume disk zone country-region-zone does not match source volume zone country-region-fakesecondzone

The text was updated successfully, but these errors were encountered:

saikat-royc · 2023-02-16T00:24:25Z

this test fails https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blame/master/pkg/gce-pd-csi-driver/controller_test.go#L1158 when the random logic picks a different source zone (https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/pkg/gce-pd-csi-driver/controller.go#L1387) from the requisite list https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/pkg/gce-pd-csi-driver/controller_test.go#L1135 passed in the test case. https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/pkg/gce-pd-csi-driver/controller_test.go#L1135

saikat-royc · 2023-02-16T00:25:22Z

/cc @amacaskill

amacaskill · 2023-02-16T01:51:17Z

For zonal->zonal cloning, the zone needs to match, so I think pickZones](

gcp-compute-persistent-disk-csi-driver/pkg/gce-pd-csi-driver/controller.go

Line 1433 in 0e7a44f

zones, err = pickZonesFromTopology(top, numZones)

should be based a refined topology that only includes the required zone if we are doing volume cloning zonally or required region.

I think the zonal test could also be flakey if we had different test cases, so this change is needed for regional and zonal.

The problem is that the volKey is calculated before we determine we are creating the volume with cloning. I think we should check if we are using volume cloning, and what the source volume replication type / zone is before we calculate the volKey. That way, we can do a different calculation of volKey that will follow [volume cloning restrictions](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/volume-cloning#limitations restrictions if we are using cloning.

amacaskill · 2023-02-23T01:51:55Z

When I initially looked at this issue, I thought the volume cloning implementation was wrong which is what made the test flakey, but I think the test is actually working as expected and the parameters of the test are wrong. The request.AccessibilityRequirements is created in the external provisioner in GenerateAccessibilityRequirements. GenerateAccessibilityRequirements will fill out the Requisite and Preferred Topology Requirements according to this table.

If the selectedNode param passed to GenerateAccessibilityRequirements is set, then this is Immediate binding mode. If the selectedNode is set, then this is WaitForFirstConsumer binding mode. This selectedNode param is not passed to the CreateVolume call, only the Topology requirements with the requisite and preferred filled out.

For volume cloning in immediate binding mode, users need to set allowedTopologies in their StorageClass to make sure that the correct requisite/preferred topologies get passes to the CSIDriver. If there is some signal within the CreateVolumeRequest that can tell us if the CreateVolume request is for a PVC with Immediate binding mode (and not WaitForFirstConsumer), then we could use the volume cloning fix that I described in my previous commentI. However, I don't think such a signal exists, so I don't think we can do anything here.

The fix I made will definitely not work for WaitForFirstConsumer because we need to respect the zone the pod is scheduled to. Therefore, users either need to set node/pod affinity or set allowedTopologies in their StorageClass to ensure they will not get this error.

So in summary:

If there is a signal within the CreateVolumeRequest that says this Volume should be created with Immediate binding mode, then I can make this change for those cases only.
I will need to change the topology requirements in the CreateVolumeRequest for the volume cloning unit test for the WaitForFirstConsumer (and maybe Immediate) binding mode CreateVolume calls to make it less flakey.

saikat-royc assigned amacaskill Feb 16, 2023

amacaskill mentioned this issue Feb 17, 2023

satisfy volume cloning topology requirements when choosing zone for CreateVolume #1150

Merged

k8s-ci-robot closed this as completed in #1150 Mar 17, 2023

sunnylovestiramisu mentioned this issue Jul 19, 2023

[Release-1.7] cherry pick of #1150 satisfy volume cloning topology requirements when choosing zone for CreateVolume #1301

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flaky unit test #1137

Flaky unit test #1137

sunnylovestiramisu commented Feb 15, 2023

saikat-royc commented Feb 16, 2023 •

edited

Loading

saikat-royc commented Feb 16, 2023

amacaskill commented Feb 16, 2023

amacaskill commented Feb 23, 2023

Flaky unit test #1137

Flaky unit test #1137

Comments

sunnylovestiramisu commented Feb 15, 2023

saikat-royc commented Feb 16, 2023 • edited Loading

saikat-royc commented Feb 16, 2023

amacaskill commented Feb 16, 2023

amacaskill commented Feb 23, 2023

saikat-royc commented Feb 16, 2023 •

edited

Loading