
Bug in NodeExpandVolume #433


Closed
gnufied opened this issue Nov 25, 2019 · 16 comments · Fixed by #462

@gnufied

gnufied commented Nov 25, 2019

It looks like we may have a bug in the CSI NodeExpandVolume implementation:

FailedMount: MountVolume.MountDevice failed while expanding volume for volume "pvc-229df8db-a895-43b9-8e0b-dafa5bdf19b2" : MountVolume.NodeExpandVolume failed : Expander.NodeExpand failed to expand the volume : rpc error: code = Internal desc = ControllerExpandVolume resize requested for 6917529027641081856 but after resize volume was size 6442450944

Failure logs - https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-cos-k8sbeta-serial/1198911611914424322/build-log.txt

cc @msau42

@msau42
Contributor

msau42 commented Nov 25, 2019

/assign @davidz627

@davidz627
Contributor

The failing test is: "Verify if offline PVC expansion works"

It looks like every test requests a resize to 6442450944 bytes (5Gi + 1Gi = 6Gi), which is also the size the disk ended up at after ControllerExpandVolume. For some reason, in this one run the disk was resized to 6442450944 on the controller side (correctly), but the node resize then requested 6917529027641081856 (incorrectly), which wasn't applied and failed validation during NodeExpandVolume.

This looks like some sort of error in either k8s or the external-resizer that is garbling the NodeExpandVolume request capacity. The PD driver does nothing special with the request capacity; it simply compares it with the device block size and prints it out in a log.
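
For reference, that node-side check amounts to comparing the requested bytes against the actual block-device size after the resize. A minimal sketch in Go (function and helper names here are illustrative, not the PD driver's actual code):

// Illustrative sketch only; names are hypothetical, not the PD driver's code.
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// validateNodeExpansion mimics the kind of check NodeExpandVolume performs:
// compare the block device's current size against the requested bytes.
func validateNodeExpansion(requestedBytes, actualBytes int64) error {
	if actualBytes < requestedBytes {
		// This is the branch the failing test hits: the request carried
		// 6917529027641081856 (6Ei) while the device was 6442450944 (6Gi).
		return status.Errorf(codes.Internal,
			"ControllerExpandVolume resize requested for %v but after resize volume was size %v",
			requestedBytes, actualBytes)
	}
	return nil
}

func main() {
	fmt.Println(validateNodeExpansion(6917529027641081856, 6442450944))
}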

@davidz627
Contributor

davidz627 commented Feb 4, 2020

I've spelunked through the external-resizer and k8s code a bit and followed the path that requestBytes takes through the system. I couldn't see where the value could have been modified. Do you have a repro of this test so I can try running it myself? @gnufied

@msau42
Contributor

msau42 commented Feb 4, 2020

The test runs in OSS K8s. You can run the standard kubetest with a focus on it.

@davidz627
Contributor

VolumeResizeFailed: resize volume pvc-7a776dd5-6aa1-420f-89d7-39d53b4687ec failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded

This error is common to all the failed tests before the final failure. I inspected the context handling and error handling around ControllerExpandVolume and can't find any obvious issues. My next step is to see if I can reproduce this error more reliably by running a modified resizer with a shorter context deadline.
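
As background on why a shorter deadline matters: the per-call deadline only cancels the RPC from the caller's side, so a resize the driver has already started in the cloud can still finish after the call returns DeadlineExceeded. A minimal sketch of such a call, assuming a CSI controller client reached over gRPC (the socket path, volume ID, and helper name are hypothetical):

package main

import (
	"context"
	"log"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

func expandWithTimeout(conn *grpc.ClientConn, volumeID string, requiredBytes int64, timeout time.Duration) (*csi.ControllerExpandVolumeResponse, error) {
	client := csi.NewControllerClient(conn)
	// The deadline cancels the RPC on the client side only; the driver (and
	// the cloud operation it kicked off) may keep running and complete the
	// resize after this call has already failed with DeadlineExceeded.
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return client.ControllerExpandVolume(ctx, &csi.ControllerExpandVolumeRequest{
		VolumeId:      volumeID,
		CapacityRange: &csi.CapacityRange{RequiredBytes: requiredBytes},
	})
}

func main() {
	conn, err := grpc.Dial("unix:///csi/csi.sock", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	// With a very short timeout the first call fails while the cloud resize is
	// still in flight; a retry then finds a disk already at the requested size.
	_, err = expandWithTimeout(conn, "projects/.../zones/.../disks/pvc-...", 6442450944, time.Second)
	log.Println(err)
}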

@gnufied
Author

gnufied commented Feb 4, 2020

I thought the failure was on the node side. ControllerExpandVolume errors might be transient?

Do we know why the node requested such a large number? If you let the test run in a loop with a focus, I think we should be able to reproduce this bug.

@davidz627
Contributor

The failure is on the node side, but I was investigating whether the context cancellation was causing something odd to happen to the size that gets passed along to the node calls (since it goes through updating the PVC fields). I couldn't see anything obviously wrong there, so I'll do some runs tomorrow to see if I can repro.

@davidz627
Contributor

OK, I can repro the issue consistently by setting csiTimeout=1s for the resizer. The resizer logs during the failure look like this:

I0204 18:48:08.769881       1 connection.go:181] GRPC request: {"capacity_range":{"required_bytes":6442450944},"volume_id":"projects/dyzz-gke-dev/zones/us-central1-b/disks/pvc-0ae9f066-0bfc-4f10-b5af-bf808ad8698f"}
I0204 18:48:09.770113       1 connection.go:183] GRPC response: {}
I0204 18:48:09.770814       1 connection.go:184] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0204 18:48:09.770835       1 controller.go:360] Resize volume "pvc-0ae9f066-0bfc-4f10-b5af-bf808ad8698f" by resizer "pd.csi.storage.gke.io" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0204 18:48:09.770958       1 event.go:209] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"volume-expand-8471", Name:"csi-gcepdk92xz", UID:"0ae9f066-0bfc-4f10-b5af-bf808ad8698f", APIVersion:"v1", ResourceVersion:"8853", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-0ae9f066-0bfc-4f10-b5af-bf808ad8698f failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0204 18:48:09.776048       1 controller.go:224] Started PVC processing "volume-expand-8471/csi-gcepdk92xz"
I0204 18:48:09.783764       1 connection.go:180] GRPC call: /csi.v1.Controller/ControllerExpandVolume
I0204 18:48:09.784323       1 event.go:209] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"volume-expand-8471", Name:"csi-gcepdk92xz", UID:"0ae9f066-0bfc-4f10-b5af-bf808ad8698f", APIVersion:"v1", ResourceVersion:"8861", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-0ae9f066-0bfc-4f10-b5af-bf808ad8698f
I0204 18:48:09.783808       1 connection.go:181] GRPC request: {"capacity_range":{"required_bytes":6442450944},"volume_id":"projects/dyzz-gke-dev/zones/us-central1-b/disks/pvc-0ae9f066-0bfc-4f10-b5af-bf808ad8698f"}
I0204 18:48:10.783928       1 connection.go:183] GRPC response: {}
I0204 18:48:10.784620       1 connection.go:184] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0204 18:48:10.784639       1 controller.go:360] Resize volume "pvc-0ae9f066-0bfc-4f10-b5af-bf808ad8698f" by resizer "pd.csi.storage.gke.io" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0204 18:48:10.784777       1 event.go:209] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"volume-expand-8471", Name:"csi-gcepdk92xz", UID:"0ae9f066-0bfc-4f10-b5af-bf808ad8698f", APIVersion:"v1", ResourceVersion:"8861", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-0ae9f066-0bfc-4f10-b5af-bf808ad8698f failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0204 18:48:10.794913       1 controller.go:224] Started PVC processing "volume-expand-8471/csi-gcepdk92xz"
I0204 18:48:10.804918       1 connection.go:180] GRPC call: /csi.v1.Controller/ControllerExpandVolume
I0204 18:48:10.806146       1 event.go:209] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"volume-expand-8471", Name:"csi-gcepdk92xz", UID:"0ae9f066-0bfc-4f10-b5af-bf808ad8698f", APIVersion:"v1", ResourceVersion:"8865", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-0ae9f066-0bfc-4f10-b5af-bf808ad8698f
I0204 18:48:10.804942       1 connection.go:181] GRPC request: {"capacity_range":{"required_bytes":6442450944},"volume_id":"projects/dyzz-gke-dev/zones/us-central1-b/disks/pvc-0ae9f066-0bfc-4f10-b5af-bf808ad8698f"}
I0204 18:48:10.942067       1 connection.go:183] GRPC response: {"capacity_bytes":6917529027641082000,"node_expansion_required":true}
I0204 18:48:10.942556       1 connection.go:184] GRPC error: <nil>
I0204 18:48:10.942567       1 controller.go:363] Resize volume succeeded for volume "pvc-0ae9f066-0bfc-4f10-b5af-bf808ad8698f", start to update PV's capacity
I0204 18:48:10.953175       1 controller.go:369] Update capacity of PV "pvc-0ae9f066-0bfc-4f10-b5af-bf808ad8698f" to 6Ei succeeded

Here's a particularly interesting part (after the context timed out a couple of times):

I0204 18:48:10.804942       1 connection.go:181] GRPC request: {"capacity_range":{"required_bytes":6442450944},"volume_id":"projects/dyzz-gke-dev/zones/us-central1-b/disks/pvc-0ae9f066-0bfc-4f10-b5af-bf808ad8698f"}
I0204 18:48:10.942067       1 connection.go:183] GRPC response: {"capacity_bytes":6917529027641082000,"node_expansion_required":true}

Looks like the driver is responding with an incorrect size.

@davidz627
Contributor

Another interesting observation: the PD driver is expanding the disk in the cloud to the correct size, 6Gi, not 6Ei as the response would suggest. So the resize itself is correct; the response is incorrect.
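
The 6Ei figure is consistent with a unit mix-up: if the requested byte count is re-interpreted as a GiB count and converted back to bytes, you get exactly the capacity in the response (the log shows it float-rounded as 6917529027641082000). A quick check:

package main

import "fmt"

func main() {
	requestedBytes := int64(6442450944) // 6Gi, what the resizer asked for
	// Bytes mistakenly treated as GiB, then converted GiB -> bytes:
	fmt.Println(requestedBytes * (1 << 30)) // 6917529027641081856, i.e. 6Ei
}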

@davidz627
Contributor

Ah, I got it. If we resize successfully (but the client context times out) and then resize again, we fall into this if statement, which returns the request size in bytes, not GiB.
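
A sketch of that failure pattern, with hypothetical names (not the PD driver's actual code): the resize helper is expected to return the new size in GiB, but when the disk is already at the requested size, because the earlier resize succeeded even though the client's context timed out, the early return hands back the requested size in bytes, which the caller then converts from "GiB" to bytes for the CSI response.

package main

import "fmt"

const gib = int64(1 << 30)

// resizeDiskGiB is supposed to return the disk size in GiB after the resize.
func resizeDiskGiB(currentGiB, requestedBytes int64) int64 {
	requestedGiB := (requestedBytes + gib - 1) / gib
	if currentGiB >= requestedGiB {
		// Bug: the previous resize already took effect, and this early
		// return yields bytes where the caller expects GiB.
		return requestedBytes
	}
	// ... issue the cloud resize and return the new size in GiB ...
	return requestedGiB
}

func main() {
	// Second attempt after the timed-out-but-successful first resize:
	sizeGiB := resizeDiskGiB(6 /* disk is already 6GiB */, 6442450944)
	// The caller converts "GiB" back to bytes for the CSI response, which
	// turns 6442450944 into 6917529027641081856 (6Ei).
	fmt.Println(sizeGiB * gib)
}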

@gnufied
Author

gnufied commented Mar 3, 2020

Is this fixed? It looks to still be failing on https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/88488/pull-kubernetes-e2e-gce-csi-serial/1234883797321781250/

{kubelet e2e-cbad54f779-930d0-minion-group-8p22} FailedMount: MountVolume.MountDevice failed while expanding volume for volume "pvc-7b759602-c8d0-4971-90d1-ae6c1443e95a" : MountVolume.NodeExpandVolume failed : Expander.NodeExpand failed to expand the volume : rpc error: code = Internal desc = ControllerExpandVolume resize requested for 6917529027641081856 but after resize volume was size 6442450944

@gnufied
Author

gnufied commented Mar 3, 2020

I am tentatively reopening this. If it turns out I am wrong or missed something, we can close it.

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Mar 3, 2020
@k8s-ci-robot
Contributor

@gnufied: Reopened this issue.

In response to this:

I am tentatively reopening this. If it turns out I am wrong or missed something, we can close it.

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@davidz627
Contributor

Released a new version of the driver, v0.7.0, with the fix. It needs to be pulled into k8s.

@msau42
Contributor

msau42 commented Apr 3, 2020

upstream specs in k/k e2e have been updated
/close

@k8s-ci-robot
Contributor

@msau42: Closing this issue.

In response to this:

upstream specs in k/k e2e have been updated
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
