Skip to content

Attach queued for node error handled incorrectly #942

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mattcary opened this issue Mar 25, 2022 · 0 comments · Fixed by #943
Closed

Attach queued for node error handled incorrectly #942

mattcary opened this issue Mar 25, 2022 · 0 comments · Fixed by #943
Assignees

Comments

@mattcary
Copy link
Contributor

Thanks to @saikat-royc for finding this (and producing the details below).

Steps to repro bug (setup a force mock error in AttachDisk call to simulate failure of Disk API call):

Create a PVC and Pod.

ControllerPublish called. Mock attach error. Driver returns error to caller (external-attacher). Driver also marks the node with error .

/csi.v1.Controller/ControllerPublishVolume called with request: volume_id:"projects/saikatroyc-test/zones/us-central1-c/disks/pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c" node_id:"projects/saikatroyc-test/zones/us-central1-c/instances/gke-cluster-1-pool-1-7b0c9151-mz4q" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1648228620489-8081-pd.csi.storage.gke.io" > 

E0325 17:18:21.586455       1 utils.go:70] /csi.v1.Controller/ControllerPublishVolume returned with error: rpc error: code = Internal desc = unknown Attach error: force error attach disk

External attacher retries controllerpublish. This time driver directly queues the request as the node is marked with error. Driver returns success to the caller. External attacher thinks the attach has succeeded and marks the volumeattachment object as true.

I0325 17:18:21.594608       1 utils.go:67] /csi.v1.Controller/ControllerPublishVolume called with request: volume_id:"projects/saikatroyc-test/zones/us-central1-c/disks/pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c" node_id:"projects/saikatroyc-test/zones/us-central1-c/instances/gke-cluster-1-pool-1-7b0c9151-mz4q" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1648228620489-8081-pd.csi.storage.gke.io" > 

I0325 17:47:16.057475       1 utils.go:72] /csi.v1.Controller/ControllerPublishVolume returned with response: 


I0325 17:18:21.594695       1 controller.go:393] adding req %+v to queue volume_id:"projects/saikatroyc-test/zones/us-central1-c/disks/pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c" node_id:"projects/saikatroyc-test/zones/us-central1-c/instances/gke-cluster-1-pool-1-7b0c9151-mz4q" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1648228620489-8081-pd.csi.storage.gke.io" > 


$ k get volumeattachment
NAME                                                                   ATTACHER                PV                                         NODE                                 ATTACHED   AGE
csi-85f8cc823511c00a1ec50146e98066de9b07471ce02900aadfab3c63dfdf7c3f   pd.csi.storage.gke.io   pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c   gke-cluster-1-pool-1-7b0c9151-mz4q   true       55s

Eventually the worker process item from the queue, and starts another attempt of controllerpublish. The workitem that was queued used the context from the attach attempt of step 2. This context is already cancelled by the external attacher. Hence the driver gets context cancelled error when trying to do PD API calls with this stale context

processing workitem volume_id:"projects/saikatroyc-test/zones/us-central1-c/disks/pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c" node_id:"projects/saikatroyc-test/zones/us-central1-c/instances/gke-cluster-1-pool-1-7b0c9151-mz4q" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1648228620489-8081-pd.csi.storage.gke.io" >  from queue
I0325 17:18:21.600504       1 gce-compute.go:122] Repairing potentially underspecified volume key Key{"pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c", zone: "us-central1-c"}
I0325 17:18:21.600519       1 gce-compute.go:207] Getting disk Key{"pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c", zone: "us-central1-c"}
E0325 17:18:21.600703       1 controller.go:364] ControllerPublishVolume failed with error: rpc error: code = Internal desc = Unknown get disk error: context canceled

Now the volume lifecycle proceeds to node stage, even though the disk was not actually attached to the node and keeps failing in node stage operation

E0325 17:24:47.074472       1 utils.go:70] /csi.v1.Node/NodeStageVolume returned with error: rpc error: code = Internal desc = Error when getting device path: rpc error: code = Internal desc = error verifying GCE PD ("pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c") is attached: failed to find and re-link disk pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c with udevadm after retrying for 3s: failed to trigger udevadm fix: udevadm --trigger requested to fix disk pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c but no such disk was found

/assign @mattcary

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant