Attach queued for node error handled incorrectly #942

mattcary · 2022-03-25T20:44:17Z

Thanks to @saikat-royc for finding this (and producing the details below).

Steps to repro bug (setup a force mock error in AttachDisk call to simulate failure of Disk API call):

Create a PVC and Pod.

ControllerPublish called. Mock attach error. Driver returns error to caller (external-attacher). Driver also marks the node with error .

/csi.v1.Controller/ControllerPublishVolume called with request: volume_id:"projects/saikatroyc-test/zones/us-central1-c/disks/pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c" node_id:"projects/saikatroyc-test/zones/us-central1-c/instances/gke-cluster-1-pool-1-7b0c9151-mz4q" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1648228620489-8081-pd.csi.storage.gke.io" > 

E0325 17:18:21.586455       1 utils.go:70] /csi.v1.Controller/ControllerPublishVolume returned with error: rpc error: code = Internal desc = unknown Attach error: force error attach disk

External attacher retries controllerpublish. This time driver directly queues the request as the node is marked with error. Driver returns success to the caller. External attacher thinks the attach has succeeded and marks the volumeattachment object as true.

I0325 17:18:21.594608       1 utils.go:67] /csi.v1.Controller/ControllerPublishVolume called with request: volume_id:"projects/saikatroyc-test/zones/us-central1-c/disks/pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c" node_id:"projects/saikatroyc-test/zones/us-central1-c/instances/gke-cluster-1-pool-1-7b0c9151-mz4q" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1648228620489-8081-pd.csi.storage.gke.io" > 

I0325 17:47:16.057475       1 utils.go:72] /csi.v1.Controller/ControllerPublishVolume returned with response: 


I0325 17:18:21.594695       1 controller.go:393] adding req %+v to queue volume_id:"projects/saikatroyc-test/zones/us-central1-c/disks/pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c" node_id:"projects/saikatroyc-test/zones/us-central1-c/instances/gke-cluster-1-pool-1-7b0c9151-mz4q" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1648228620489-8081-pd.csi.storage.gke.io" > 


$ k get volumeattachment
NAME                                                                   ATTACHER                PV                                         NODE                                 ATTACHED   AGE
csi-85f8cc823511c00a1ec50146e98066de9b07471ce02900aadfab3c63dfdf7c3f   pd.csi.storage.gke.io   pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c   gke-cluster-1-pool-1-7b0c9151-mz4q   true       55s

Eventually the worker process item from the queue, and starts another attempt of controllerpublish. The workitem that was queued used the context from the attach attempt of step 2. This context is already cancelled by the external attacher. Hence the driver gets context cancelled error when trying to do PD API calls with this stale context

processing workitem volume_id:"projects/saikatroyc-test/zones/us-central1-c/disks/pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c" node_id:"projects/saikatroyc-test/zones/us-central1-c/instances/gke-cluster-1-pool-1-7b0c9151-mz4q" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1648228620489-8081-pd.csi.storage.gke.io" >  from queue
I0325 17:18:21.600504       1 gce-compute.go:122] Repairing potentially underspecified volume key Key{"pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c", zone: "us-central1-c"}
I0325 17:18:21.600519       1 gce-compute.go:207] Getting disk Key{"pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c", zone: "us-central1-c"}
E0325 17:18:21.600703       1 controller.go:364] ControllerPublishVolume failed with error: rpc error: code = Internal desc = Unknown get disk error: context canceled

Now the volume lifecycle proceeds to node stage, even though the disk was not actually attached to the node and keeps failing in node stage operation

E0325 17:24:47.074472       1 utils.go:70] /csi.v1.Node/NodeStageVolume returned with error: rpc error: code = Internal desc = Error when getting device path: rpc error: code = Internal desc = error verifying GCE PD ("pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c") is attached: failed to find and re-link disk pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c with udevadm after retrying for 3s: failed to trigger udevadm fix: udevadm --trigger requested to fix disk pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c but no such disk was found

/assign @mattcary

The text was updated successfully, but these errors were encountered:

k8s-ci-robot assigned mattcary Mar 25, 2022

mattcary mentioned this issue Mar 25, 2022

Return error when queued #943

Merged

mattcary closed this as completed in #943 Mar 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Attach queued for node error handled incorrectly #942

Attach queued for node error handled incorrectly #942

mattcary commented Mar 25, 2022

Attach queued for node error handled incorrectly #942

Attach queued for node error handled incorrectly #942

Comments

mattcary commented Mar 25, 2022