Thanks to @saikat-royc for finding this (and producing the details below).
Steps to reproduce the bug (force a mock error in the AttachDisk call to simulate a failed Disk API call):
Create a PVC and Pod.
ControllerPublish is called and the mock attach error fires. The driver returns the error to the caller (external-attacher) and also marks the node with an error.
/csi.v1.Controller/ControllerPublishVolume called with request: volume_id:"projects/saikatroyc-test/zones/us-central1-c/disks/pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c" node_id:"projects/saikatroyc-test/zones/us-central1-c/instances/gke-cluster-1-pool-1-7b0c9151-mz4q" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1648228620489-8081-pd.csi.storage.gke.io" >
E0325 17:18:21.586455 1 utils.go:70] /csi.v1.Controller/ControllerPublishVolume returned with error: rpc error: code = Internal desc = unknown Attach error: force error attach disk
The external attacher retries ControllerPublish. This time the driver queues the request directly because the node is marked with an error, and returns success to the caller. The external attacher concludes that the attach succeeded and marks the VolumeAttachment object as attached (true).
I0325 17:18:21.594608 1 utils.go:67] /csi.v1.Controller/ControllerPublishVolume called with request: volume_id:"projects/saikatroyc-test/zones/us-central1-c/disks/pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c" node_id:"projects/saikatroyc-test/zones/us-central1-c/instances/gke-cluster-1-pool-1-7b0c9151-mz4q" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1648228620489-8081-pd.csi.storage.gke.io" >
I0325 17:47:16.057475 1 utils.go:72] /csi.v1.Controller/ControllerPublishVolume returned with response:
I0325 17:18:21.594695 1 controller.go:393] adding req %+v to queue volume_id:"projects/saikatroyc-test/zones/us-central1-c/disks/pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c" node_id:"projects/saikatroyc-test/zones/us-central1-c/instances/gke-cluster-1-pool-1-7b0c9151-mz4q" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1648228620489-8081-pd.csi.storage.gke.io" >
$ k get volumeattachment
NAME ATTACHER PV NODE ATTACHED AGE
csi-85f8cc823511c00a1ec50146e98066de9b07471ce02900aadfab3c63dfdf7c3f pd.csi.storage.gke.io pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c gke-cluster-1-pool-1-7b0c9151-mz4q true 55s
Eventually the worker processes the item from the queue and starts another ControllerPublish attempt. The queued work item uses the context from the attach attempt of step 2, which the external attacher has already cancelled. The driver therefore gets a context cancelled error when it makes PD API calls with this stale context.
The volume lifecycle now proceeds to node stage even though the disk was never attached to the node, and the node stage operation keeps failing:
E0325 17:24:47.074472 1 utils.go:70] /csi.v1.Node/NodeStageVolume returned with error: rpc error: code = Internal desc = Error when getting device path: rpc error: code = Internal desc = error verifying GCE PD ("pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c") is attached: failed to find and re-link disk pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c with udevadm after retrying for 3s: failed to trigger udevadm fix: udevadm --trigger requested to fix disk pvc-e23d3160-7239-4fc6-a301-3e4b7d84d02c but no such disk was found
/assign @mattcary