Disk failed to relink with udevadm #608

jpds · 2020-09-15T19:33:02Z

I've been running a database cluster with a statefulset for a number of months of GKE. Yesterday I upgraded to to regular channel version 1.16.13-gke.401 and today I found that that statefulset had its first pod with this in describe pod:

Events:
  Type     Reason       Age                   From                                                          Message
  ----     ------       ----                  ----                                                          -------
  Warning  FailedMount  29m (x19 over 97m)    kubelet, gke-cluster-f51460e-nodes-preemptible-n-593b5f7a-rej6  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[data stolon-secrets stolon-token-xxxxx]: timed out waiting for the condition
  Warning  FailedMount  9m15s (x45 over 98m)  kubelet, gke-cluster-f51460e-nodes-preemptible-n-593b5f7a-rej6  MountVolume.MountDevice failed for volume "pvc-2e9a43cc-3655-4d64-af9e-9b4643b629a2" : rpc error: code = Internal desc = Error when getting device path: error verifying GCE PD ("pvc-2e9a43cc-3655-4d64-af9e-9b4643b629a2") is attached: failed to find and re-link disk pvc-2e9a43cc-3655-4d64-af9e-9b4643b629a2 with udevadm after retrying for 3s: failed to trigger udevadm fix: udevadm --trigger requested to fix disk pvc-2e9a43cc-3655-4d64-af9e-9b4643b629a2 but no such disk was found
  Warning  FailedMount  4m10s (x9 over 90m)   kubelet, gke-cluster-f51460e-nodes-preemptible-n-593b5f7a-rej6  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[stolon-secrets stolon-token-xxxxx data]: timed out waiting for the condition

The text was updated successfully, but these errors were encountered:

mattcary · 2020-09-15T19:37:06Z

Can you confirm that the PD behind the associated PV/PVC is healthy?

jpds · 2020-09-15T21:49:43Z

It is indeed, I did kubectl delete pod db-0 and the database cluster came back after that.

mattcary · 2020-09-15T21:53:03Z

Weird, so pod 0 must have gotten in to a weird state? Possibly the attach failed but the kubelet couldn't figure it out, and deleting the pod forced a retry.

Maybe just one of those things, but I wonder if there is a race condition somewhere we're missing.

msau42 · 2020-09-15T22:32:35Z

By the time we get to the MountDevice error, I believe the Attach call should have succeded.

We have seem some strange udev problems in the past where udev missed an event or didn't update its device cache. We've tried to workaround it by doing a "udev trigger" when things seem strange, but there's potentially still some scenarios where that won't help.

jpds · 2020-09-16T21:16:49Z

This just failed again, there's definitely a regression between 1.16.3-gke.1 and .401.

msau42 · 2020-09-18T23:52:26Z

Do you mean 1.16.13-gke.1 and .401? I don't think we ever released a 1.16.3-gke.1. If so, both those versions use v0.7.0-gke.0 of the driver.

If you meant 1.15.13-gke.1, then the pdcsi driver was on v0.6.0-gke.0, and potentially #459 could be playing a factor.

Is this issue reliably reproducible? Are you able to ssh to the node and inspect the local path to the PD and what its udev link and scsi 83 page says?

jpds · 2020-09-25T21:55:51Z

Indeed, 1.16.13-gke.1 and .401 - sorry.

I upgraded the cluster to regular-channel/1.17.9-gke.1504 last Friday and it seems to have not had this issue so far.

msau42 · 2020-09-25T22:44:33Z

1.17.9-gke.1504 uses v0.7.0-gke.0 of the pd driver.

Given all the versions you used had the same driver version that includes #459, maybe it could be an OS issue. cc @mattcary

Jiawei0227 · 2020-10-05T18:28:47Z

Also hit the same issue. One thing interesting is I found my CSIDriver object is misconfigured.

$ kubectl get csidriver                                                             
NAME                    ATTACHREQUIRED   PODINFOONMOUNT   MODES        AGE
pd.csi.storage.gke.io   false            false            Persistent   6d20h

The attachrequired field should set to true.

fejta-bot · 2021-01-03T19:10:54Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · 2021-02-02T19:54:14Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

anemade · 2021-02-17T08:56:14Z

I am facing the same issue I believe.
I am using server with

Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.14-gke.1600", GitCommit:"44dfef46993969762b5df37fe3dda47621476708", GitTreeState:"clean", BuildDate:"2021-01-12T09:20:19Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}

Events:
  Type     Reason                  Age                 From                     Message
  ----     ------                  ----                ----                     -------
  Normal   Scheduled               2m20s               default-scheduler        Successfully assigned pre-1361943147524919296/prof-tbyg-nginx-rc-np5sz to gke-ci-cluster-default-pool-c344e6a7-3pb1
  Normal   SuccessfulAttachVolume  2m7s                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-c3d4cc67-c3da-47a7-8eb7-3f3959b32afb"
  Warning  FailedMount             56s (x8 over 2m4s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-c3d4cc67-c3da-47a7-8eb7-3f3959b32afb" : rpc error: code = Internal desc = Error when getting device path: rpc error: code = Internal desc = error verifying GCE PD ("pvc-c3d4cc67-c3da-47a7-8eb7-3f3959b32afb") is attached: failed to find and re-link disk pvc-c3d4cc67-c3da-47a7-8eb7-3f3959b32afb with udevadm after retrying for 3s: failed to trigger udevadm fix: failed to get SCSI Serial num: scsi_id failed for device "/dev/sdb" with output : exit status 1
  Warning  FailedMount             17s                 kubelet                  Unable to attach or mount volumes: unmounted volumes=[nginx-rc-volume], unattached volumes=[default-token-rzz2g nginx-rc-volume]: timed out waiting for the condition

I have enabled the CSI using this. It's using the csi driver v1.0.1-gke.0

mattcary · 2021-02-17T18:37:56Z

/lifecycle frozen

madhurnawandar · 2021-08-04T05:57:04Z

Any update on this issue. This is easily reproducible and hitting us frequently

mattcary · 2021-08-05T21:19:14Z

@madhurnawandar can you post steps to get a repro here?

umeshkumhar · 2021-08-11T04:56:31Z

@mattcary Hey
This issue is coming as an intermittent issue and we are able to regenerate the issue, when we install 4-5 applications with PVCs in single namespace then one of the application always break with volume mount issue with below error:

Error when getting device path: rpc error: code = Internal desc = error verifying GCE PD ("pvc-8419ba38-f017-4b09-a17c-675377299ceb") is attached: failed to find and re-link disk pvc-8419ba38-f017-4b09-a17c-675377299ceb with udevadm after retrying for 3s: failed to trigger udevadm fix: failed to get SCSI Serial num: scsi_id failed for device "/dev/sdb" with output : exit status 1

We have tried upgrading gce-pd-csidriver to the latest stable and still the issuer is the same.
Seems the relevant code is

gcp-compute-persistent-disk-csi-driver/pkg/mount-manager/device-utils.go

Line 121 in 82070d8

    
           return "", fmt.Errorf("scsi_id failed for device %q with output %s: %v", devicePath, string(out), err)

Any clues?

mattcary · 2021-08-11T15:52:41Z

@umeshkumhar We haven't had a chance to reproduce it. Are you able to come up with even an intermittently reproducing test case?

umeshkumhar · 2021-08-25T07:11:33Z

@mattcary We just install helm chart for MongoDB, Redis and MySQL with PVCs in a single namespace, At random any PVC will fail and block out ginkgo test execution,

It somethings works fine, that is 1 in 20 execution.

We also tried re-creating new GKE cluster with 1.18.20-gke.901

Please suggest something, its a blocker for us.

ivanbenko · 2021-09-24T06:45:59Z

Looks like I'm having the same issue

MountVolume.MountDevice failed for volume "pvc-a851fe88-219b-442c-a52c-50848fda18af" : rpc error: code = Internal desc = Error when getting device path: error verifying GCE PD ("pvc-a851fe88-219b-442c-a52c-50848fda18af") is attached: failed to find and re-link disk pvc-a851fe88-219b-442c-a52c-50848fda18af with udevadm after retrying for 3s: failed to trigger udevadm fix: failed to get SCSI Serial num: scsi_id failed for device "/dev/sdaa" with output : exit status 1

duythinht · 2022-03-09T04:42:15Z

I'm facing this issue on pre-emptible node pool, whever a node was preempt, the pods status is changing to ContainerCreating and stuck on mount device

Warning  FailedMount  4m51s (x167 over 5h29m)  kubelet  MountVolume.MountDevice failed for volume "pvc-5628bf1d-78c7-4819-8391-6e76a841234a" : rpc error: code = Internal desc = Error when getting device path: rpc error: code = Internal desc = error verifying GCE PD ("pvc-5628bf1d-78c7-4819-8391-6e76a841234a") is attached: failed to find and re-link disk pvc-5628bf1d-78c7-4819-8391-6e76a841234a with udevadm after retrying for 3s: failed to trigger udevadm fix: udevadm --trigger requested to fix disk pvc-5628bf1d-78c7-4819-8391-6e76a841234a but no such disk was found

any update yet on this issue?

mattcary · 2022-03-10T02:26:43Z

We are actively looking into several related issues. Do you have a reliable reproduction recipie? Thx.

lapcchan · 2022-03-21T09:43:55Z

i have the same problem on gke, it usually happens on preempted node with statefulset pvc template. it seems to be related to the node and i can always fix it by draining the node where the pod scheduled on.

cristicbz · 2022-04-06T12:51:27Z

We are also hitting this with pre-emptible node pools, post pre-emption. GKE 1.22

Edit: Deleting the pods would get them stuck in terminating, deleteting with --force, removed them but on recreation they would fail the same way. Only way to fix this for us was to manually delete the instances and scale the node pool back up, to recreate the ndoes.

mattcary · 2022-04-06T15:29:40Z

We have seen some cases where the disk is attached to the host machine, but a deadlock is happening in the guest OS that is preventing the device from appearing in sysfs (which then explains the error messages above, where the device can't be found).

However we have not been able to reproduce this in order to figure out the root cause. So any further clues you can give would be helpful.

The first thing is to identify if this is the same problem: does the PD fail to appear in /dev/disk/by-id? Is the device missing from lsscsi or /sys/class/scsi_disk? (it's hard to positively identify a missing disk, all I've been able to do is count the number of scsi devices and see if it matches what is supposed to be attached to the VM, eg from gcloud compute instances describe).

If so, look in dmesg for something like "virtio_scsi: SCSI device 0 0 176 0 not found". Look for stuck kworker threads (ie, kworker threads in state "D" in ps). If there's such a stuck thread, look in /proc//stack for something like "INFO: task kworker/6:9:2470111 blocked for more than 327 seconds" and long stack traces including __scsi_add_disk or __scsi_remove_disk.

If so, it's a similar problem, and we'd appreciate more information on how this is happening, including your machine shape, number of attached devices, number of pods on the node, etc.

pdfrod · 2022-05-06T13:51:06Z

Alright, I'm upgrading from 1.22.6-gke.200 to 1.22.8-gke.200 (which already includes #951 from what I can see). I'll report back if the problem happens again.

Mokto · 2022-05-07T09:04:29Z

Same issue on 1.22.8-gke.200.

Edit: same on
1.23.5-gke.1501

brancz · 2022-05-09T08:40:59Z

We've been running into this for a few months now as well (currently on 1.22.7). We'll upgrade and report back again, but happy to investigate together on our cluster(s) if that's helpful!

Mokto · 2022-05-16T09:49:37Z

This issue is super annoying and I hope it will be fixed soon, but in the meantime, you can use this piece I code I've written:

package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	v1 "k8s.io/api/core/v1"
	storage "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	types "k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func check(e error) {
	if e != nil {
		panic(e)
	}
}

func main() {
	config, err := rest.InClusterConfig()
	check(err)
	// creates the clientset
	clientset, err := kubernetes.NewForConfig(config)
	check(err)
	for {
		volumes, err := clientset.StorageV1().VolumeAttachments().List(context.TODO(), metav1.ListOptions{})
		check(err)
		fmt.Println(len(volumes.Items), "volumes")
		persistentVolumeClaims, err := clientset.CoreV1().PersistentVolumeClaims("").List(context.TODO(), metav1.ListOptions{})
		check(err)
		fmt.Println(len(persistentVolumeClaims.Items), "persistentVolumeClaims")

		pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
		check(err)
		fmt.Printf("There are %d pods in the cluster\n", len(pods.Items))

		for _, pod := range pods.Items {
			if pod.Status.Phase == "Pending" && time.Now().After(pod.CreationTimestamp.Add(10*time.Minute)) {
				fmt.Printf("Pod name: %s, Namespace: %s, Status: %s\n for more than 10min", pod.Name, pod.Namespace, pod.Status.Phase)

				var volumeClaimFound v1.PersistentVolumeClaim
				for _, volumeclaim := range persistentVolumeClaims.Items {
					if volumeclaim.Namespace == pod.Namespace && strings.Contains(volumeclaim.Name, pod.Name) {
						volumeClaimFound = volumeclaim
						break
					}
				}
				fmt.Println(volumeClaimFound.Spec.VolumeName, "Found")

				var volumeAttachmentFound storage.VolumeAttachment
				for _, volumeAttachment := range volumes.Items {
					if volumeAttachment.Spec.Source.PersistentVolumeName != nil && *volumeAttachment.Spec.Source.PersistentVolumeName == volumeClaimFound.Spec.VolumeName {
						volumeAttachmentFound = volumeAttachment
						break
					}
				}

				fmt.Println(volumeAttachmentFound.Name, "Found")

				_, err := clientset.StorageV1().VolumeAttachments().Patch(
					context.Background(),
					volumeAttachmentFound.Name,
					types.JSONPatchType,
					[]byte(`[{"op": "remove", "path": "/metadata/finalizers"}]`),
					metav1.PatchOptions{},
				)
				check(err)

				err = clientset.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{})
				check(err)

			}
		}
		time.Sleep(10 * time.Minute)
	}
}

When deployed as a container in Kubernetes, every 10min, it :

checks if any pod has been pending for more than 10min
if so finalizes the attached volume
and then restart the pod

Hope it helps.

mattcary · 2022-05-17T15:42:01Z

I think the several recent comments should actually refer to #987.

While I appreciate the feedback from the community, it would be helpful to read the contents of the bug and try to be accurate about where to post. It's not very helpful if we spend more time figuring out which comments are attached to which issue than working on the problem.

ialidzhikov · 2022-05-28T17:09:37Z

/kind bug

jenting · 2022-11-11T03:48:23Z

If so, look in dmesg for something like "virtio_scsi: SCSI device 0 0 176 0 not found". Look for stuck kworker threads (ie, kworker threads in state "D" in ps). If there's such a stuck thread, look in /proc//stack for something like "INFO: task kworker/6:9:2470111 blocked for more than 327 seconds" and long stack traces including __scsi_add_disk or __scsi_remove_disk.

If so, it's a similar problem, and we'd appreciate more information on how this is happening, including your machine shape, number of attached devices, number of pods on the node, etc.

@mattcary
We check the dmesg and our error is the same as the above description. We are using the v1.8.0 GCP PD CSI driver version.

We found that this error message is shown first.

Failed to disabled device /dev/disk/by-id/google-pvc-8a92f4f4-e9dc-4e20-948c-6f4fea8f08aa for volume projects/workspace-clusters/zones/us-west1-b/disks/pvc-8a92f4f4-e9dc-4e20-948c-6f4fea8f08aa. Device may not be detached cleanly (error is ignored and unstaging is continuing): open /sys/block/google-pvc-8a92f4f4-e9dc-4e20-948c-6f4fea8f08aa/device/state: no such file or directory

And then the following new PVC can't be mounted successfully.

 Warning  FailedMount             21m (x25 over 56m)  kubelet                  MountVolume.MountDevice failed for volume "pvc-b5d1cab4-5261-4ad0-8fc3-fb87f773558c" : rpc error: code = Internal desc = Error when getting device path: rpc error: code = Internal desc = error verifying GCE PD ("pvc-b5d1cab4-5261-4ad0-8fc3-fb87f773558c") is attached: failed to find and re-link disk pvc-b5d1cab4-5261-4ad0-8fc3-fb87f773558c with udevadm after retrying for 3s: failed to trigger udevadm fix of non existent disk for "pvc-b5d1cab4-5261-4ad0-8fc3-fb87f773558c": udevadm --trigger requested to fix disk pvc-b5d1cab4-5261-4ad0-8fc3-fb87f773558c but no such disk was found
 Warning  FailedMount             5m (x22 over 54m)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[vol-this-workspace], unattached volumes=[vol-this-workspace daemon-mount]: timed out waiting for the condition
 Warning  FailedMount             54s (x5 over 23m)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[vol-this-workspace], unattached volumes=[daemon-mount vol-this-workspace]: timed out waiting for the condition

We think the problem is the VM cannot disconnect the disk. We find that

The kernel messages indicating timeouts related to virtio_scsi
The kernel workqueue still has in-flight operations related to virtio_scsi.
Q: Is it a platform issue? Can the CSI driver able to fix it?

[  605.518214] INFO: task kworker/8:34:16931 blocked for more than 241 seconds.
[  605.526795]       Not tainted 5.15.0-47-generic #51-Ubuntu
[  605.533775] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.543112] task:kworker/8:34    state:D stack:    0 pid:16931 ppid:     2 flags:0x00004000
[  605.543114] Workqueue: events_freezable virtscsi_handle_event [virtio_scsi]
[  605.543117] Call Trace:
[  605.543118]  <TASK>
[  605.543119]  __schedule+0x23d/0x590
[  605.543121]  ? class_dir_release+0xe/0x20
[  605.543124]  schedule+0x4e/0xc0
[  605.543126]  blk_mq_freeze_queue_wait+0x6d/0xa0
[  605.543128]  ? wait_woken+0x70/0x70
[  605.543131]  del_gendisk+0x1c0/0x270
[  605.543133]  sd_remove+0x45/0x90
[  605.543135]  __device_release_driver+0x287/0x2a0
[  605.543137]  device_release_driver+0x29/0x40
[  605.543138]  bus_remove_device+0xde/0x150
[  605.543140]  device_del+0x19c/0x400
[  605.543141]  __scsi_remove_device+0x12c/0x170
[  605.543143]  scsi_remove_device+0x27/0x40
[  605.543145]  virtscsi_handle_event+0x133/0x2b0 [virtio_scsi]
[  605.543146]  process_one_work+0x22b/0x3d0
[  605.543148]  worker_thread+0x53/0x420
[  605.543149]  ? process_one_work+0x3d0/0x3d0
[  605.543150]  kthread+0x12a/0x150
[  605.543152]  ? set_kthread_struct+0x50/0x50
[  605.543154]  ret_from_fork+0x22/0x30
[  605.543156]  </TASK>

---

[ 1881.223689] Showing busy workqueues and worker pools:
[ 1881.223691] workqueue events_freezable: flags=0x4
[ 1881.223692]   pwq 16: cpus=8 node=0 flags=0x0 nice=0 active=8/256 refcnt=9
[ 1881.223695]     in-flight: 16931:virtscsi_handle_event [virtio_scsi], 16944:virtscsi_handle_event [virtio_scsi], 16933:virtscsi_handle_event [virtio_scsi], 16961:virtscsi_handle_event [virtio_scsi], 17488:virtscsi_handle_event [virtio_scsi], 16907:virtscsi_handle_event [virtio_scsi], 17527:virtscsi_handle_event [virtio_scsi], 16938:virtscsi_handle_event [virtio_scsi]
[ 1881.223740] pool 16: cpus=8 node=0 flags=0x0 nice=0 hung=0s workers=11 idle: 17451 16904 17531

We create 11 pods on a node, and for these 11 pods, each pod mounts a PVC (with GCP standard PD, Size 30Gi)
For each pod, we run the cpu+memory+fio test.

# CPU test
stress-ng --cpu 3 --backoff 10000000 --timeout 600s

# memory test
stress-ng --vm 1 --vm-keep --vm-bytes 6G

# fio test
fio --name io-stress --eta-newline=5s --filename=/tmp/fio.temp --rw=randrw --size=2g --io_size=50g --blocksize=8k --ioengine=libaio --fsync=1000 --iodepth=10 --direct=1 --numjobs=20 --runtime=600

These 11 pods are running for two minutes, then removing these 11 pods and recreating the new 11 pods.
After several rounds, this issue has a chance to be reproducible.

mattcary · 2022-11-11T17:05:19Z

Thanks, that confirms what we have found internally. GCE is pushing out a fix for a problem where if a PD is detached while there is in-flight IO, the node becomes poisioned and can no longer attach any PDs. The only solution is to delete the node.

The fix for this is being rolled out, but the process is slow and will not be across the whole fleet until the beginning of next year if I understand correctly.

msau42 · 2022-11-11T18:08:27Z

The other half of the issue is why the PD was not able to be cleanly unmounted before detach. kubernetes/kubernetes#110721 may help avoid that, but it seems like there may still be an issue of container teardown or unmount taking too long.

jenting · 2022-11-11T22:42:04Z

Thank you @mattcary

The only solution is to delete the node.

This is sad.

The fix for this is being rolled out, but the process is slow and will not be across the whole fleet until the beginning of next year if I understand correctly.

Where can we know the roll out plan?
Or can we know which GCE regions have fix this issue?

mattcary · 2022-11-11T22:49:13Z

This is sad.

Yeah :-/

The fix for this is being rolled out, but the process is slow and will not be across the whole fleet until the beginning of next year if I understand correctly.

Where can we know the roll out plan? Or can we know which GCE regions have fix this issue?

If I understand correctly, the fix is currently in experiment now, which means that random VMs are getting the fix. There is no public tracking of this fix that I know of unfortunately.

kylos101 · 2022-11-14T13:31:35Z

Hi @mattcary , are you aware of a workaround we might be able to use until the experimental fix is GA? 🙏 It's pretty common for us to run into this issue, because we often detach and attach new PVCs to existing nodes (one PVC for each workload).

mattcary · 2022-11-14T17:04:53Z

Hi @mattcary , are you aware of a workaround we might be able to use until the experimental fix is GA? 🙏 It's pretty common for us to run into this issue, because we often detach and attach new PVCs to existing nodes (one PVC for each workload).

You should only be running into this problem if you're detaching disks while IO is in flight. K8s doesn't do that normally---it waits for a clean unmount---and so it's only in a very particular situation where this happens (queuing or other behavior on the kubelet leading to the 6m force detach flow). I'm trying to be precise here about the distinction from detach and unmount. Unmounting while IO is in flight won't case this problem; it may cause problems for the filesystem but from the block device level once the system unmounts there won't be IO and so detach is safe. Hence sipmly terminating pods doesn't cause this error.

Attaching and detaching PVCs to existing nodes will not normally cause this behavior. So that may mean something else is going on---in which case this fix isn't going to help you.

To reiterate, you're only hitting this problem if you're detaching disks (either manually, or because of nodes getting into an unusually overloaded state) while there is active IO on the device. Normal PVC workflows do not cause this behavior as far as we've seen.

Of course you may have found a new corner case, so please post details of your problems. But I suspect that this is not your issue.

jenting · 2022-11-17T07:44:40Z

To reiterate, you're only hitting this problem if you're detaching disks (either manually, or because of nodes getting into an unusually overloaded state) while there is active IO on the device. Normal PVC workflows do not cause this behavior as far as we've seen.

@mattcary
It's interesting. Because of the above community feedback, we did not see any reports mentioning manually detaching the disk.

How can we identify the node that goes into the unusually overloaded state? Will the node state change to NotReady because the node has disk pressure?

Note: we monitored the node state, but none ever went into the NotReady state when we performed the testing.

This is being done as a temporary workaround for kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#608 that we're constantly seeing in the CI environment in GCP Signed-off-by: Oren Cohen <[email protected]>

mattcary · 2022-11-17T20:38:45Z

How can we identify the node that goes into the unusually overloaded state? Will the node state change to NotReady because the node has disk pressure?

Note: we monitored the node state, but none ever went into the NotReady state when we performed the testing.

It hasn't shown up in the kubelet monitored bits. The thing to look for is in the attacher controller logs in the kube-controller for lots of attach/detach operations, maybe taking a long time (> 30s), and the attacher controller doing a force detach.

mattcary · 2023-05-15T15:36:30Z

/close

The poisoned node fix is rolled out.

We have seen sporadic reports of other device related problems, but they seem to not be related to this bug so it would be better to start a new issue for them.

k8s-ci-robot · 2023-05-15T15:36:35Z

@mattcary: Closing this issue.

In response to this:

/close

The poisoned node fix is rolled out.

We have seen sporadic reports of other device related problems, but they seem to not be related to this bug so it would be better to start a new issue for them.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

MalteMagnussen · 2025-02-05T15:45:49Z

i have the same problem on gke, it usually happens on preempted node with statefulset pvc template. it seems to be related to the node and i can always fix it by draining the node where the pod scheduled on.

That solved it. Thanks @lapcchan

I cordoned the node the pod that was using the PVC was on.
Then I deleted the pod.
GKE booted up a new node and put the pod on it.
Then I uncordoned the original node, and the pod was MOVED BACK to that node, but is now attaching the disk just fine?? Very strange.
GKE deleted the extra node it booted up.

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 3, 2021

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 2, 2021

k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Feb 17, 2021

ulysseskan mentioned this issue Apr 15, 2022

7.0.36 unable to mount volumes on Google Cloud gravitational/gravity#2727

Closed

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 28, 2022

jenting mentioned this issue Sep 27, 2022

[PVC] loadgen testing Pod can't mount Volume gitpod-io/gitpod#13353

Closed

orenc1 mentioned this issue Nov 17, 2022

[CI] Add HPP for HyperShift/Kubevirt deployments openshift/hypershift#1874

Merged

4 tasks

wajihyassine mentioned this issue Jan 31, 2023

Virtio deadlock causing failure to attach persistent disks to VM in high loads google/turbinia#1249

Closed

mattcary mentioned this issue Feb 3, 2023

MountDevice doesn't recover with temp failures #1123

Closed

eranco74 mentioned this issue Feb 13, 2023

MGMT-13612: Add confidential computing workflow openshift/release#36216

Closed

k8s-ci-robot closed this as completed May 15, 2023

devoxel mentioned this issue Jul 5, 2023

pdcsi 0.14.9 - failed to find and re-link disk with udevadm #1289

Closed

Disk failed to relink with udevadm #608

Disk failed to relink with udevadm #608

Comments

jpds commented Sep 15, 2020

mattcary commented Sep 15, 2020

jpds commented Sep 15, 2020

mattcary commented Sep 15, 2020

msau42 commented Sep 15, 2020

jpds commented Sep 16, 2020

msau42 commented Sep 18, 2020

jpds commented Sep 25, 2020

msau42 commented Sep 25, 2020

Jiawei0227 commented Oct 5, 2020 • edited Loading

fejta-bot commented Jan 3, 2021

fejta-bot commented Feb 2, 2021

anemade commented Feb 17, 2021 • edited Loading

mattcary commented Feb 17, 2021

madhurnawandar commented Aug 4, 2021

mattcary commented Aug 5, 2021

umeshkumhar commented Aug 11, 2021

mattcary commented Aug 11, 2021

umeshkumhar commented Aug 25, 2021

ivanbenko commented Sep 24, 2021

duythinht commented Mar 9, 2022

mattcary commented Mar 10, 2022

lapcchan commented Mar 21, 2022 • edited Loading

cristicbz commented Apr 6, 2022 • edited Loading

mattcary commented Apr 6, 2022

pdfrod commented May 6, 2022

Mokto commented May 7, 2022 • edited Loading

brancz commented May 9, 2022

Mokto commented May 16, 2022

mattcary commented May 17, 2022

ialidzhikov commented May 28, 2022

jenting commented Nov 11, 2022 • edited Loading

mattcary commented Nov 11, 2022

msau42 commented Nov 11, 2022

jenting commented Nov 11, 2022

mattcary commented Nov 11, 2022

kylos101 commented Nov 14, 2022

mattcary commented Nov 14, 2022

jenting commented Nov 17, 2022 • edited Loading

mattcary commented Nov 17, 2022

mattcary commented May 15, 2023

k8s-ci-robot commented May 15, 2023

MalteMagnussen commented Feb 5, 2025 • edited Loading

Jiawei0227 commented Oct 5, 2020 •

edited

Loading

anemade commented Feb 17, 2021 •

edited

Loading

lapcchan commented Mar 21, 2022 •

edited

Loading

cristicbz commented Apr 6, 2022 •

edited

Loading

Mokto commented May 7, 2022 •

edited

Loading

jenting commented Nov 11, 2022 •

edited

Loading

jenting commented Nov 17, 2022 •

edited

Loading

MalteMagnussen commented Feb 5, 2025 •

edited

Loading