
DisableDevice not working as expected #1146


Closed
saikat-royc opened this issue Feb 16, 2023 · 16 comments · Fixed by #1235
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@saikat-royc
Member

saikat-royc commented Feb 16, 2023

The target path /sys/block/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd/device/state that the DisableDevice function is attempting to write to does not exist.

I printed the path in my local setup PR:

I0216 06:18:40.987335       1 device-utils_linux.go:30] DisableDevice called with device path /dev/disk/by-id/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd device name google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd, target filepath /sys/block/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd/device/state
E0216 06:18:40.987444       1 node.go:379] Failed to disabled device /dev/disk/by-id/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd for volume projects/saikatroyc-test/zones/us-central1-c/disks/pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd. Device may not be detached cleanly (error is ignored and unstaging is continuing): open /sys/block/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd/device/state: no such file or directory

Logging into the node, the actual target path is /sys/block/sdb/device/state (for a SCSI device, for example).

saikatroyc@gke-test-pd-default-pool-bd595302-9p4h ~ $ sudo /lib/udev/scsi_id -g -d /dev/sdb
0Google  PersistentDisk  pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd
saikatroyc@gke-test-pd-default-pool-bd595302-9p4h ~ $ cat /sys/block/sdb/device/state
running

What is of interest is the /dev/sd* device derived in deviceFsPath here:

$ ls -l /dev/disk/by-id/* | grep pvc-843
lrwxrwxrwx 1 root root  9 Feb 16 06:20 /dev/disk/by-id/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd -> ../../sdb
lrwxrwxrwx 1 root root  9 Feb 16 06:20 /dev/disk/by-id/scsi-0Google_PersistentDisk_pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd -> ../../sdb
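
A likely direction for the fix is to resolve the /dev/disk/by-id symlink to the kernel device name (sdb here) before building the sysfs path. A minimal sketch of that idea in Go; the function name and layout are illustrative, not the driver's actual code:

// Sketch: derive /sys/block/<kernel-name>/device/state from a
// /dev/disk/by-id symlink by resolving the symlink first.
package main

import (
    "fmt"
    "path/filepath"
)

func sysfsStatePath(devicePath string) (string, error) {
    // /dev/disk/by-id/google-pvc-... -> /dev/sdb
    resolved, err := filepath.EvalSymlinks(devicePath)
    if err != nil {
        return "", fmt.Errorf("resolving %s: %w", devicePath, err)
    }
    // /dev/sdb -> sdb -> /sys/block/sdb/device/state
    return filepath.Join("/sys/block", filepath.Base(resolved), "device", "state"), nil
}

func main() {
    p, err := sysfsStatePath("/dev/disk/by-id/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd")
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println(p) // on the node above this would print /sys/block/sdb/device/state
}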
@mattcary
Contributor

Ah, good catch.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on May 18, 2023
@mattcary
Contributor

/remove-lifecycle stale

Oops, dropped this. I have a quick PR ready.

@k8s-ci-robot removed the lifecycle/stale label on May 18, 2023
@dguendisch

We are seeing the same issue (we are on 1.9.5 atm):

2023-06-27 00:15:46 | E0627 00:15:46.846784       1 node.go:379] Failed to disabled device /dev/disk/by-id/google-pv--860c5184-2736-4ea3-9f38-d819cde18a01 for volume projects/myproject/zones/asia-south1-a/disks/pv--860c5184-2736-4ea3-9f38-d819cde18a01. Device may not be detached cleanly (error is ignored and unstaging is continuing): open /sys/block/google-pv--860c5184-2736-4ea3-9f38-d819cde18a01/device/state: no such file or directory

Seems we also don't have these kinds of directories (/sys/block/google...) on our nodes.
What I don't understand, however: I see this issue happening only occasionally. Shouldn't it always result in the volume failing to detach from GCP, since these directories are never present?
(and btw, is there already a release containing #1235 ?)

@mattcary
Contributor

That's expected; we have not cherry-picked #1235 and have not cut a new release since it was merged.

The error message is benign. We had thought that disabling the device would prevent some race conditions, but on further investigation they don't come up. So the fact that the disabling isn't succeeding shouldn't cause any problems.

But I agree it's annoying --- k8s errors have enough noise that we shouldn't be adding to them :-)

I'll cherry-pick this back to 1.9 and it will get into releases in the next week or two (we're in the middle of pushing out patch releases and I don't know if the CPs will get into this cycle or the next one; there are some CVEs being fixed by updating golang versions / image bases that I don't want to delay).

@dguendisch

The error message is benign.

Ah, so you mean it might not be the cause of what we observe?

Ahem, I haven't actually described what we observe :)
We occasionally see volumes failing to attach to node X (where a pod using that volume is scheduled), because the volume is actually still attached to node Y, so GCP rightfully refuses to attach that volume to X.
However, there is no volumeattachment resource for this particular pv that would point to Y (only the "new" volumeattachment pointing to X), hence the csi-attacher has no reason to detach it from Y. The aforementioned error msg was the only error I could spot around the unmounting & detaching of this pv from Y.

@mattcary
Contributor

Yeah, I think what you're observing is unrelated to this error message. Sorry for the noise & confusion!

What is in node Y's volumesAttached at the time you see this error? (That's in the node status; you can see it with kubectl describe node.)

If there's no volumeattachment on Y, then the attacher should try to reconcile the attachment away, but only if it thinks that the disk is attached to the node. So there is a dance happening between the pv controller and the attacher.

Another thing to try is that if this situation comes up again, kill the csi-attacher container and let it restart. We've seen some cases with the provisioner where it looks like its informer cache gets stale (although that seems unlikely and we don't have a consistent reproduction). But if a new csi-attacher instance starts working correctly, a stale cache seems more likely.
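
For reference, node Y's .status.volumesAttached (the field the csi-attacher reconciles against) can also be read programmatically; a minimal client-go sketch, assuming in-cluster credentials and with "node-y" as a placeholder node name:

// Sketch: print a node's .status.volumesAttached. Illustrative only;
// assumes the process runs in-cluster with permission to read nodes.
package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // "node-y" is a placeholder for the node being inspected.
    node, err := client.CoreV1().Nodes().Get(context.TODO(), "node-y", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    for _, v := range node.Status.VolumesAttached {
        fmt.Printf("%s -> %s\n", v.Name, v.DevicePath)
    }
}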

@dguendisch

Happened again today:

If there's no volumeattachment on Y, then the attacher should try to reconcile the attachment away, but only if it thinks that the disk is attached to the node. So there is a dance happening between the pv controller and the attacher.

Again, no volumeattachment on Y (though the volume is still attached to Y according to gcloud); only one volumeattachment resource pointing to X (which of course fails due to the volume being in use).
Node Y has no mention of the volume in its .status.volumesAttached/volumesInUse.
Node X has the volume mentioned in .status.volumesInUse (and not in .status.volumesAttached, which seems correct as it's not attached to X).
So no volumeattachment on Y (maybe with some deletionTimestamp) + no ref to that volume from node Y's .status.volumesAttached/volumesInUse => no controller will ever detach this volume from Y.

So I guess the important question is how the "old" volumeattachment on Y could have ever been successfully deleted/finalized?

Another thing to try is that if this situation comes up again, kill the csi-attacher container and let it restart.

Just tried it and, as expected I guess, it didn't change anything, since there is no hint that this volume should be detached from Y.

I'll try to dig up more logs; I'd be happy if you had some ideas on how the old volumeattachment on Y could have vanished. Are there maybe some timeouts after which the detachment is finalized even though the detach operation never successfully finished?

@mattcary
Contributor

I believe the flow is: delete volumeattachment, csi-attacher notices a node.volumesAttached volume w/o a volumeattachment, and then detaches it.

So I think the weird part is how the volumesAttached got cleared without the volume actually being detached?

@dguendisch

Sorry for the initial confusion; of course you were right that the error msg I showed was not related (after all, it came from NodeUnstageVolume, which is related to mounting/unmounting, while my actual issue is about attaching/detaching, which would translate to ControllerPublish/UnpublishVolume 😅).
I finally figured the real cause and filed a dedicated issue: #1290
Thanks for your suggestions!

@pwschuurman
Contributor

Disabling the device is still not working as expected. I'm seeing the following error in logs:

Failed to disabled device /dev/disk/by-id/google-pvc-7d1b9d39-050f-4c80-9a63-961a703cff2f (aka /dev/sdb) for volume projects/psch-gke-dev/zones/us-central1-c/disks/pvc-7d1b9d39-050f-4c80-9a63-961a703cff2f. Device may not be detached cleanly (ignored, unstaging continues): %!w(*fs.PathError=&{write /sys/block/sdb/device/state 22})
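
Two separate things show up in that log line: the write itself failed with errno 22 (EINVAL), and the message is mangled because the %w verb is only understood by fmt.Errorf, while the log call formats it Printf-style. A small illustration of the formatting part (not the driver's code):

// Sketch: %w is only meaningful inside fmt.Errorf; Printf-style formatting
// renders it as %!w(...), which is the mangling seen in the log above.
package main

import (
    "fmt"
    "io/fs"
    "syscall"
)

func main() {
    // Reconstruct an error shaped like the one in the log; errno 22 is EINVAL.
    err := &fs.PathError{Op: "write", Path: "/sys/block/sdb/device/state", Err: syscall.EINVAL}

    fmt.Printf("bad:  %w\n", err)                      // bad:  %!w(*fs.PathError=&{write /sys/block/sdb/device/state 22})
    fmt.Printf("good: %v\n", err)                      // good: write /sys/block/sdb/device/state: invalid argument
    fmt.Println(fmt.Errorf("disable failed: %w", err)) // %w belongs in fmt.Errorf, where it wraps the error
}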

@pwschuurman
Contributor

It seems that disabling the device prevents subsequent NodeUnstage/NodeStage calls from working. This may require the device to be re-enabled.

Calling scsi_id on a device that has been disabled results in the following error:

scsi_id: cannot open /dev/sdh: No such device or address

Re-enabling the device by writing "running" to its state file works (see the sketch below). However, there is a problem: we use scsi_id to identify the serial, and the serial is used to test whether this is the device we're looking for, but we can't identify which device to re-enable without calling scsi_id, which fails on a disabled device. This leads to a few possible options:

  1. Disable the device through some other means.
  2. Look up the serial some alternate way that does not require the device to be enabled.
  3. Drop this logic and rely on the filesystem check: Add support for checking if a device is being used by a filesystem #1658

(3) is the only approach if (1) and (2) aren't possible.
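
For illustration, re-enabling amounts to writing "running" back to the same sysfs state file that disabling writes to; a minimal sketch, assuming the kernel device name (e.g. sdb) is already known, which is exactly the identification problem described above:

// Sketch: re-enable a SCSI device that was taken offline by writing
// "running" to /sys/block/<dev>/device/state. Illustrative only; it
// assumes the caller already knows the kernel device name.
package main

import (
    "fmt"
    "os"
    "path/filepath"
)

func enableDevice(kernelName string) error {
    statePath := filepath.Join("/sys/block", kernelName, "device", "state")
    // The mode argument is irrelevant for an existing sysfs file.
    if err := os.WriteFile(statePath, []byte("running\n"), 0644); err != nil {
        return fmt.Errorf("re-enabling %s: %w", kernelName, err)
    }
    return nil
}

func main() {
    if err := enableDevice("sdb"); err != nil {
        fmt.Println(err)
    }
}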

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jul 15, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Aug 14, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned on Sep 13, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
