
DisableDevice not working as expected #1146


Closed
saikat-royc opened this issue Feb 16, 2023 · 16 comments · Fixed by #1235
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@saikat-royc
Member

saikat-royc commented Feb 16, 2023

The target path /sys/block/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd/device/state that the DisableDevice function is attempting to write to does not exist.

I printed the path in my local setup PR:

I0216 06:18:40.987335       1 device-utils_linux.go:30] DisableDevice called with device path /dev/disk/by-id/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd device name google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd, target filepath /sys/block/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd/device/state
E0216 06:18:40.987444       1 node.go:379] Failed to disabled device /dev/disk/by-id/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd for volume projects/saikatroyc-test/zones/us-central1-c/disks/pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd. Device may not be detached cleanly (error is ignored and unstaging is continuing): open /sys/block/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd/device/state: no such file or directory

Logging into the node, the actual target path is /sys/block/sdb/device/state (for a SCSI device, for example).

saikatroyc@gke-test-pd-default-pool-bd595302-9p4h ~ $ sudo /lib/udev/scsi_id -g -d /dev/sdb
0Google  PersistentDisk  pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd
saikatroyc@gke-test-pd-default-pool-bd595302-9p4h ~ $ cat /sys/block/sdb/device/state
running

What is of interest is the /dev/sd* device derived in deviceFsPath here:

$ ls -l /dev/disk/by-id/* | grep pvc-843
lrwxrwxrwx 1 root root  9 Feb 16 06:20 /dev/disk/by-id/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd -> ../../sdb
lrwxrwxrwx 1 root root  9 Feb 16 06:20 /dev/disk/by-id/scsi-0Google_PersistentDisk_pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd -> ../../sdb
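
A likely direction for the fix is to resolve the /dev/disk/by-id symlink to the kernel device name (sdb here) before building the sysfs path. A minimal sketch of that idea in Go; the function name and layout are illustrative, not the driver's actual code:

// Sketch: derive /sys/block/<kernel-name>/device/state from a
// /dev/disk/by-id symlink by resolving the symlink first.
package main

import (
    "fmt"
    "path/filepath"
)

func sysfsStatePath(devicePath string) (string, error) {
    // /dev/disk/by-id/google-pvc-... -> /dev/sdb
    resolved, err := filepath.EvalSymlinks(devicePath)
    if err != nil {
        return "", fmt.Errorf("resolving %s: %w", devicePath, err)
    }
    // /dev/sdb -> sdb -> /sys/block/sdb/device/state
    return filepath.Join("/sys/block", filepath.Base(resolved), "device", "state"), nil
}

func main() {
    p, err := sysfsStatePath("/dev/disk/by-id/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd")
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println(p) // on the node above this would print /sys/block/sdb/device/state
}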
@mattcary
Contributor

Ah, good catch.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on May 18, 2023
@mattcary
Contributor

/remove-lifecycle stale

Oops, dropped this. I have a quick PR ready.

@k8s-ci-robot removed the lifecycle/stale label on May 18, 2023
@dguendisch

We are seeing the same issue (we are on 1.9.5 atm):

2023-06-27 00:15:46 | E0627 00:15:46.846784       1 node.go:379] Failed to disabled device /dev/disk/by-id/google-pv--860c5184-2736-4ea3-9f38-d819cde18a01 for volume projects/myproject/zones/asia-south1-a/disks/pv--860c5184-2736-4ea3-9f38-d819cde18a01. Device may not be detached cleanly (error is ignored and unstaging is continuing): open /sys/block/google-pv--860c5184-2736-4ea3-9f38-d819cde18a01/device/state: no such file or directory

Seems we also don't have these kinds of directories (/sys/block/google...) on our nodes.
What I don't understand, however: I see this issue happening only occasionally. Shouldn't it always result in the volume failing to detach from GCP, since these directories are never present?
(and btw, is there already a release containing #1235 ?)

@mattcary
Contributor

That's expected; we have not cherry-picked #1235 and have not cut a new release since it was merged.

The error message is benign. We had thought that disabling the device would prevent some race conditions, but on further investigation they don't come up. So the fact that the disabling isn't succeeding shouldn't cause any problems.

But I agree it's annoying --- k8s errors have enough noise that we shouldn't be adding to them :-)

I'll cherry-pick this back to 1.9 and it will get into releases in the next week or two (we're in the middle of pushing out patch releases and I don't know if the CPs will get into this cycle or the next one; there are some CVEs being fixed by updating golang versions / image bases that I don't want to delay).

@dguendisch

The error message is benign.

Ah, so you mean it might not be the cause of what we observe?

Ahem, I haven't actually described what we observe :)
We occasionally see volumes failing to attach to node X (where a pod using that volume is scheduled), because the volume is actually still attached to node Y, so GCP rightfully refuses to attach that volume to X.
However, there is no volumeattachment resource for this particular pv that would point to Y (only the "new" volumeattachment pointing to X), hence the csi-attacher has no reason to detach it from Y. The aforementioned error msg was the only error I could spot around the unmounting & detaching of this pv from Y.

@mattcary
Contributor

Yeah, I think what you're observing is unrelated to this error message. Sorry for the noise & confusion!

What is in node Y's volumesAttached at the time you see this error? (That's in the node status; you can see it with kubectl describe node.)

If there's no volumeattachment on Y, then the attacher should try to reconcile the attachment away, but only if it thinks that the disk is attached to the node. So there is a dance happening between the pv controller and the attacher.

Another thing to try is that if this situation comes up again, kill the csi-attacher container and let it restart. We've seen some cases with the provisioner where it looks like its informer cache gets stale (although that seems unlikely and we don't have a consistent reproduction). But if a new csi-attacher instance starts working correctly, a stale cache seems more likely.
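
For reference, node Y's .status.volumesAttached (the field the csi-attacher reconciles against) can also be read programmatically; a minimal client-go sketch, assuming in-cluster credentials and with "node-y" as a placeholder node name:

// Sketch: print a node's .status.volumesAttached. Illustrative only;
// assumes the process runs in-cluster with permission to read nodes.
package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // "node-y" is a placeholder for the node being inspected.
    node, err := client.CoreV1().Nodes().Get(context.TODO(), "node-y", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    for _, v := range node.Status.VolumesAttached {
        fmt.Printf("%s -> %s\n", v.Name, v.DevicePath)
    }
}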

@dguendisch

Happened again today:

If there's no volumeattachment on Y, then the attacher should try to reconcile the attachment away, but only if it thinks that the disk is attached to the node. So there is a dance happening between the pv controller and the attacher.

Again, no volumeattachment on Y (though the volume is still attached to Y according to gcloud); only one volumeattachment resource pointing to X (which of course fails due to the volume being in use).
Node Y has no mention of the volume in its .status.volumesAttached/volumesInUse.
Node X has the volume mentioned in .status.volumesInUse (and not in .status.volumesAttached, which seems correct as it's not attached to X).
So no volumeattachment on Y (maybe with some deletionTimestamp) + no ref to that volume from node Y's .status.volumesAttached/volumesInUse => no controller will ever detach this volume from Y.

So I guess the important question is how the "old" volumeattachment on Y could have ever been successfully deleted/finalized?

Another thing to try is that if this situation comes up again, kill the csi-attacher container and let it restart.

Just tried it and, as expected I guess, it didn't change anything, since there is no hint that this volume should be detached from Y.

I'll try to dig up more logs; I'd be happy if you had some ideas on how the old volumeattachment on Y could have vanished. Are there maybe some timeouts after which the detachment is finalized even though the detach operation never successfully finished?

@mattcary
Contributor

I believe the flow is: delete volumeattachment, csi-attacher notices a node.volumesAttached volume w/o a volumeattachment, and then detaches it.

So I think the weird part is how the volumesAttached got cleared without the volume actually being detached?

@dguendisch

Sorry for the initial confusion; of course you were right that the error msg I showed was not related (after all, it came from NodeUnstageVolume, which is related to mounting/unmounting, while my actual issue is about attaching/detaching, which would translate to ControllerPublish/UnpublishVolume 😅).
I finally figured the real cause and filed a dedicated issue: #1290
Thanks for your suggestions!

@pwschuurman
Contributor

Disabling the device is still not working as expected. I'm seeing the following error in logs:

Failed to disabled device /dev/disk/by-id/google-pvc-7d1b9d39-050f-4c80-9a63-961a703cff2f (aka /dev/sdb) for volume projects/psch-gke-dev/zones/us-central1-c/disks/pvc-7d1b9d39-050f-4c80-9a63-961a703cff2f. Device may not be detached cleanly (ignored, unstaging continues): %!w(*fs.PathError=&{write /sys/block/sdb/device/state 22})
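
Two separate things show up in that log line: the write itself failed with errno 22 (EINVAL), and the message is mangled because the %w verb is only understood by fmt.Errorf, while the log call formats it Printf-style. A small illustration of the formatting part (not the driver's code):

// Sketch: %w is only meaningful inside fmt.Errorf; Printf-style formatting
// renders it as %!w(...), which is the mangling seen in the log above.
package main

import (
    "fmt"
    "io/fs"
    "syscall"
)

func main() {
    // Reconstruct an error shaped like the one in the log; errno 22 is EINVAL.
    err := &fs.PathError{Op: "write", Path: "/sys/block/sdb/device/state", Err: syscall.EINVAL}

    fmt.Printf("bad:  %w\n", err)                      // bad:  %!w(*fs.PathError=&{write /sys/block/sdb/device/state 22})
    fmt.Printf("good: %v\n", err)                      // good: write /sys/block/sdb/device/state: invalid argument
    fmt.Println(fmt.Errorf("disable failed: %w", err)) // %w belongs in fmt.Errorf, where it wraps the error
}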

@pwschuurman
Contributor

It seems that disabling the device prevents subsequent NodeUnstage/NodeStage calls from working. This may require the device to be re-enabled.

Calling scsi_id on a device that has been disabled results in the following error:

scsi_id: cannot open /dev/sdh: No such device or address

Re-enabling the device by writing "running" to its state file works (see the sketch below). However, there is a problem: we use scsi_id to identify the serial, and the serial is used to test whether this is the device we're looking for, but we can't identify which device to re-enable without calling scsi_id, which fails on a disabled device. This leads to a few possible options:

  1. Disable the device through some other means.
  2. Look up the serial some alternate way that does not require the device to be enabled.
  3. Drop this logic and rely on the filesystem check: Add support for checking if a device is being used by a filesystem #1658

(3) is the only approach if (1) and (2) aren't possible.
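
For illustration, re-enabling amounts to writing "running" back to the same sysfs state file that disabling writes to; a minimal sketch, assuming the kernel device name (e.g. sdb) is already known, which is exactly the identification problem described above:

// Sketch: re-enable a SCSI device that was taken offline by writing
// "running" to /sys/block/<dev>/device/state. Illustrative only; it
// assumes the caller already knows the kernel device name.
package main

import (
    "fmt"
    "os"
    "path/filepath"
)

func enableDevice(kernelName string) error {
    statePath := filepath.Join("/sys/block", kernelName, "device", "state")
    // The mode argument is irrelevant for an existing sysfs file.
    if err := os.WriteFile(statePath, []byte("running\n"), 0644); err != nil {
        return fmt.Errorf("re-enabling %s: %w", kernelName, err)
    }
    return nil
}

func main() {
    if err := enableDevice("sdb"); err != nil {
        fmt.Println(err)
    }
}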

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jul 15, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Aug 14, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned on Sep 13, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
